We have all experienced how annoying it is when a website is unavailable: you cannot book your travel, stream your media or purchase an item. We all expect 24×7 cover and rely completely on services being always available.
For a service provider, however, this can be difficult to achieve. I have worked in this space for almost ten years, delivering a large number of services to the University of Edinburgh, including top priority services, and it has been an ongoing challenge. In this blog I would like to share some of my experience and solutions.
Infrastructure monitoring
The traditional approach taken by technical teams delivering services is to implement monitoring for all components a service relies on. A traditional web-based application normally uses the following components:
- Web server
- Database server
- Possibly a file server storing files
The obvious technical monitoring may cover:
- Server monitoring (CPU, RAM, storage)
- Ping servers
- Database monitoring (blocks/locks, Worst SQL, …)
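As a rough illustration of what such bottom-up checks look like, here is a minimal Python sketch using only the standard library. The function names and thresholds are hypothetical, and the "ping" is approximated by opening a TCP connection, because a true ICMP ping usually needs elevated privileges. It is not how our monitoring is actually implemented.

```python
import shutil
import socket


def disk_usage_percent(path: str = ".") -> float:
    """Return the percentage of storage used on the volume holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Stand-in for a ping: can we open a TCP connection to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def breached(metrics: dict, limits: dict) -> list:
    """Return the names of metrics that exceed their limit, sorted by name."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in limits and value > limits[name]
    )
```

A scheduler could collect these values for each server and raise an alert for whatever `breached` returns, for example `breached({"disk": disk_usage_percent()}, {"disk": 80.0})`.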
While this technical monitoring delivers useful information for technical staff investigating the root causes of availability problems, it will most likely not provide a reliable way to spot when users experience service availability issues. Of course, if a server no longer responds to requests, the technical monitoring will spot this and probably mark the service as unavailable. But services are much more complex, and many more components may cause unavailability for end users. For example, we use a load balancer in front of the web server, an authentication step may be involved which requires other infrastructure, DNS or a firewall is likely involved, and so on.
The bottom-up approach of technical monitoring will rarely be able to predict the user experience.
User journey monitoring
The opposite of the technical bottom-up approach is the top-down approach of recreating and monitoring the user journey. If we regularly run a given user journey and check that it completes without issues, we should be able to establish whether a service is up or down from an end user's perspective.
There are a few things we need to be aware of when choosing a user journey to monitor:
- Most applications provide many user journeys, and it may be difficult and costly to monitor each one. Choose one or a few user journeys that are a good representation of typical use.
- A user journey should involve all major technical components of an application. For example, if an application uses a web tier and a database tier, both should be involved in the user journey we monitor.
- Components which are not part of the application itself but which users need in order to access it should be taken into account, for example a single sign-on solution.
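To make these points concrete, a journey can be written down as an ordered list of steps, each with a piece of text that proves the step really worked. The sketch below is only an illustration: the `Step` class, the `run_journey` function, the example.org URLs and the expected phrases are all made up, and the page-fetching function is passed in so that it can carry a logged-in session.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str          # label shown in alerts
    url: str           # page requested in this step
    must_contain: str  # text that proves the step really worked


def run_journey(steps: List[Step], fetch: Callable[[str], str]) -> Optional[str]:
    """Run the steps in order; return the first failing step's name, or None."""
    for step in steps:
        try:
            body = fetch(step.url)
        except Exception:
            # Network error, timeout, HTTP error, and so on.
            return step.name
        if step.must_contain not in body:
            return step.name
    return None


# Hypothetical journey touching single sign-on, the web tier and the database tier.
JOURNEY = [
    Step("sign in", "https://app.example.org/login", "Signed in as"),
    Step("home page", "https://app.example.org/home", "My courses"),
    Step("database-backed page", "https://app.example.org/courses/1", "Course details"),
]
```

Stopping at the first failing step keeps the alert specific: the name that comes back tells you how far a user would have got before hitting the problem.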
Setting up user journey monitoring
Apart from the considerations mentioned above, there are a few more items to think about when setting up user-journey-based monitoring:
- Frequency of checks. Ideally the user journey is run frequently so that issues are picked up as quickly as possible. There may be costs involved, and too high a frequency could be expensive. We monitor every 10 minutes.
- An account will have to be used to perform the user journey checks. Choose an account which represents a typical user and ensure it is set up with realistic user details. You should not use a real user's account, as this will likely breach user privacy.
- Ensure you are able to check that the different steps in the user journey return the pages you expect. For a web application, just checking that HTML is returned is likely not sufficient; for example, an internal server error may still return valid HTML. You may have to check that specific words are present or absent.
- Set reasonable timeouts and response times. Each step in your journey will take a certain time to process, and this will differ from application to application. Response time thresholds should be set so that they do not create unnecessary alerts but are still sensitive enough to pick up when the service is performing poorly.
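The last two points can be sketched in a few lines of Python. This is an illustration only, using the standard library rather than the tool we actually use; the function names, phrases and thresholds are assumptions you would replace with values that suit your own application.

```python
import time
import urllib.error
import urllib.request
from typing import List, Tuple


def timed_fetch(url: str, timeout: float = 10.0) -> Tuple[int, str, float]:
    """Fetch a page; return (HTTP status, body text, seconds taken).

    Timeouts and connection failures are not caught here and reach the caller.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a body worth inspecting.
        status = err.code
        body = err.read().decode("utf-8", errors="replace")
    return status, body, time.monotonic() - start


def evaluate_step(
    status: int,
    body: str,
    seconds: float,
    required: List[str],
    forbidden: List[str],
    max_seconds: float,
) -> List[str]:
    """Return a list of problems found in one step (empty means healthy)."""
    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    for phrase in required:
        if phrase not in body:
            problems.append(f"missing text: {phrase}")
    for phrase in forbidden:
        if phrase in body:
            problems.append(f"unwanted text: {phrase}")
    if seconds > max_seconds:
        problems.append(f"slow response: {seconds:.1f}s")
    return problems
```

Keeping the evaluation separate from the fetch makes it easy to tune the phrases and the response time threshold per step without touching the network code.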
How we implemented user journeys
There are many tools from different vendors; some are free and some are paid for. I will not review tools in this blog. We use site24x7.com, but there are many other tools which provide excellent user journey monitoring.
We monitor around 80 services with mainly one typical user journey for each service. All our top priority services are monitored and many of the medium and lower priority services are monitored as well.
Dashboards
On top of the out-of-the-box functionality site24x7.com provides, we created a few dashboards which help us manage our services:
Top priority availability page. This page shows the University’s top priority service status and is integrated with the Information Services status and alerts:
http://www.ed.ac.uk/information-services/services/status-alerts
Service availability dashboard. This is mainly used by Production Management’s operator to alert on service issues, so that we can be proactive and pick up issues before users do.
I will go into more detail about what we monitor and what reports we get in a later blog.