DevOps Metrics and Measurement

Overview

Your organization is now committed to DevOps methodology. You’ve integrated development, operations, and QA, and you have moved drastically away from waterfall and into a fire hose of application development. How do you find out how well it works? How do you know if it’s working at all? Many net new DevOps organizations are surprised to find that it was easier to start then they thought, but hard to keep alive. This can be because of menacing habits, or just because the team is too far into the weeds. In a word, you need metrics.

Read More …

You need metrics not just as a way to measure the success (or lack of success) of your DevOps program, but also as a way to find out how it can be improved, modified, or extended. Without metrics, you’re flying blind. With metrics, you have a holistic point of view, you know where you are, where you’re going, where you can go, and how to get there. But I am not talking about analytics tools which measure the activities withing the pipeline. I’m talking about measuring the pipeline itself.

What kinds of factors do DevOps metrics measure? For the most part, they measure such things as the speed of development, deployment, and customer response, frequency of deployments and failures, repair time, volume of repair request, and the rate of change in these indicators. The focus of DevOps metrics tends to be on deployment, operations, and support (as opposed to design and early-stage development, for example), since most of the ongoing effort associated with DevOps is in these areas.

Read Less …

Categories

You’ve probably come to realize that DevOps is treated differently from company to company. That’s partly because of its ability to scale for both large and small organizations. But there are a couple of best practices that can be applied universally:

People

People are an intrinsic part of any DevOps process. People-oriented metrics measure such things as turnover, capability, and response time. Always start with people. They are the hardest element of any element, and their influence is sometimes hard to spot.

Process

In some ways, DevOps is all about process — the continual deployment/operations/support cycle is an ongoing suite of interwoven processes. But some metrics are more clearly process-oriented than others, particularly those involving continuous delivery, response, and repair. Development-to-deployment lead time, for example, is a largely process-oriented metric, as are deployment frequency and response time. Process metrics can be a measure of speed (Where are the bottlenecks, and is the process itself a bottleneck?), appropriateness (Are all steps relevant?), effectiveness (Does it get the job done?), or efficiency (Are the steps in the optimum sequence? is there a smooth flow within the process?).

Technology

Technology metrics also play a major role in DevOps, measuring such things as uptime (What percentage of the time is the system running? What about the network, and support applications?) and failure rate (What is the percentage of failed deployments, changes, or units?).

Types of Metrics

DevOps is all about continuous delivery and shipping code as fast as possible. You want to move fast and not break things. By tracking these DevOps metrics, you can evaluate just how fast you can move before you start breaking things.

Deployment frequency

Tracking how often you do deployments is a good DevOps metric. Ultimately, the goal is to do more smaller deployments as often as possible. Reducing the size of deployments makes it easier to test and release.

I would suggest counting both production and non-production deployments separately. How often you deploy to QA or pre-production environments is also important. You need to deploy early and often in QA to ensure time for testing. Finding bugs in QA is important to keep your defect escape rate down.

Deployment time

This might seem like a weird one, but tracking how long it takes to do an actual deployment is another good metric. One of our applications at Stackify is deployed with Azure worker roles and it takes about an hour to deploy. It is a nightmare. Tracking such things could help identify potential problems. It is much easier to deploy more often when the task of actually doing it is quick.

Lead time

If the goal is shipping code quickly, this is a really key DevOps metric. I would define lead time as the amount of time that occurs between starting on a work item until it is deployed. This helps you know that if you started on a new work item today, how long would it take on average until it gets to production. This is also a good metric to help with BizDevOps.

Customer tickets

The best and worst indicator of application problems is customer support tickets and feedback. The last thing you want is for your users to find bugs or have problems with your software. Because of this, they also make a good indicator of application quality and performance problems.

Automated tests pass %

To increase velocity, it is highly recommended that your team makes extensive usage of unit and functional testing. Since DevOps relies heavily on automation, tracking how well your automated tests work is a good DevOps metrics. It is good to know how often code changes are causing your tests to break.

Defect escape rate

Do you know how many software defects are being found in production versus QA? If you want to ship code fast, you need to have confidence that you can find software defects before they get to production. Your defect escape rate is a great DevOps metric to track how often those defects make it to production.

Availability

The last thing you ever want is for your application to be down. Depending on your type of application and how you deploy it, you may have a little downtime as part of scheduled maintenance. I would suggest tracking that and all unplanned outages.

Service level agreements

Most companies have some service level agreement (SLA) that they operate with. It is also important that you track your compliance with your SLAs. Even if there are no formal SLA, there probably are application requirements or expectations to be achieved.

Failed deployments

We all hope this never happens, but how often do your deployments cause an outage or major issues for your users? Reversing a failed deployment is something we never want to do, but it is something you should always plan for. If you have issues with failed deployments, be sure to track this metric over time. This could also be seen as tracking mean time to failure (MTTF).

Tracking error rates within your application

Tracking error rates within your application is super important. Not only are they an indicator of quality problems, but also ongoing performance and uptime related issues. Good exception handling best practices are critical for good software.

Bugs – Identify new exceptions being thrown in your code after a deployment
Production issues – Capture issues with database connections, query timeouts, and other related issues.
Errors are a fact of life for most applications. At Stackify, we process millions of messages an hour across a couple hundred servers and over a thousand SQL databases. A few errors here and there are just part of the noise of a busy system. It is important that you keep a pulse on your error rates and look for spikes.

Application usage & traffic

After a deployment, you want to see if the amount of transactions or users accessing your system looks normal. If you suddenly have no traffic or a giant spike in traffic, something could be wrong.

The last thing you ever want to see is no traffic at all. You could also see a spike in traffic if you are using microservices and one of your applications is causing a lot more traffic all of a sudden.

Application performance

Before you even do a deployment, you should use a tool like Retrace to look for performance problems, hidden errors, and other issues. During and after the deployment, you should also look for any changes in overall application performance.

It might be common after a deployment to see major changes in the usage of specific SQL queries, web service calls, and other application dependencies. Tools like Retrace can provide valuable visualizations like this one below that helps make it easy to spot problems.

Mean time to detection (MTTD)

When problems do happen, it is important that you identify them quickly. The last thing you want is to have a major partial or broad system outage and not know about it. Having robust application monitoring and good coverage in place will help you detect issues quickly. Once you detect them, you also have to fix them quickly!

Mean time to recovery (MTTR)

This metric helps you track how long it takes to recover from failures. A key metric for the business is keeping failures to a minimum and being able to recover from them quickly. It is typically measured in hours and may refer to business hours, not clock hours.

Having good application monitoring tools in place to quickly identify issues and quickly deploy the fix is important to reducing your MTTR.

Application metrics

Beyond the DevOps metrics listed above, there are dozens of other metrics you can track that are specific to your applications. Most of them are not necessarily relevant to DevOps in regards to deploying your application. However, they are very critical for monitoring the usage and performance of your applications in production.

For example, at Stackify, we use custom metrics to track how many log messages are received via our API per minute. This is a critical metric that helps us understand the volume of data flowing through our system. Depending on your application, you may have similar custom metrics that are critical to your application.