Reducing MTTR is not enough

Relationship between MTTR and MATR.

Reducing MTTR is a hot topic among DevOps practitioners. MTTR measures the average time for one full cycle: problem occurrence, detection, response, and repair. Reducing MTTR should greatly improve service quality, right? Well, not exactly. The metric we should be looking at is the time available for repair given a particular class of problem, which I call MATR (maximum available time to repair). I define MATR as the maximum time a problem can be allowed to persist before it impacts service quality in any measurable way.

Let’s examine a few oversimplified examples:

Example 1: A farm of 100 web servers, and 1 goes down. There may be no measurable impact on service quality, so MATR can be fairly long.

Example 2: A farm of 100 web servers connected to a single DB server. When the DB server suffers a performance slowdown, all 100 web servers and their end users are affected, so MATR is close to 0. An incremental reduction in MTTR does not avoid service disruption in any meaningful way.

What we should be asking is this: given a service or application A and a class of problem P, what is its MATR? Mathematically, we can write it as a function MATR(A, P), where A is the architecture or service and P is the problem class. In a perfect world we want MTTR(A, P) <= MATR(A, P); anything else we would call a CRISIS.

A crisis occurs when MTTR(A, P) > MATR(A, P), in which case the given (A, P) pair impacts service quality in a meaningful way. Time in crisis is defined as MTTR - MATR, where MTTR > MATR. The upshot is that we now have 2 dials: we can avoid a crisis by reducing MTTR, increasing MATR, or both. MTTR is the dimension most affected by Ops teams. Ops can reduce MTTR by investing in tools, automation, and process, and by developing experience and expertise.
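The definition of time in crisis can be sketched in a few lines of Python. This is a minimal illustration, not a measurement tool: the function name time_in_crisis and the sample figures (in minutes) are assumptions made up for the example.

```python
def time_in_crisis(mttr: float, matr: float) -> float:
    """Return how long a problem affects service quality.

    Both arguments must use the same time unit. The result is
    MTTR - MATR when MTTR exceeds MATR, and 0.0 otherwise (no crisis).
    """
    return max(0.0, float(mttr - matr))


# Hypothetical figures, in minutes.
print(time_in_crisis(mttr=45, matr=240))  # 0.0  (repair finishes well inside MATR)
print(time_in_crisis(mttr=45, matr=1))    # 44.0 (MATR is thin, so almost all of MTTR is crisis)
```

With the same 45-minute MTTR, the outcome depends entirely on MATR, which is the point of having two dials.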

How does one increase MATR? The answer is: by building resilient, elastic architectures. MATR is the dimension most affected by Dev teams. There is a theoretical limit below which MTTR can’t go: MTTR always measures a cycle of failure occurrence, detection, response, and repair, and therefore can never be zero. It is also possible that MATR for certain types of application is always very close to 0, meaning every problem turns into a crisis no matter how much MTTR is reduced.

Any incremental reduction in MTTR requires a significant investment in tools, people, process, or all three.

Focusing only on MTTR reduction can be futile and very expensive in environments where MATR(A, P) is thin, meaning almost every problem evolves into a crisis. The important point is that MATR has no theoretical upper limit: the higher it is, the wider the margin by which MATR exceeds MTTR and the less likely any problem is to turn into a crisis. What we should strive for is to widen that margin. The bigger it is, the more time DevOps teams have to respond to a problem and the less impact any problem has on service quality.
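To make the comparison concrete, here is a rough sketch of what each dial does to the gap when MATR is thin. The helper name margin and all of the numbers (in minutes) are invented for illustration; real values would have to come from your own incident data.

```python
def margin(mttr: float, matr: float) -> float:
    """Gap between MATR and MTTR; a negative value means the problem becomes a crisis."""
    return matr - mttr


# Hypothetical baseline: 60 minutes to repair, 1 minute of tolerance.
baseline = margin(mttr=60.0, matr=1.0)

# Dial 1: Ops halves MTTR. The margin is still negative, so it is still a crisis.
halved_mttr = margin(mttr=30.0, matr=1.0)

# Dial 2: a more resilient architecture raises MATR to 2 hours. The margin turns positive.
raised_matr = margin(mttr=60.0, matr=120.0)

print(baseline, halved_mttr, raised_matr)  # -59.0 -29.0 60.0
```

In this made-up case, halving MTTR shortens the crisis but does not avoid it, while raising MATR avoids it without touching MTTR at all.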

I’ve yet to meet Ops engineers who enjoy waking up at 3am to work on a production problem. They do this every time MTTR exceeds MATR and will continue to do so until the inequality reverses.

Bottom line: the biggest focus for DevOps teams should be on building and operating resilient, elastic architectures that embrace and tolerate failures. Such architectures inflate MATR(A, P), and that alone reduces the impact that any failure, and the MTTR that follows it, has on service quality.