↓ Archives ↓

Category → Slides

MTTR is more important than MTBF (for most types of F)

This week I gave a talk at QCon SF about development and operations cooperation at Etsy and Flickr.  It’s a refresh of talks I’ve given in the past, with more detail about how it’s going at Etsy. (It’s going excellently :) )

There’s a bunch of topics in the presentation slides, all centered around roles, responsibilities, and intersection points of domain expertise commonly found in development and operations teams. One of the not-groundbreaking ideas that I’m finally getting down is something that should be evident for anyone practicing or interested in ‘continuous deployment’:

Being able to recover quickly from failure is more important than having failures less often.

This has what should be an obvious caveat: some types of failures shouldn’t ever happen, and not all failures/degradations/outages are the same. (like failures resulting in accidental data loss, for example)

Put another way:

MTTR is more important than MTBF

(for most types of F)

(Edited: I did say originally “MTTR > MTBF”)

What I’m definitely not saying is that failure should be an acceptable condition. I’m positing that since failure will happen, it’s just as important (or in some cases more important) to spend time and energy on your response to failure than trying to prevent it. I agree with Hammond, when he said:

If you think you can prevent failure, then you aren’t developing your ability to respond.

In a complete steal of Artur Bergman‘s material, an example in the slides of the talk is of the Jeep versus Rolls Royce:

Jeep versus Rolls Artur has a Jeep, and he’s right when he says that for the most part, Jeeps are built with optimizing Mean-Time-To-Repair, not the classical approach to automotive engineering, which is to optimize Mean-Time-Between-Failures. This is likely because Jeep owners have been beating the shit out of their vehicles for decades, and every now and again, they expect that abuse to break something. Jeep designers know this, which is why it’s so damn easy to repair. Nuts and bolts are easy to reach, tools are included when you buy the thing, and if you haven’t seen the video of Army personnel disassembling and reassembling a Jeep in under 4 minutes, you’re missing out.

The Rolls Royce, on the other hand, likely don’t have such adventurous owners, and when it does break down, it’s a fine and acceptable thing for the car to be out of service for a long and expensive fixing by the manufacturer.

We as web operations folks want our architectures to be built optimized for MTTR, not for MTBF. I think that the reasons should be obvious, and the fact that practices like:

  • Dark launching
  • Percentage-based production A/B rollouts
  • Feature flags

are becoming commonplace should verify this approach as having legs.

The slides from QConSF are here:

Go or No-Go: Operability and Contingency Planning (Surge)

Last month I had the honor of speaking at the Surge Conference in Baltimore, put together by OmniTI.

It was a most excellent conference, and the expertise levels were ridiculously high. I count myself lucky to be considered the same league as the rest of the presenters. I did give a Keynote talk, and I haven’t uploaded those slides yet. The talk I gave on the second day of the conference was about how we plan for feature launches at Etsy, which follows a similar pattern we had at Flickr.

So, here are the slides for that talk: