
The Infinite Hows (or, the Dangers Of The Five Whys)

(this is also posted on O’Reilly’s Radar blog)

Before I begin this post, let me say that this is intended to be a critique of the Five Whys method, not a criticism of the people who are in favor of using it.

This critique I present is hardly original; most of this post is inspired by Todd Conklin, Sidney Dekker, and Nancy Leveson.

The concept of post-hoc explanation (or “postmortems” as they’re commonly known) has, at this point, taken hold in the web engineering and operations domain. I’d love to think that the concepts that we’ve taken from the New View on ‘human error’ are becoming more widely known and that people are looking to explore their own narratives through those lenses.

I think that this is good, because my intent has always been (and might always be) to help translate concepts from one domain to another. In order to do this effectively, we also need to know what to discard (or at least inspect critically) from those other domains.

The Five Whys is such an approach that I think we should discard.

This post explains my reasoning for discarding it, and how using it has the potential to be harmful, not helpful, to an organization. Here’s how I intend on doing this: I’m first going to talk about what I think are deficiencies in the approach, suggest an alternative, and then ask you to simply try the alternative yourself.

Here is the “bottom line, up front” gist of my assertions:

“Why?” is the wrong question.

In order to learn (which should be the goal of any retrospective or post-hoc investigation) you want multiple and diverse perspectives. You get these by asking people for their own narratives. Effectively, you’re asking “how?”

Asking “why?” too easily gets you to an answer to the question “who?” (which in almost every case is irrelevant) or “takes you to the ‘mysterious’ incentives and motivations people bring into the workplace.”

Asking “how?” gets you to describe (at least some) of the conditions that allowed an event to take place, and provides rich operational data.

Asking a chain of “why?” assumes too much about the questioner’s choices, and assumes too much about each answer you get. At best, it locks you into a causal chain, which is not how the world actually works. This is a construction that ignores a huge amount of complexity in an event, and it’s the complexity that we want to explore if we have any hope of learning anything.

But It’s A Great Way To Get People Started!

The most compelling argument for using the Five Whys is that it’s a good first step towards doing real “root cause analysis” – my response to that is twofold:

  1. “Root Cause Analysis*” isn’t what you should be doing anyway, and
  2. It’s only a good “first step” because it’s easy to explain and understand, which makes it easy to socialize. The issue with this is that the concepts that the Five Whys depend on are not only faulty, but can be dangerous for an organization to embrace.

If the goal is learning (and it should be), then the method of retrospective learning we use should give us confidence that it brings to light data that can be turned into actionable information. The issue with the Five Whys is that it’s tunnel-visioned into a linear and simplistic explanation of how work gets done and events transpire. This narrowing can be incredibly problematic.

In the best case, it can lead an organization to think they’re improving on something (or preventing future occurrences of events) when they’re not.

In the worst case, it can re-affirm a faulty worldview of causal simplification and set up a structure where individuals don’t feel safe in giving their narratives, because either they weren’t asked the right “why?” question or because the answer to a given question pointed to ‘human error’ or individual attributes as causal.

Let’s take an example from my tutorials at the Velocity Conference in New York, where I used an often-repeated straw man to illustrate this:

[Slide: the oft-repeated straw-man Five Whys example]

This is the example of the Five Whys found in the Web Operations book, as well.

This causal chain effectively ends with a person’s individual attributes, not with a description of the multiple conditions that allow an event like this to happen. Let’s look into some of the answers…

“Why did the server fail? Because an obscure subsystem was used in the wrong way.”

This answer is dependent on the outcome. We know that it was used in the “wrong” way only because we’ve connected it to the resulting failure. In other words, we as “investigators” have the benefit of hindsight. We can easily judge the usage of the server because we know the outcome. If we were to go back in time and ask the engineer(s) who were using it: “Do you think that you’re doing this right?” they would answer: yes, they are. We want to know what the various influences were that brought them to think that, which simply won’t fit into the answer of “why?”

The answer also limits the next question that we’d ask. There isn’t any room in the dialogue to discuss things such as the potential to use a server in the wrong way and have it not result in failure, or what ‘wrong’ means in this context. Can the server only be used in two ways – the ‘right’ way or the ‘wrong’ way? And does success (or, the absence of a failure) dictate which of those ways it was used? We don’t get to these crucial questions.

“Why was it used in the wrong way? The engineer who used it didn’t know how to use it properly.”

This answer is effectively a tautology, and includes a post-hoc judgement. It doesn’t tell us anything about how the engineer did use the system, which would provide a rich source of operational data, especially for engineers who might be expected to work with the system in the future. Is it really just about this one engineer? Or is it possibly about the environment (tools, dashboards, controls, tests, etc.) that the engineer is working in? If it’s the latter, how does that get captured in the Five Whys?

So what do we find in this chain we have constructed above? We find:

  • an engineer with faulty (or at least incomplete) knowledge
  • insufficient indoctrination of engineers
  • a manager who fouls things up by not being thorough enough in the training of new engineers (indeed: we can make a post-hoc judgement about her beliefs)

If this is to be taken as an example of the Five Whys, then as an engineer or engineering manager, I might not look forward to it, since it focuses on our individual attributes and doesn’t tell us much about the event other than the platitude that training (and convincing people about training) is important.

These are largely answers about “who?” not descriptions of what conditions existed. In other words, by asking “why?” in this way, we’re using failures to explain failures, which isn’t helpful.

If we ask: “Why did a particular server fail?” we can get any number of answers, but one of those answers will be used as the primary way of getting at the next “why?” step. We’ll also lose out on a huge amount of important detail, because remember: you only get one question before the next step.

If instead, we were to ask the engineers how they went about implementing some new code (or ‘subsystem’), we might hear a number of things, like maybe:

  • the approach(es) they took when writing the code
  • what ways they gained confidence (tests, code reviews, etc.) that the code was going to work in the way they expected it before it was deployed
  • what (if any) history of success or failure have they had with similar pieces of code?
  • what trade-offs they made or managed in the design of the new function?
  • how they judged the scope of the project
  • how much (and in what ways) they experienced time pressure for the project
  • the list can go on, if you’re willing to ask more and they’re willing to give more

Rather than judging people for not doing what they should have done, the new view presents tools for explaining why people did what they did. Human error becomes a starting point, not a conclusion. (Dekker, 2009)

When we ask “how?”, we’re asking for a narrative. A story.

In these stories, we get to understand how people work. By going with the “engineer was deficient, needs training, manager needs to be told to train” approach, we might not have a place to ask questions aimed at recommendations for the future, such as:

  • What might we put in place so that it’s very difficult to put that code into production accidentally?
  • What sources of confidence for engineers could we augment?

As part of those stories, we’re looking to understand people’s local rationality. When it comes to decisions and actions, we want to know how it made sense for someone to do what they did. And make no mistake: they thought what they were doing made sense. Otherwise, they wouldn’t have done it.

[Slide: “why?” versus “how?”]

Again, I’m not original with this thought. Local rationality (or as Herb Simon called it, “bounded rationality”) is something that sits firmly atop some decades of cognitive science.

These stories we’re looking for contain details that we can pull on and ask more about, which is critical as a facilitator of a post-mortem debriefing, because people don’t always know what details are important. As you’ll see later in this post, reality doesn’t work like a DVR; you can’t pause, rewind and fast-forward at will along a singular and objective axis, picking up all of the pieces along the way, acting like CSI. Memories are faulty and perspectives are limited, so a different approach is necessary.

Not just “how”

In order to get at these narratives, you need to dig for second stories. Asking “why?” will get you first stories. These are not only insufficient answers, they can be very damaging to an organization, depending on the context. As a refresher…

From Behind Human Error here’s the difference between “first” and “second” stories of human error:

First story: Human error is seen as cause of failure.
Second story: Human error is seen as the effect of systemic vulnerabilities deeper inside the organization.

First story: Saying what people should have done is a satisfying way to describe failure.
Second story: Saying what people should have done doesn’t explain why it made sense for them to do what they did.

First story: Telling people to be more careful will make the problem go away.
Second story: Only by constantly seeking out its vulnerabilities can organizations enhance safety.

 

Now, read again the straw-man example of the Five Whys above. The questions that we ask frame the answers that we will get in the form of first stories. When we ask more and better questions (such as “how?”) we have a chance at getting at second stories.

You might wonder: how did I get from the Five Whys to the topic of ‘human error’? Because once ‘human error’ is a candidate to reach for as a cause (and it will be, because it’s a simple and potentially satisfying answer to “why?”), then you will undoubtedly use it.

At the beginning of my tutorial in New York, I asked the audience this question:

[Slide: the question I asked the audience]

At the beginning of the talk, a large number of people said yes, this is correct. Steven Shorrock (who is speaking at Velocity next week in Barcelona on this exact topic) has written a great article on this way of thinking: If It Weren’t For The People. By the end of my talk, I was able to convince them that this is also the wrong focus of a post-mortem description.

This idea accompanies the Five Whys more often than not, and there are two things that I’d like to shine some light on about it:

Myth of the “human or technical failure” dichotomy

This is dualistic thinking, and I don’t have much to add to this other than what Dekker has said about it (Dekker, 2006):

“Was the accident caused by mechanical failure or by human error? It is a stock question in the immediate aftermath of a mishap. Indeed, it seems such a simple, innocent question. To many it is a normal question to ask: If you have had an accident, it makes sense to find out what broke. The question, however, embodies a particular understanding of how accidents occur, and it risks confining our causal analysis to that understanding. It lodges us into a fixed interpretative repertoire. Escaping from this repertoire may be difficult. It sets out the questions we ask, provides the leads we pursue and the clues we examine, and determines the conclusions we will eventually draw.”

Myth: during a retrospective investigation, something is waiting to be “found”

I’ll cut to the chase: there is nothing waiting to be found, or “revealed.” These “causes” that we’re thinking we’re “finding”? We’re constructing them, not finding them. We’re constructing them because we are the ones that are choosing where (and when) to start asking questions, and where/when to stop asking the questions. We’ve “found” a root cause when we stop looking. And in many cases, we’ll get lazy and just chalk it up to “human error.”

As Erik Hollnagel has said (Hollnagel, 2009, p. 85):

“In accident investigation, as in most other human endeavours, we fall prey to the What-You-Look-For-Is-What-You-Find or WYLFIWYF principle. This is a simple recognition of the fact that assumptions about what we are going to see (What-You-Look-For), to a large extent will determine what we actually find (What-You-Find).”

More to the point: “What-You-Look-For-Is-What-You-Fix”

We think there is something like the cause of a mishap (sometimes we call it the root cause, or primary cause), and if we look in the rubble hard enough, we will find it there. The reality is that there is no such thing as the cause, or primary cause or root cause. Cause is something we construct, not find. And how we construct causes depends on the accident model that we believe in. (Dekker, 2006)

Nancy Leveson comments on this idea in her excellent book Engineering a Safer World (p. 20):

Subjectivity in Selecting Events

The selection of events to include in an event chain is dependent on the stopping rule used to determine how far back the sequence of explanatory events goes. Although the first event in the chain is often labeled the ‘initiating event’ or ‘root cause’, the selection of an initiating event is arbitrary and previous events could always be added.

Sometimes the initiating event is selected (the backward chaining stops) because it represents a type of event that is familiar and thus acceptable as an explanation for the accident or it is a deviation from a standard [166]. In other cases, the initiating event or root cause is chosen because it is the first event in the backward chain for which it is felt that something can be done for correction.

The backward chaining may also stop because the causal path disappears due to lack of information. Rasmussen suggests that a practical explanation for why actions by operators actively involved in the dynamic flow of events are so often identified as the cause of an accident is the difficulty in continuing the backtracking “through” a human [166].

A final reason why a “root cause” may be selected is that it is politically acceptable as the identified cause. Other events or explanations may be excluded or not examined in depth because they raise issues that are embarrassing to the organization or its contractors or are politically unacceptable.

Learning is the goal. Any prevention depends on that learning.

So if not the Five Whys, then what should you do? What method should you take?

I’d like to suggest an alternative, which is to first accept the idea that you have to actively seek out and protect the stories from bias (and judgement) when you ask people “how?”-style questions. Then you can:

  • Ask people for their story without any replay of data that would supposedly ‘refresh’ their memory
  • Tell their story back to them and confirm you got their narrative correct
  • Identify critical junctures
  • Progressively probe and re-build how the world looked to people inside of the situation at each juncture.

As a starting point for those probing questions, we can look to Gary Klein and Sidney Dekker for the types of questions you can ask instead of “why?”…

Debriefing Facilitation Prompts

(from The Field Guide To Understanding Human Error, by Sidney Dekker)

At each juncture in the sequence of events (if that is how you want to structure this part of the accident story), you want to get to know:

  • Which cues were observed (what did he or she notice/see or did not notice what he or she had expected to notice?)
  • What knowledge was used to deal with the situation? Did participants have any experience with similar situations that was useful in dealing with this one?
  • What expectations did participants have about how things were going to develop, and what options did they think they had to influence the course of events?
  • How did other influences (operational or organizational) help determine how they interpreted the situation and how they would act?

Here are some questions Gary Klein and his researchers typically ask to find out how the situation looked to people on the inside at each of the critical junctures:

Cues
  • What were you seeing?
  • What were you focused on?
  • What were you expecting to happen?

Interpretation
  • If you had to describe the situation to your colleague at that point, what would you have told them?

Errors
  • What mistakes (for example in interpretation) were likely at this point?

Previous knowledge/experience
  • Were you reminded of any previous experience?
  • Did this situation fit a standard scenario?
  • Were you trained to deal with this situation?
  • Were there any rules that applied clearly here?
  • Did any other sources of knowledge suggest what to do?

Goals
  • What were you trying to achieve?
  • Were there multiple goals at the same time?
  • Was there time pressure or other limitations on what you could do?

Taking Action
  • How did you judge you could influence the course of events?
  • Did you discuss or mentally imagine a number of options, or did you know straight away what to do?

Outcome
  • Did the outcome fit your expectation?
  • Did you have to update your assessment of the situation?

Communications
  • What communication medium(s) did you prefer to use? (phone, chat, email, video conference, etc.)
  • Did you make use of more than one communication channel at once?

Help
  • Did you ask anyone for help?
  • What signal brought you to ask for support or assistance?
  • Were you able to contact the people you needed to contact?

For the tutorials I did at Velocity, I made a one-pager of these: http://bit.ly/DebriefingPrompts

[Screenshot: the debriefing prompts one-pager]

Try It

I have tried to outline some of my reasoning on why using the Five Whys approach is suboptimal, and I’ve given an alternative. I’ll do one better and link you to the tutorials that I gave in New York in October, which dig deeper into these concepts. This is in four parts, 45 minutes each.

Part I – Introduction and the scientific basis for post-hoc retrospective pitfalls and learning

Part II – The language of debriefings, causality, case studies, teams coping with complexity

Part III – Dynamic fault management, debriefing prompts, gathering and contextualizing data, constructing causes

Part IV – Taylorism, normal work, ‘root cause’ of software bugs in cars, Q&A

My request is that the next time you would do a Five Whys, you instead ask “how?” or the variations of the questions I posted above. If you think you get more operational data from a Five Whys and are happy with it, rock on.

If you’re more interested in this alternative and the fundamentals behind it, then there are a number of sources you can look to. You could do a lot worse than starting with Sidney Dekker’s Field Guide To Understanding Human Error.

An Explanation

For those readers who think I’m unnecessarily harsh on the Five Whys approach, I think it’s worthwhile to explain why I feel so strongly about this.

Retrospective understanding of accidents and events is important because how we make sense of the past greatly and almost invisibly influences our future. At some point in the not-so-distant past, the domain of web engineering was about selling books online and making a directory of the web. These organizations and the individuals who built them quickly gave way to organizations that now build cars, spacecraft, trains, aircraft, medical monitoring devices…the list goes on…simply because software development and distributed systems architectures are at the core of modern life.

The software worlds and the non-software worlds have collided and will continue to do so. More and more “life-critical” equipment and products rely on software and even the Internet.

Those domains have had varied success in retrospective understanding of surprising events, to say the least. Investigative approaches that are firmly based on causal oversimplification and the “Bad Apple Theory” of deficient individual attributes (like the Five Whys) have been shown not only to be unhelpful, but to objectively make learning harder, not easier. As a result, people who have made mistakes or been involved in accidents have been fired, banned from their profession, and thrown in jail for some of the very things that you could find in a Five Whys.

I sometimes feel nervous that these oversimplifications will still be around when my daughter and son are older. If they were to make a mistake, would they be blamed as a cause? I strongly believe that we can leave these old ways behind us and do much better.

My goal is not to vilify an approach, but to state explicitly that if the world is to become safer, then we have to eschew this simplicity; it will only get better if we embrace the complexity, not ignore it.

 

Epilogue: The Longer Version For Those Who Have The Stomach For Complexity Theory

The Five Whys approach follows a Newtonian-Cartesian worldview. This is a worldview that is seductively satisfying and compellingly simple. But it’s also false in the world we live in.

What do I mean by this?

There are five reasons why the Five Whys sits firmly in a Newtonian-Cartesian worldview that we should eschew when it comes to learning from past events. This is a Cliff Notes version of “The complexity of failure: Implications of complexity theory for safety investigations”:

First, it is reductionist. The narrative built by the Five Whys sits on the idea that if you can construct a causal chain, then you’ll have something to work with. In other words: to understand the system, you pull it apart into its constituent parts. Know how the parts interact, and you know the system.

Second, it assumes what Dekker has called “cause-effect symmetry” (Dekker, complexity of failure):

“In the Newtonian vision of the world, everything that happens has a definitive, identifiable cause and a definitive effect. There is symmetry between cause and effect (they are equal but opposite). The determination of the ‘‘cause’’ or ‘‘causes’’ is of course seen as the most important function of accident investigation, but assumes that physical effects can be traced back to physical causes (or a chain of causes-effects) (Leveson, 2002). The assumption that effects cannot occur without specific causes influences legal reasoning in the wake of accidents too. For example, to raise a question of negligence in an accident, harm must be caused by the negligent action (GAIN, 2004). Assumptions about cause-effect symmetry can be seen in what is known as the outcome bias (Fischhoff, 1975). The worse the consequences, the more any preceding acts are seen as blameworthy (Hugh and Dekker, 2009).”

John Carroll (Carroll, 1995) called this “root cause seduction”:

The identification of a root cause means that the analysis has found the source of the event and so everyone can focus on fixing the problem. This satisfies people’s need to avoid ambiguous situations in which one lacks essential information to make a decision (Frisch & Baron, 1988) or experiences a salient knowledge gap (Loewenstein, 1993). The seductiveness of singular root causes may also feed into, and be supported by, the general tendency to be overconfident about how much we know (Fischhoff, Slovic, & Lichtenstein, 1977).

That last bit about a tendency to be overconfident about how much we know (in this context, how much we know about the past) is a strong piece of research put forth by Baruch Fischhoff, who originally researched what we now understand to be the Hindsight Bias. Not surprisingly, Fischhoff’s doctoral thesis advisor was Daniel Kahneman (you’ve likely heard of him as the author of Thinking, Fast and Slow), whose research in cognitive biases and heuristics everyone should at least be vaguely familiar with.

The third issue with this worldview, supported by the idea of the Five Whys and something that follows logically from the earlier points, is that outcomes are foreseeable if you know the initial conditions and the rules that govern the system. The reason that you would even construct a serial causal chain like this is that you believe such foreseeability is possible.

The fourth problem is that time is irreversible. We can’t treat a causal chain as something that you can fast-forward and rewind, no matter how attractively simple that seems. This is because the socio-technical systems that we work on and work in are complex in nature, and are dynamic. Deterministic behavior (or, at least, predictability) is something that we look for in software; in complex systems this is a foolhardy search, because emergence is a property of this complexity.

And finally, there is an underlying assumption that complete knowledge is attainable. In other words: we only have to try hard enough to understand exactly what happened. The issue with this is that success and failure have many contributing causes, and there is no comprehensive and objective account. The best that you can do is to probe people’s perspectives at juncture points in the investigation. It is not possible to understand past events in any way that can be considered comprehensive.

Dekker (Dekker, 2011):

As soon as an outcome has happened, whatever past events can be said to have led up to it, undergo a whole range of transformations (Fischhoff and Beyth, 1975; Hugh and Dekker, 2009). Take the idea that it is a sequence of events that precedes an accident. Who makes the selection of the ‘‘events’’ and on the basis of what? The very act of separating important or contributory events from unimportant ones is an act of construction, of the creation of a story, not the reconstruction of a story that was already there, ready to be uncovered. Any sequence of events or list of contributory or causal factors already smuggles a whole array of selection mechanisms and criteria into the supposed ‘‘re’’construction. There is no objective way of doing this—all these choices are affected, more or less tacitly, by the analyst’s background, preferences, experiences, biases, beliefs and purposes. ‘‘Events’’ are themselves defined and delimited by the stories with which the analyst configures them, and are impossible to imagine outside this selective, exclusionary, narrative fore-structure (Cronon, 1992).

Here is a thought exercise: what if we were to try to use the Five Whys for finding the “root cause” of a success?

Why didn’t we have failure X today?

Now this question gets a lot more difficult to have one answer. This is because things go right for many reasons, and not all of them obvious. We can spend all day writing down reasons why we didn’t have failure X today, and if we’re committed, we can keep going.

So if success requires “multiple contributing conditions, each necessary but only jointly sufficient” to happen, then how is it that failure only requires just one? The Five Whys, as it’s commonly presented as an approach to improvement (or: learning?), will lead us to believe that not only is just one condition sufficient, but that condition is a canonical one, to the exclusion of all others.

* RCA, or “Root Cause Analysis” can also easily turn into “Retrospective Cover of Ass”

References

Carroll, J. S. (1995). Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability. Organization & Environment, 9(2), 175–197. doi:10.1177/108602669500900203

Dekker, S. (2004). Ten questions about human error: A new view of human factors and system safety. Mahwah, N.J: Lawrence Erlbaum.

Dekker, S., Cilliers, P., & Hofmeyr, J.-H. (2011). The complexity of failure: Implications of complexity theory for safety investigations. Safety Science, 49(6), 939–945. doi:10.1016/j.ssci.2011.01.008

Hollnagel, E. (2009). The ETTO principle: Efficiency-thoroughness trade-off: Why things that go right sometimes go wrong. Burlington, VT: Ashgate.

Leveson, N. (2012). Engineering a Safer World. Cambridge, MA: MIT Press.

 

 

Testing Evolution’s git master and GNOME continuous

I’ve wanted a feature in Evolution for a while. It was formally requested in 2002, and it just recently got fixed in git master. I only started publicly groaning about this missing feature in 2013, and mcrha finally patched it. I tested the feature and found a small bug, mcrha patched that too, and I finally re-tested it. Now I’m blogging about this process so that you can get involved too!

Why Evolution?

  • Evolution supports GPG (Geary doesn’t, Gmail doesn’t)
  • Evolution has a beautiful composer (Gmail’s sucks, just try to reply inline)
  • Evolution is Open Source and Free Software (Gmail is proprietary)
  • Evolution integrates with GNOME (Gmail doesn’t)
  • Evolution has lots of fancy, mature features (Geary doesn’t)
  • Evolution cares about your privacy (Gmail doesn’t)

The feature:

I’d like to be able to select a bunch of messages and click an archive action to move them to a specific folder. Gmail popularized this idea in 2004, two years after it was proposed for Evolution. It has finally landed.

In your account editor, you can select the “Archive Folder” that you want messages moved to:

[Screenshot: the “Archive Folder” setting in the account editor]

This will let you have a different folder set per account.

Archive like Gmail does:

If you use Evolution with a Gmail account, and you want the same functionality as the Gmail archive button, you can accomplish this by setting the Evolution archive folder to point to the Gmail “All Mail” folder, which will cause the Evolution archive action to behave as Gmail’s does.

To use this functionality (with or without Gmail), simply select the messages you want to move, and click the “Archive…” button:

[Screenshot: the “Archive…” action in the message context menu]

This is also available via the “Message” menu. You can also activate it with the Control-Alt-a shortcut. For more information, please read the description from mcrha.

GNOME Continuous:

Once the feature was patched in git master, I wanted to try it out right away! The easiest way for me to do this was to use the GNOME Continuous project that walters started. This GNOME project automatically kicks off integration builds of multiple git master trees for the most common GNOME applications.

If you follow the Gnome Continuous instructions, it is fairly easy to download an image, and then import it with virt-manager or boxes. Once it had booted up, I logged into the machine, and was able to test Evolution’s git master.

Digging deep into the app:

If you want to tweak the app for debugging purposes, it is quite easy to do this with GTKInspector. Launch it with Control-Shift-i or Control-Shift-d, and you’ll soon be poking around the app’s internals. You can change the properties you want in real-time, and then you’ll know which corresponding changes in the upstream source are necessary.

Finding a bug and re-testing:

I did find one small bug with the Evolution patches. I’m glad I found it now, instead of having to wait six months for a new Fedora version. The maintainer fixed it quickly, and all that was left to do was to re-test the new git master. To do this, I updated my GNOME Continuous image.

  1. Click on Control-Alt-F2 from the virt-manager “Send Key” menu.
  2. Log in as root (no password)
  3. Set the password to something by running the passwd command.
  4. Click on Control-Alt-F1 to return to your GNOME session.
  5. Open a terminal and run: pkexec bash.
  6. Enter your root password.
  7. Run ostree admin upgrade.
  8. Once it has finished downloading the updates, reboot the vm.

You’ll now be able to test the newest git master. Please note that it takes a bit of time for it to build, so it is not instant, but it’s pretty quick.

Taking screenshots:

I took a few screenshots from inside the VM to show to you in this blog post. Extracting them was a bit trickier because I couldn’t get SSHD running. To do so, I installed the guestfs browser on my host OS. It was very straightforward to use it to read the VM image, browse to the ~/Pictures/ directory, and then download the images to my host. Thanks rwmjones!

Conclusion:

Hopefully this will motivate you to contribute to GNOME early and often! There are lots of great tools available, and lots of applications that need some love.

Happy Hacking,

James


It’s the 5th Anniversary of DevOps

I’ve been proud to have played a part in the rise of the global phenomenon known as DevOps Days. If you aren’t aware of the history of the DevOps movement, it traces its roots (and name) directly back to the first DevOps Days event, organized by Patrick Debois, in Ghent, Belgium.

For its 5th anniversary, DevOps Days is returning to Ghent. John Willis recorded these short interviews with some of the original attendees to commemorate the upcoming milestone event.


Hacking out an Openshift app

I had an itch to scratch, and I wanted to get a bit more familiar with Openshift. I had used it in the past, but it was time to have another go. The app and the code are now available. Feel free to check out:

https://pdfdoc-purpleidea.rhcloud.com/

This is a simple app that takes the URL of a markdown file on GitHub, and outputs a pandoc-converted PDF. I wanted to use pandoc specifically, because it produces PDFs that are beautifully created with LaTeX. To embed a link in your upstream documentation that points to a PDF, just append the file’s URL to this app’s URL, under a /pdf/ path. For example:

https://pdfdoc-purpleidea.rhcloud.com/pdf/https://github.com/purpleidea/puppet-gluster/blob/master/DOCUMENTATION.md

will send you to a PDF of the puppet-gluster documentation. This will make it easier to accept questions as FAQ patches, without needing to keep an embedded binary PDF in git constantly updated.
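For reference, the conversion itself is essentially what pandoc gives you locally. A rough equivalent on your own machine (assuming pandoc and a LaTeX engine are installed, and using the raw view of the blob URL above) would be:

# fetch the markdown from GitHub and let pandoc typeset it as a PDF (via LaTeX)
$ curl -sL https://raw.githubusercontent.com/purpleidea/puppet-gluster/master/DOCUMENTATION.md | pandoc -f markdown -o DOCUMENTATION.pdf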

If you want to hear more about what I did, read on…

The setup:

Start by getting a free Openshift account. You’ll also want to install the client tools. Nothing is worse than having to interact with your app via a web interface. Hackers use terminals. Luckily, the Openshift team knows this, and they’ve created a great command line tool called rhc to make it all possible.

I started by following their instructions:

$ sudo yum install rubygem-rhc
$ sudo gem update rhc

Unfortunately, this left me with a problem:

$ rhc
/usr/share/rubygems/rubygems/dependency.rb:298:in `to_specs': Could not find 'rhc' (>= 0) among 37 total gem(s) (Gem::LoadError)
    from /usr/share/rubygems/rubygems/dependency.rb:309:in `to_spec'
    from /usr/share/rubygems/rubygems/core_ext/kernel_gem.rb:47:in `gem'
    from /usr/local/bin/rhc:22:in `'

I solved this by running:

$ gem install rhc

This makes my user-installed rhc take precedence over the system one. Then run:

$ rhc setup

and the rhc client will take you through some setup steps such as uploading your public ssh key to the Openshift infrastructure. The beauty of this tool is that it will work with the Red Hat hosted infrastructure, or you can use it with your own infrastructure if you want to host your own Openshift servers. This alone means you’ll never get locked in to a third-party provider’s terms or pricing.

Create a new app:

To get a fresh python 3.3 app going, you can run:

$ rhc create-app <appname> python-3.3

From this point on, it’s fairly straightforward, and you can now hack your way through the app in python. To push a new version of your app into production, it’s just a git commit away:

$ git add -p && git commit -m 'Awesome new commit...' && git push && rhc tail

Creating a new app from existing code:

If you want to push a new app from an existing code base, it’s as easy as:

$ rhc create-app awesomesauce python-3.3 --from-code https://github.com/purpleidea/pdfdoc
Application Options
-------------------
Domain:      purpleidea
Cartridges:  python-3.3
Source Code: https://github.com/purpleidea/pdfdoc
Gear Size:   default
Scaling:     no

Creating application 'awesomesauce' ... done


Waiting for your DNS name to be available ... done

Cloning into 'awesomesauce'...
The authenticity of host 'awesomesauce-purpleidea.rhcloud.com (203.0.113.13)' can't be established.
RSA key fingerprint is 00:11:22:33:44:55:66:77:88:99:aa:bb:cc:dd:ee:ff.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'awesomesauce-purpleidea.rhcloud.com,203.0.113.13' (RSA) to the list of known hosts.

Your application 'awesomesauce' is now available.

  URL:        http://awesomesauce-purpleidea.rhcloud.com/
  SSH to:     00112233445566778899aabb@awesomesauce-purpleidea.rhcloud.com
  Git remote: ssh://00112233445566778899aabb@awesomesauce-purpleidea.rhcloud.com/~/git/awesomesauce.git/
  Cloned to:  /home/james/code/awesomesauce

Run 'rhc show-app awesomesauce' for more details about your app.

In my case, my app also needs some binaries installed. I haven’t yet automated this process, but I think it can be done by creating a custom cartridge. Help to do this would be appreciated!

Updating your app:

In the case of an app that I already deployed with this method, updating it from the upstream source is quite easy. You just pull down the relevant commits, and then push them up to your app’s git repo:

$ git pull upstream master 
From https://github.com/purpleidea/pdfdoc
 * branch            master     -> FETCH_HEAD
Updating 5ac5577..bdf9601
Fast-forward
 wsgi.py | 2 --
 1 file changed, 2 deletions(-)
$ git push origin master 
Counting objects: 7, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 312 bytes | 0 bytes/s, done.
Total 3 (delta 2), reused 0 (delta 0)
remote: Stopping Python 3.3 cartridge
remote: Waiting for stop to finish
remote: Waiting for stop to finish
remote: Building git ref 'master', commit bdf9601
remote: Activating virtenv
remote: Checking for pip dependency listed in requirements.txt file..
remote: You must give at least one requirement to install (see "pip help install")
remote: Running setup.py script..
remote: running develop
remote: running egg_info
remote: creating pdfdoc.egg-info
remote: writing pdfdoc.egg-info/PKG-INFO
remote: writing dependency_links to pdfdoc.egg-info/dependency_links.txt
remote: writing top-level names to pdfdoc.egg-info/top_level.txt
remote: writing manifest file 'pdfdoc.egg-info/SOURCES.txt'
remote: reading manifest file 'pdfdoc.egg-info/SOURCES.txt'
remote: writing manifest file 'pdfdoc.egg-info/SOURCES.txt'
remote: running build_ext
remote: Creating /var/lib/openshift/00112233445566778899aabb/app-root/runtime/dependencies/python/virtenv/venv/lib/python3.3/site-packages/pdfdoc.egg-link (link to .)
remote: pdfdoc 0.0.1 is already the active version in easy-install.pth
remote: 
remote: Installed /var/lib/openshift/00112233445566778899aabb/app-root/runtime/repo
remote: Processing dependencies for pdfdoc==0.0.1
remote: Finished processing dependencies for pdfdoc==0.0.1
remote: Preparing build for deployment
remote: Deployment id is 9c2ee03c
remote: Activating deployment
remote: Starting Python 3.3 cartridge (Apache+mod_wsgi)
remote: Application directory "/" selected as DocumentRoot
remote: Application "wsgi.py" selected as default WSGI entry point
remote: -------------------------
remote: Git Post-Receive Result: success
remote: Activation status: success
remote: Deployment completed with status: success
To ssh://00112233445566778899aabb@awesomesauce-purpleidea.rhcloud.com/~/git/awesomesauce.git/
   5ac5577..bdf9601  master -> master
$

Final thoughts:

I hope this helped you getting going with Openshift. Feel free to send me patches!

Happy hacking!

James


Supporting Millions of Pretty URL Rewrites in Nginx with Lua and Redis

About a year ago, I was tasked with greatly expanding our url rewrite capabilities. Our file-based nginx rewrites were becoming a performance bottleneck and we needed to make an architectural leap that would take us to the next level of SEO wizardry.

In comparison to the total number of product categories in our database, Stylight supports a handful of “pretty URLs” – those understandable by a human being. Take http://www.stylight.com/Sandals/Women/ – pretty obvious what’s going to be on that page, right?

Our web application, however, only understands http://www.stylight.com/search.action?gender=women&tag=10580&tag=10630. So, nginx needs to translate pretty URLs into something our app can find, fetch and return to your page. And this needs to happen as fast as computationally possible. Click on that link and you’ll notice we redirect you to the pretty URL. This is because we’ve found out women really love sandals so we want to give them a page they’d like to bookmark.

We import and update millions of products a day, so the vast majority of our links start out as “?tag=10580”. Googlebot knows how dynamic our site is, so it’s constantly crawling and indexing these functional links to feed its search results. As we learn from our users and ad campaigns which products are really interesting, we dynamically assign pretty URLs and inform Google with 301 redirects.

This creates 2 layers of redirection and doubles the URLs our webserver needs to know about:

  • 301 redirects for the user (and search engines): ?gender=women&tag=10580&tag=10630 -> /Sandals/Women/
  • internal rewrites for our app: /Sandals/Women/ -> ?gender=women&tag=10580&tag=10630
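As a rough sketch of the data model (the key prefixes here are hypothetical; the real layout may differ), each of those ends up as a simple key/value pair in redis:

# internal rewrite: pretty URL -> the query string our app understands
$ redis-cli SET "rewrite:/Sandals/Women/" "/search.action?gender=women&tag=10580&tag=10630"

# 301 redirect: functional URL -> the canonical pretty URL
$ redis-cli SET "redirect:/search.action?gender=women&tag=10580&tag=10630" "/Sandals/Women/"

# at request time, the Lua handler does the equivalent of:
$ redis-cli GET "rewrite:/Sandals/Women/"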

So, how can we provide millions of pretty URLs to showcase all facets of our product search results?

The problem with file based, nginx rewrites: memory & reload times

With 800K rewrites and redirects (or R&Rs for short) in over 12 country rewrite.conf files, our “next level” initially means about 8 million R&Rs. But we could barely cope with our current requirements.

File-based R&Rs are loaded into memory for all 16 nginx workers. Besides consuming 3GB of RAM, it took almost 5 seconds just to reload or restart nginx! As a quick test, I doubled the amount of rewrites for one country. 20 seconds later nginx was successfully reloaded and running with 3.5GB of memory. Talk about “scale fail”.

What are the alternatives?

Google searching for nginx with millions of rewrites or redirects didn’t give a whole lot of insight, but digging through what I found eventually led me to OpenResty. Not being a full-time sysadmin, I don’t care to build and maintain custom binaries.

My next search for OpenResty on Ubuntu Trusty led me to lua-nginx-redis – perhaps not the most performant solution, but I’d take the compromise for community supported patches. A sudo apt-get install lua-nginx-redis gave us the basis for our new architecture.

As an initial test, I copied our largest country’s rewrites into redis, made a quick lua script for handling the rewrites and made my first head-to-head test:
[Chart: response times, Redis+Lua versus file-based nginx rewrites]

I included network round trip times in my test to get an idea of the complete performance improvement we hoped to realize with this re-architecture. Interesting how quite a few URLs (those towards the bottom of the rewrite file) caused significant spikes in response times. From these initial results, we decided to make the investment and completely overhaul our rewrite and redirect infrastructure.
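The head-to-head itself doesn’t need to be fancy. Something along these lines (the hostname and URL list are placeholders) is enough to get a per-URL response time, network round trip included:

# time every URL from the rewrite file against one of the two stacks
while read -r url; do
    printf '%s ' "$url"
    curl -o /dev/null -s -w '%{time_total}\n' "http://test-backend.example.com$url"
done < rewrite-urls.txt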

The 301 redirects lived exclusively on the frontend load balancers while the internal rewrites were handled by our app servers. First order of business would be to combine these, leaving the application to concentrate on just serving requests. Next, we set up a cronjob to incrementally update R&Rs every 5 minutes. I gave the R&Rs a TTL of one month to keep the redis db tidy. Weekly, we run a full insert, which resets the TTL. And, yes, we monitor the TTLs of our R&Rs – don’t want them all disappearing overnight!
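In redis terms, the TTL handling is roughly this (one month ≈ 2592000 seconds; the key name is again hypothetical):

# the incremental cron job (and the weekly full insert) write keys with a one-month TTL
$ redis-cli SET "rewrite:/Sandals/Women/" "/search.action?gender=women&tag=10580&tag=10630" EX 2592000

# monitoring: how many seconds does this key have left?
$ redis-cli TTL "rewrite:/Sandals/Women/"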

The performance of Lua and Redis

We launched the new solution in the middle of July this year – just over three months ago.
And our average response time during the same period:
[Chart: average response time, as measured by Pingdom]

As you can see, despite rapidly growing traffic, we saw the first significant improvements to our site’s response time just by moving the R&Rs out of files and into redis. Reload times for nginx are instant – there are no more rewrite files to load and distribute per worker – and memory usage has dropped below 900MB.

Since the launch, we’ve doubled our number of R&Rs (check out how the memory scales):
[Chart: growth in Redis keys and memory usage]

Soon we’ll be able to serve all our URLs like http://www.stylight.com/Dark-Green/Long-Sleeve/T-Shirts/Gap/Men/ by default. No, we’re not quite there yet, but if you need that kinda shirt…

We’ve got a lot of SEO work ahead of us which will require millions more rewrites. And now we have a performant architecture which will support it. If you have any questions or would like to know more details, don’t hesitate to contact me @danackerson.

Continuous integration for Puppet modules

I just patched puppet-gluster and puppet-ipa to bring their infrastructure up to date with the current state of affairs…

What’s new?

  • Better README’s
  • Rake syntax checking (fewer oopsies)
  • CI (testing) with travis on git push (automatic testing for everyone)
  • Use of .pmtignore to ignore files from puppet module packages (finally)
  • Pushing modules to the forge with blacksmith (sweet!)

This last point deserves another mention. Puppetlabs created the “forge” to try to provide some sort of added value to their stewardship. Personally, I like to look for code on github instead, but nevertheless, some do use the forge. The problem is that to upload new releases, you need to click your mouse like a windows user! Someone has finally solved that problem! If you use blacksmith, a new build is just a rake push away!
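If you follow the usual blacksmith setup (this is my reading of it; check your module’s Rakefile for the real task names), the workflow is roughly:

# the Rakefile pulls the tasks in with: require 'puppet_blacksmith/rake_tasks'
$ rake module:bump    # bump the version in the module metadata
$ rake module:push    # build the package and push it to the forge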

Have a look at this example commit if you’re interested in seeing the plumbing.

Better documentation and FAQ answering:

I’ve answered a lot of questions by email, but this only helps out individuals. From now on, I’d appreciate if you asked your question in the form of a patch to my FAQ. (puppet-gluster, puppet-ipa)

I’ll review and merge your patch, including a follow-up patch with the answer! This way you’ll get more familiar with git and sending small patches, everyone will benefit from the response, and I’ll be able to point you to the docs (and even a specific commit) to avoid responding to already answered questions. You’ll also have the commit information of someone else who already had this problem. Cool, right?
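If you’ve never sent a patch before, it can be as simple as something like this (using puppet-gluster as the example; adjust the file name if the FAQ lives elsewhere in the repo):

$ git clone https://github.com/purpleidea/puppet-gluster
$ cd puppet-gluster
# add your question (and, if you know it, a first stab at the answer) to the docs
$ $EDITOR DOCUMENTATION.md
$ git commit -a -m 'DOCUMENTATION: add FAQ entry about ...'
$ git format-patch HEAD^    # then email the patch, or open a pull request instead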

Happy hacking,

James


A real whisper-to-InfluxDB program.

The whisper-to-influxdb migration script I posted earlier is pretty bad. A shell script, without concurrency, and an undiagnosed performance issue. I hinted that one could write a Go program using the unofficial whisper-go bindings and the influxdb Go client library. That's what I did now, it's at github.com/vimeo/whisper-to-influxdb. It uses configurable amounts of workers for both whisper fetches and InfluxDB commits, but it's still a bit naive in the sense that it commits to InfluxDB one series at a time, irrespective of how many records are in it. My series, and hence my commits, have at most 60k records, and presumably InfluxDB could handle a lot more per commit, so we might leverage better batching later. Either way, this way I can consistently commit about 100k series every 2.5 hours (or 10/s), where each series has a few thousand points on average, with peaks up to 60k points. I usually play with 1 to 30 InfluxDB workers. Even though I've hit a few InfluxDB issues, this tool has enabled me to fill in gaps after outages and to do a restore from whisper after a complete database wipe.

Fixing dropbox “conflicted copy” problems

I usually avoid proprietary cloud services because of freedom, privacy and vendor lock-in concerns. In addition, there are some excellent libre (and hosted) services such as WordPress, Wikipedia and OpenShift which don’t have the above problems. Thirdly, there are everyday Free Software tools such as Fedora GNU/Linux, Libreoffice, and git-annex-assistant which make my computing much more powerful. Finally, there are some hosted services that I use that don’t lock me in because I use them as push-only mirrors, and I only interact with them using Free Software tools. The two examples are GitHub and Dropbox.

Today, Dropbox bit me. Here’s how I saved my data.

Dropbox integrates with GNOME’s nautilus to sync your data to their proprietary cloud hosting. I periodically run the dropbox client to sync any changes to my public files up to their servers. Today, the client decided that some of my newer files were older than the stored server-side versions, and promptly overwrote my newer versions.

Thankfully I have real backups, and, to be fair, Dropbox actually renamed my newer files instead of blatantly clobbering them. My filesystem now looks like this:

$ tree files/
files/
|-- bar
|-- baz
|   |-- file1
|   |-- file1 (james's conflicted copy 2014-09-29)
|   |-- file2 (james's conflicted copy 2014-09-29).sh
|   `-- file2.sh
`-- foo
    `-- magic.sh

You’ll note that my previously clean file system now has the “conflicted copy” versions everywhere. These are the good versions, whereas in the example above file1 and file2.sh are the older unwanted versions.

I spent some time with find and diff convincing myself that this was true, and eventually I wrote a script. The script looks through the current working directory for “conflicted copy” matches, saves the unwanted versions (just in case) and then clobbers them with the good “conflicted” version.
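If you want to do the same sanity check before touching anything, here is a rough way to eyeball each pair (hypothetical: it hard-codes my hostname and the date, just like the script below):

# diff every "conflicted copy" against its plain counterpart
find . -name "*conflicted copy*" -print0 | while IFS= read -r -d '' f; do
    orig="${f/ (james's conflicted copy 2014-09-29)/}"
    echo "== $f"
    diff -u -- "$orig" "$f"
done

The actual fix is the script below.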

Please look through, edit, and understand this script before running it. It might not be what you want, and it was designed to only work for me. It is available as a gist, and below in the body of this article.

$ cat fix-dropbox.sh 
#!/bin/bash

# XXX: use at your own risk - do not run without understanding this first!
exit 1

# safety directory
BACKUP='/tmp/fix-dropbox/'

# TODO: detect or pick manually...
NAME=`hostname`
#NAME='myhostname'
DATE='2014-09-29'

mkdir -p "$BACKUP"
find . -path "*(*'s conflicted copy [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]*" -print0 | while read -d $'' -r file; do
    printf 'Found: %s\n' "$file"

    # TODO: detect or pick manually...
    #NAME='XXX'
    #DATE='2014-09-29'

    STRING=" (${NAME}'s conflicted copy ${DATE})"
    #echo $STRING
    RESULT=`echo "$file" | sed "s/$STRING//"`
    #echo $RESULT

    SAVE="$BACKUP"`dirname "$RESULT"`
    #echo $SAVE
    mkdir -p "$SAVE"
    cp "$RESULT" "$SAVE"
    mv "$file" "$RESULT"

done

You can thank bash for saving your data. Stop bashing it and read this article instead.

Happy hacking,

James

 


InfluxDB as a graphite backend, part 2

The Graphite + InfluxDB series continues.

  • In part 1, "On Graphite, Whisper and InfluxDB" I described the problems of Graphite's whisper and ceres, why I disagree with common graphite clustering advice as being the right path forward, what a great timeseries storage system would mean to me, why InfluxDB - despite being the youngest project - is my main interest right now, and introduced my approach for combining both and leveraging their respective strengths: InfluxDB as an ingestion and storage backend (and at some point, realtime processing and pub-sub) and graphite for its renowned data processing-on-retrieval functionality. Furthermore, I introduced some tooling: carbon-relay-ng to easily route streams of carbon data (metrics datapoints) to storage backends, allowing me to send production data to Carbon+whisper as well as InfluxDB in parallel, graphite-api, the simpler Graphite API server, with graphite-influxdb to fetch data from InfluxDB.
  • Not Graphite related, but I wrote influx-cli which I introduced here. It allows you to easily interface with InfluxDB and measure the duration of operations, which will become useful for this article.
  • In the Graphite & Influxdb intermezzo I shared a script to import whisper data into InfluxDB and noted some write performance issues I was seeing, but the better part of the article described the various improvements done to carbon-relay-ng, which is becoming an increasingly versatile and useful tool.
  • In part 2, which you are reading now, I'm going to describe recent progress, share more info about my setup, testing results, state of affairs, and ideas for future work

Progress made

  • InfluxDB saw two major releases:
    • 0.7 (and followups), which was mostly about some needed features and bug fixes
    • 0.8 was all about bringing some major refactorings into the hands of early adopters/testers: support for multiple storage engines, configurable shard spaces, rollups and retention schemes. There was some other useful stuff like speed and robustness improvements for the graphite input plugin (by yours truly) and various things like regex filtering for 'list series'. Note that a bunch of older bugs remained open throughout this release (most notably the broken derivative aggregator), and a bunch of new ones appeared. Maybe this is why the release was mostly in the dark. In this context, it's not so bad, because we let graphite-api do all the processing, but if you want to query InfluxDB directly you might hit some roadblocks.
    • An older fix, but worth mentioning: series names can now also contain any character, which means you can easily use metrics2.0 identifiers. This is a welcome relief after having struggled with Graphite's restrictions on metric keys.
  • graphite-api received various bug fixes and support for templating, statsd instrumentation and caching.
    Much of this was driven by graphite-influxdb: the caching allows us to cache metadata and the statsd integration gives us insights into the performance of the steps it goes through of building a graph (getting metadata from InfluxDB, querying InfluxDB, interacting with cache, post processing data, etc).
  • the progress on InfluxDB and graphite-api in turn enabled graphite-influxdb to become faster and simpler (note: graphite-influxdb requires InfluxDB 0.8). Furthermore you can now configure series resolutions (but different retentions per series is on the roadmap, see State of affairs and what's coming), and of course it also got a bunch of bugfixes.
Because of all these improvements, all involved components are now ready for serious use.

Putting it all together, with docker

Docker probably needs no introduction, it's a nifty tool to build an environment with given software installed, and allows to easily deploy it and run it in isolation. graphite-api-influxdb-docker is a very creatively named project that generates the - also very creatively named - docker image graphite-api-influxdb, which contains graphite-api and graphite-influxdb, making it easy to hook in a customized configuration and get it up and running quickly. This is the recommended way to set this up, and this is what we run in production.
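Getting it running then boils down to something like this (a sketch only: the image name is the one mentioned above, but the exposed port and config mount point are assumptions, so check the project's README for the real invocation):

# run the combined graphite-api + graphite-influxdb image, with your own config mounted in
$ docker run -d \
    -p 8000:8000 \
    -v /etc/graphite-api.yaml:/etc/graphite-api.yaml:ro \
    graphite-api-influxdb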

The setup

  • a server running InfluxDB and graphite-api with graphite-influxdb via the docker approach described above:
    dell PowerEdge R610
    24 x Intel(R) Xeon(R) X5660  @ 2.80GHz
    96GB RAM
    perc raid h700
    6x600GB seagate 10k rpm drives in raid10 = 1.6 TB, Adaptive Read Ahead, Write Back, 64 kB blocks, no read caching
    no sharding/shard spaces, compiled from git just before 0.8, using LevelDB (not rocksdb, which is now the default)
    LevelDB max-open-files = 10000 (lsof shows about 30k open files total for the InfluxDB process), LRU 4096m, everything else is default I think.
    
  • a server running graphite-web, carbon, and whisper:
    dell PowerEdge R710
    16 x Intel(R) Xeon(R) E5640  @ 2.67GHz
    96GB RAM
    perc raid h700
    8x150GB seagate 15k rpm in raid5 = 952 GB, Read Ahead, Write Back, 64 kB blocks, no read caching
    MAX_UPDATES_PER_SECOND = 1000  # to sequentialize writes
    
  • a relay server running carbon-relay-ng that sends the same production load into both. (about 2500 metrics/s, or 150k minutely)
As you can tell, on both machines RAM is vastly over provisioned, and they have lots of cpu available (the difference in cores should be negligible), but the difference in RAID level is important to note: RAID 5 comes with a write penalty. Even though the whisper machine has more, and faster disks, it probably has a disadvantage for writes. Maybe. Haven't done raid stuff in a long time, and I haven't it measured it out.
Clearly you'll need to take the results with a grain of salt: unfortunately I do not have two systems available with the same configuration, and their baseline (raw) performance is unknown.
Note: no InfluxDB clustering, see State of affairs and what's coming.
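As a quick sanity check that the relay really delivers the same data to both backends, you can spot-check a metric on each side; the series name, paths and query syntax below are purely illustrative:

# last few points of a metric in whisper...
whisper-fetch.py --from=$(date -d '10 minutes ago' +%s) \
    /opt/graphite/storage/whisper/some/metric.wsp | tail -n 5
# ...and the same series in InfluxDB (0.8 SQL dialect, name quoted because it contains dots)
echo 'select * from "some.metric" limit 5' | influx-cli -db graphite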

The empirical validation & migration

Once everything was set up and I could confidently send 100% of traffic to InfluxDB via carbon-relay-ng, it was trivial to run our dashboards with a flag deciding which server to query. This way I have literally been running our graphite dashboards side by side, allowing us to compare both stacks on:
  • visual differences: after a bunch of work and bug fixing, we got to a point where both dashboards looked almost exactly the same. (note that graphite-api's implementation of certain functions can behave slightly differently, see for example this divideSeries bug)
  • speed differences, by simply refreshing both pages and watching the PNGs load, with some assistance from firebug's network requests profiler. The difference here was big: graphs served up by graphite-api + InfluxDB loaded considerably faster. A page with 40 or so graphs would load in a few seconds instead of 20-30 seconds (on both first and subsequent hits). This is for our default, 6-hour timeframe views. When cranking the timeframes up to a couple of weeks, graphite-api + InfluxDB was still faster.
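If you want a rough number without a browser, timing the render endpoint of both stacks directly works too; the hostnames, port and target below are made up:

# same render query against both stacks; -o /dev/null so we only measure server + transfer time
time curl -so /dev/null 'http://graphite-web.example.com/render?target=some.metric&from=-6h&format=png'
time curl -so /dev/null 'http://graphite-api.example.com:8000/render?target=some.metric&from=-6h&format=png'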
Soon enough my colleagues started asking to make graphite-api + InfluxDB the default, as it was much faster in all common cases. I flipped the switch and everybody has been happy.

When loading a page with many dashboards, the InfluxDB machine will occasionally spike up to 500% cpu, though I rarely get to see any iowait (!), even after syncing the block cache (I just realized it'll probably still use the cache for reads after a sync?).
The carbon/whisper machine, on the other hand, is always fighting iowait. That could be caused by the RAID 5 write amplification, but the random io due to the whisper format probably has more to do with it. Via MAX_UPDATES_PER_SECOND I've tried to linearize writes, with mixed success, but I've never gone too deep into it. So comparing write performance would be unfair in these circumstances; I am only comparing reads in these tests. Despite the different storage setups, the Linux block cache should make things fair for reads: whisper's iowait will handicap the reads, but I always did successive runs with fully loaded PNGs to make sure the block cache was warm.
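For the record, these observations come from basic tooling, nothing fancy; something along these lines on each box:

# per-device utilization and iowait, refreshed every 5 seconds
iostat -x 5
# how much RAM the kernel is currently using as block cache
free -m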

A "slightly more professional" benchmark

I could have stopped here, but the validation above was not very scientific. I wanted to do a somewhat more formal benchmark, to measure read speeds (though I did not have much time so it had to be quick and easy).
I wanted to compare InfluxDB vs whisper, and specifically how performance scales as you play with parameters such as number of series, points per series, and time range fetched (i.e. amount of points). I posted the benchmark on the InfluxDB mailing list. Look there for all information. I just want to reiterate the conclusion here: I was surprised. Because of the results above, I had assumed that InfluxDB would perform reads noticeably quicker than whisper but this is not the case. (maybe because whisper reads are nicely sequential - it's mostly writes that suffer from the whisper format)
This very much contrasts with my earlier findings, where the graphite-api + InfluxDB powered dashboards clearly took the lead. I have yet to figure out why. Maybe it has something to do with the performance of graphite-web vs graphite-api itself, gunicorn vs apache, worker configuration, or maybe InfluxDB only starts outperforming whisper as concurrency increases. Some more investigation is definitely needed!

Future benchmarks

The benchmark above was very simple to execute, as it only requires influx-cli and whisper-fetch (so you can easily check for yourself), but clearly there is a need to test more realistic scenarios with concurrent reads, and doing some write benchmarks would be nice too.
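For reference, the kind of single-series measurement I mean is as simple as the following; the series name, whisper path and time range are placeholders:

start=$(date -d '6 hours ago' +%s)
# read a 6-hour window straight from a whisper file...
time whisper-fetch.py --from="$start" \
    /opt/graphite/storage/whisper/some/metric.wsp > /dev/null
# ...and the same window from InfluxDB via influx-cli (0.8 SQL dialect)
time echo 'select * from "some.metric" where time > now() - 6h' | influx-cli -db graphite > /dev/null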
We should also look into cpu and memory usage. I have had the luxury of being able to completely ignore memory usage, but others seem to notice excessive InfluxDB memory usage.
I would also like to see storage efficiency tests. Last time I checked, using LevelDB I was pretty close to 24 B per record, which makes sense: time, seq_no and value are all 64-bit values, and each record has those 3 fields. (This was with snappy compression enabled, so it didn't seem to help much.) With whisper, I have files where the file size in bytes divided by the total number of records comes down to 114; for others, 31. I haven't looked much into it, but it looks like InfluxDB is at least more storage efficient. Also, whisper explicitly encodes None values, whereas with InfluxDB those are implied (and require no space).
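A rough way to reproduce the whisper numbers (whisper-info.py ships with whisper; the exact output format I'm parsing here is an assumption):

# crude bytes-per-point estimate for one whisper file: file size divided by
# the total number of point slots across all archives (header overhead included)
f=/opt/graphite/storage/whisper/some/metric.wsp
size=$(stat -c %s "$f")
points=$(whisper-info.py "$f" | awk '/points:/ {sum += $2} END {print sum}')
echo "$f: $((size / points)) bytes per point"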

conclusion: many tests and benchmarks should happen, but I don't really have time to conduct them. Hopefully other people in the community will take this on.

State of affairs and what's coming

  • InfluxDB typically performs pretty well, but not in all cases. More validation is needed. It wouldn't surprise me at this point if tools like hbase/Cassandra/riak clearly outperform InfluxDB, but keep in mind that InfluxDB is a young project; a year or two from now it'll probably perform much better. (And then again, it's not all about raw performance: InfluxDB has other strengths.)
  • A long-time goal is now a reality: you can use any Graphite dashboard on top of InfluxDB, as long as the data is stored in a graphite-compatible format. Again, the easiest way to get this running is via graphite-api-influxdb-docker. There are two issues to be mentioned, though:
    • With the 0.8 release out the door, the shard spaces/rollups/retention intervals feature will start stabilizing, so we can start supporting multiple retention intervals per metric.
    • Because InfluxDB clustering is undergoing major changes, and because clustering is not a high priority for me, I haven't needed to worry about it. I'll probably only start looking at clustering somewhere in 2015, because I have more pressing issues.
  • Once the new clustering system and the storage subsystem have matured (sounds like a v1.0 ~ v1.2 to me), we'll get more speed improvements and robustness. Most of the integration work is done; it's just a matter of doing smaller improvements and bug fixes, and waiting for InfluxDB to become better. Maintaining this stack aside, I personally will start focusing more on:
    • per-second resolution in our data feeds, and potentially storage
    • realtime (but basic) anomaly detection, realtime graphs for some key timeseries. Adrian Cockcroft had an inspirational piece in his Monitorama keynote about how alerts from timeseries should trigger within seconds.
    • Mozilla's awesome heka project (this heka video is great), which should help a lot with the above. I'm also looking at Etsy's kale stack for anomaly detection.
    • metrics 2.0, and making sure metrics 2.0 works well with InfluxDB. Up to now I find the series/columns data model too limiting and arbitrary; it could be so much more powerful, and the same goes for the query language.
  • Can we do anything else to make InfluxDB (+graphite) faster? Yes!
    • Long term, of course, InfluxDB should have powerful enough processing functions and query syntax, so that we don't even need a graphite layer anymore.
    • A storage engine optimized for fixed intervals would probably help: we could make the timestamps implicit instead of explicit, and maybe make the sequence number field optional. Each of these fields currently consumes 1/3 of the record. The sequence number field is not only useless in the Graphite use case; I've also rarely seen people make use of it in other use cases. Not storing the values as 64-bit floats would help too. Finally, we could have InfluxDB fill in None values without doing a "group by" (timeframe consolidation).
    • Then of course, there are projects to replace graphite-web/graphite-api with a Go codebase: graphite-ng and carbonapi. The latter is more production-ready, but depends on some custom tooling and does its io using protobufs. It performs an order of magnitude better than the python api server, though! I haven't touched graphite-ng in a while, but hopefully at some point I can take it up again.
  • Another thing to keep in mind when switching to graphite-api + InfluxDB: you lose the graphite composer. I have a few people relying on it, so I can either patch it to talk to graphite-api (meh), separate it out (meh), or replace it with a nicer dashboard like tessera, grafana or descartes (or Graph-Explorer, but that can be a bit too much of a paradigm shift).
  • some more InfluxDB stuff I'm looking forward to:
    • binary protocol and result streaming (faster communication and responses!) (the latter might not get implemented though)
    • "list series" speed improvements (if metadata querying gets fast enough, we won't need ES anymore for metrics2.0 index)
    • InfluxDB instrumentation, so we actually start getting an idea of what's going on in the system; a lot of the testing and troubleshooting is still done in the dark.
  • Tracking exceptions in graphite-api is much harder than it should be. Currently there's no way to display exceptions to the user (in the http response) or even to log them, so sometimes you'll get http 500 responses and not know why. You can use the sentry integration, which works all right but is clunky. Hopefully this will be addressed soon.

Conclusion

The graphite-influxdb stack works and is ready for general consumption. It's easy to install and operate, and performs well. I expect InfluxDB to mature over time and ultimately meet all my requirements for the ideal backend, but it definitely has a long way to go, and more benchmarks and tests are needed. Keep in mind that we're not doing large volumes of metrics: for small/medium shops this solution should work well, but at larger scales you will definitely run into issues, and you might conclude that InfluxDB is not for you (yet). There are alternative projects, after all.

Finally, a closing thought:
Having graphs and dashboards that look nice and load fast is a good thing, but keep in mind that graphs and dashboards should be a last resort: they're the solution when all else fails. The fewer graphs you need, the better you're doing.
How can you avoid needing graphs? Automatic alerting on your data.

I see graphs as a temporary measure: they provide headroom while you develop an understanding of the operational behavior of your infrastructure, conceive a model of it, and implement the alerting you need to do troubleshooting and capacity planning. Of course, this process consumes more resources (time and otherwise), and these expenses are not always justifiable, but I think this is the ideal case we should be working towards.


Either way, good luck and have fun!

Graphite & InfluxDB intermezzo: migrating old data and a more powerful carbon relay

Migrating data from whisper into InfluxDB

"How do i migrate whisper data to influxdb" is a question that comes up regularly, and I've always replied it should be easy to write a tool to do this. I personally had no need for this, until a recent small influxdb outage where I wanted to sync data from our backup server (running graphite + whisper) to influxdb, so I wrote a script:
#!/bin/bash
# migrate a time range of whisper data into InfluxDB (0.8), by piping insert
# statements into influx-cli through a fifo. Adjust wsp_dir, start, end and db
# below; whisper-fetch.py and influx-cli need to be on the $PATH.
# whisper dir without trailing slash.
wsp_dir=/opt/graphite/storage/whisper
start=$(date -d 'sep 17 6am' +%s)
end=$(date -d 'sep 17 12pm' +%s)
db=graphite
pipe_path=$(mktemp -u)
mkfifo "$pipe_path"
function influx_updater() {
    influx-cli -db "$db" -async < "$pipe_path"
}
influx_updater &
while read -r wsp; do
  series=${wsp////.}   # some/path/metric -> some.path.metric
  echo "updating $series ..."
  whisper-fetch.py --from="$start" --until="$end" "$wsp_dir/$wsp.wsp" |
    grep -v 'None$' |
    awk -v series="$series" '{print "insert into " series " values (" $1 "000,1," $2 ")"}' > "$pipe_path"
done < <(find "$wsp_dir" -name '*.wsp' | sed -e "s#$wsp_dir/##" -e "s/\.wsp$//")

It relies on the recently introduced asynchronous inserts feature of influx-cli - which commits inserts in batches to improve the speed - and the whisper-fetch tool.
You could probably also write a Go program using the unofficial whisper-go bindings and the influxdb Go client library, but I wanted to keep it simple, especially once I found out that whisper-fetch is not the bottleneck: starting whisper-fetch and reading out - in my case - 360 datapoints of a file always takes about 50ms, whereas InfluxDB at first only needed a few ms to flush hundreds of records, but that soon increased to seconds.
Maybe it's a bug in my code; I didn't test this much, because I didn't need to. But people keep asking for a tool, so here you go. Try it out, and maybe you can fix a bug somewhere; something about the write performance here must be wrong.

A more powerful carbon-relay-ng

carbon-relay-ng received a bunch of love and has been a great help in my graphite+influxdb experiments.

Here's what changed:
  • First I made it so that you can adjust routes at runtime while data is flowing through, via a telnet interface.
  • Then Paul O'Connor built an embedded web interface to manage your routes in an easier and prettier way (pictured above)
  • The relay now also emits performance metrics via statsd (I want to make this better by using go-metrics which will hopefully get expvar support at some point - any takers?).
  • Last but not least, I borrowed the diskqueue code from NSQ, so now we can also spool to disk to bridge downtime of endpoints and re-fill them when they come back up.
Besides our metrics storage, I also plan to put our anomaly detection (currently playing with heka and kale) and carbon-tagger behind the relay, centralizing all routing logic, making things more robust, and simplifying our system design. The spooling should also help when deploying to our metrics gateways at other datacenters, to bridge outages of the datacenter interconnects.

I used to think of carbon-relay-ng as the python carbon-relay on steroids; now it reminds me more of something like nsqd, but with the ability to make packet routing decisions by introspecting the carbon protocol,
or perhaps of Kafka, but much simpler, single-node (no HA), and optimized for the domain of carbon streams.
I'd like the HA stuff though, which is why I spend some of my spare time figuring out the intricacies of the increasingly popular raft consensus algorithm. It seems opportune to have a simpler Kafka-like thing, in Go, using raft, for carbon streams. (note: InfluxDB might introduce such a component, so I'm also a bit waiting to see what they come up with)

Reminder: notably missing from carbon-relay-ng are round robin and sharding. I believe sharding/round robin/etc should be part of a broader HA design of the storage system, as I explained in On Graphite, Whisper and InfluxDB. That said, both should be fairly easy to implement in carbon-relay-ng, and I'm willing to assist those who want to contribute them.