It’s the 5th Anniversary of DevOps

I’ve been proud to have played a part in the rise of the global phenomenon known as DevOps Days. If you aren’t aware of the history of the DevOps movement, it traces its roots (and name) directly back to the first DevOps Days event, organized by Patrick Debois, in Ghent, Belgium.

For its 5th anniversary, DevOps Days is returning to Ghent. John Willis recorded these short interviews with some of the original attendees to commemorate the upcoming milestone event.

The post It’s the 5th Anniversary of DevOps appeared first on dev2ops.

Hacking out an Openshift app

I had an itch to scratch, and I wanted to get a bit more familiar with Openshift. I had used it in the past, but it was time to have another go. The app and the code are now available. Feel free to check out:


This is a simple app that takes the URL of a markdown file on GitHub, and outputs a pandoc converted PDF. I wanted to use pandoc specifically, because it produces PDFs that are beautifully typeset with LaTeX. To embed a link in your upstream documentation that points to a PDF, just append the file’s URL to this app’s URL, under a /pdf/ path. For example:


will send you to a PDF of the puppet-gluster documentation. This will make it easier to accept questions as FAQ patches, without needing to constantly update a binary PDF that is committed to git.
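To make the URL scheme concrete, here is a minimal Python sketch of how such a link could be built and how the app could recover the markdown URL from the request path. The base URL, the function names and the choice to keep the scheme in the appended URL are all assumptions for illustration; they are not taken from the pdfdoc source.

```python
from urllib.parse import urlsplit

# Hypothetical base URL; the real app's address is not given here.
APP_BASE = "http://pdfdoc.example.com"


def pdf_link(markdown_url, app_base=APP_BASE):
    """Append the markdown file's URL to the app's URL, under a /pdf/ path."""
    return app_base.rstrip("/") + "/pdf/" + markdown_url


def markdown_url_from_path(path):
    """Recover the markdown URL from a request path, or None if it does not fit."""
    prefix = "/pdf/"
    if not path.startswith(prefix):
        return None
    candidate = path[len(prefix):]
    parts = urlsplit(candidate)
    # Only accept something that looks like an absolute http(s) URL.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return candidate
```

A link produced by pdf_link() is what you would paste into your upstream documentation; markdown_url_from_path() is the inverse step the app would need before handing the file to pandoc.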

If you want to hear more about what I did, read on…

The setup:

Start by getting a free Openshift account. You’ll also want to install the client tools. Nothing is worse than having to interact with your app via a web interface. Hackers use terminals. Luckily, the Openshift team knows this, and they’ve created a great command line tool called rhc to make it all possible.

I started by following their instructions:

$ sudo yum install rubygem-rhc
$ sudo gem update rhc

Unfortunately, this left me with a problem:

$ rhc
/usr/share/rubygems/rubygems/dependency.rb:298:in `to_specs': Could not find 'rhc' (>= 0) among 37 total gem(s) (Gem::LoadError)
    from /usr/share/rubygems/rubygems/dependency.rb:309:in `to_spec'
    from /usr/share/rubygems/rubygems/core_ext/kernel_gem.rb:47:in `gem'
    from /usr/local/bin/rhc:22:in `<main>'

I solved this by running:

$ gem install rhc

This makes my per-user rhc take precedence over the system one. Then run:

$ rhc setup

and the rhc client will take you through some setup steps such as uploading your public ssh key to the Openshift infrastructure. The beauty of this tool is that it will work with the Red Hat hosted infrastructure, or you can use it with your own infrastructure if you want to host your own Openshift servers. This alone means you’ll never get locked in to a third-party provider’s terms or pricing.

Create a new app:

To get a fresh python 3.3 app going, you can run:

$ rhc create-app <appname> python-3.3

From this point on, it’s fairly straightforward, and you can now hack your way through the app in python. To push a new version of your app into production, it’s just a git commit away:

$ git add -p && git commit -m 'Awesome new commit...' && git push && rhc tail
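The python cartridge serves the app through a WSGI entry point in wsgi.py (the deployment log further down mentions it). For readers who have not written one before, a minimal sketch could look like the following. It only illustrates the WSGI calling convention; the greeting and the handling of the path are made up for the example and are not what pdfdoc does.

```python
def application(environ, start_response):
    """Minimal WSGI entry point: answer every request with a plain-text line."""
    # PATH_INFO holds the request path, for example "/pdf/...".
    path = environ.get("PATH_INFO", "/")
    body = ("Hello from %s\n" % path).encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    # A WSGI app returns an iterable of byte strings.
    return [body]
```

To try it locally before pushing, the standard library's wsgiref.simple_server can serve the same application object.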

Creating a new app from existing code:

If you want to push a new app from an existing code base, it’s as easy as:

$ rhc create-app awesomesauce python-3.3 --from-code https://github.com/purpleidea/pdfdoc
Application Options
Domain:      purpleidea
Cartridges:  python-3.3
Source Code: https://github.com/purpleidea/pdfdoc
Gear Size:   default
Scaling:     no

Creating application 'awesomesauce' ... done

Waiting for your DNS name to be available ... done

Cloning into 'awesomesauce'...
The authenticity of host 'awesomesauce-purpleidea.rhcloud.com (' can't be established.
RSA key fingerprint is 00:11:22:33:44:55:66:77:88:99:aa:bb:cc:dd:ee:ff.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'awesomesauce-purpleidea.rhcloud.com,' (RSA) to the list of known hosts.

Your application 'awesomesauce' is now available.

  URL:        http://awesomesauce-purpleidea.rhcloud.com/
  SSH to:     00112233445566778899aabb@awesomesauce-purpleidea.rhcloud.com
  Git remote: ssh://00112233445566778899aabb@awesomesauce-purpleidea.rhcloud.com/~/git/awesomesauce.git/
  Cloned to:  /home/james/code/awesomesauce

Run 'rhc show-app awesomesauce' for more details about your app.

In my case, my app also needs some binaries installed. I haven’t yet automated this process, but I think it can be done by creating a custom cartridge. Help to do this would be appreciated!

Updating your app:

In the case of an app that I already deployed with this method, updating it from the upstream source is quite easy. You just pull down any relevant commits, and then push them up to your app’s git repo:

$ git pull upstream master 
From https://github.com/purpleidea/pdfdoc
 * branch            master     -> FETCH_HEAD
Updating 5ac5577..bdf9601
 wsgi.py | 2 --
 1 file changed, 2 deletions(-)
$ git push origin master 
Counting objects: 7, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 312 bytes | 0 bytes/s, done.
Total 3 (delta 2), reused 0 (delta 0)
remote: Stopping Python 3.3 cartridge
remote: Waiting for stop to finish
remote: Waiting for stop to finish
remote: Building git ref 'master', commit bdf9601
remote: Activating virtenv
remote: Checking for pip dependency listed in requirements.txt file..
remote: You must give at least one requirement to install (see "pip help install")
remote: Running setup.py script..
remote: running develop
remote: running egg_info
remote: creating pdfdoc.egg-info
remote: writing pdfdoc.egg-info/PKG-INFO
remote: writing dependency_links to pdfdoc.egg-info/dependency_links.txt
remote: writing top-level names to pdfdoc.egg-info/top_level.txt
remote: writing manifest file 'pdfdoc.egg-info/SOURCES.txt'
remote: reading manifest file 'pdfdoc.egg-info/SOURCES.txt'
remote: writing manifest file 'pdfdoc.egg-info/SOURCES.txt'
remote: running build_ext
remote: Creating /var/lib/openshift/00112233445566778899aabb/app-root/runtime/dependencies/python/virtenv/venv/lib/python3.3/site-packages/pdfdoc.egg-link (link to .)
remote: pdfdoc 0.0.1 is already the active version in easy-install.pth
remote: Installed /var/lib/openshift/00112233445566778899aabb/app-root/runtime/repo
remote: Processing dependencies for pdfdoc==0.0.1
remote: Finished processing dependencies for pdfdoc==0.0.1
remote: Preparing build for deployment
remote: Deployment id is 9c2ee03c
remote: Activating deployment
remote: Starting Python 3.3 cartridge (Apache+mod_wsgi)
remote: Application directory "/" selected as DocumentRoot
remote: Application "wsgi.py" selected as default WSGI entry point
remote: -------------------------
remote: Git Post-Receive Result: success
remote: Activation status: success
remote: Deployment completed with status: success
To ssh://00112233445566778899aabb@awesomesauce-purpleidea.rhcloud.com/~/git/awesomesauce.git/
   5ac5577..bdf9601  master -> master

Final thoughts:

I hope this helped you get going with Openshift. Feel free to send me patches!

Happy hacking!


Supporting Millions of Pretty URL Rewrites in Nginx with Lua and Redis

About a year ago, I was tasked with greatly expanding our URL rewrite capabilities. Our file based, nginx rewrites were becoming a performance bottleneck and we needed to make an architectural leap that would take us to the next level of SEO wizardry.

In comparison to the total number of product categories in our database, Stylight supports a handful of “pretty URLs” – those understandable by a human being. Take http://www.stylight.com/Sandals/Women/ – pretty obvious what’s going to be on that page, right?

Our web application, however, only understands http://www.stylight.com/search.action?gender=women&tag=10580&tag=10630. So, nginx needs to translate pretty URLs into something our app can find, fetch and return to your page. And this needs to happen as fast as computationally possible. Click on that link and you’ll notice we redirect you to the pretty URL. This is because we’ve found out women really love sandals so we want to give them a page they’d like to bookmark.

We import and update millions of products a day, so the vast majority of our links start out as “?tag=10580”. Googlebot knows how dynamic our site is, so it’s constantly crawling and indexing these functional links to feed its search results. As we learn from our users and ad campaigns which products are really interesting, we dynamically assign pretty URLs and inform Google with 301 redirects.

This creates 2 layers of redirection and doubles the urls our webserver needs to know about:

  • 301 redirects for the user (and search engines): ?gender=women&tag=10580&tag=10630 -> /Sandals/Women/
  • internal rewrites for our app: /Sandals/Women/ -> ?gender=women&tag=10580&tag=10630
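The two layers above can be sketched in a few lines of Python. This is only an illustration of the lookup logic for requests arriving from outside, not the Lua code we run; the table names, the resolve() function and the idea of sorting query parameters into a canonical key are assumptions made for the example.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical in-memory table; pretty URL -> what the app understands.
REWRITES = {
    "/Sandals/Women/": "/search.action?gender=women&tag=10580&tag=10630",
}


def canonical_query(query):
    """Sort query parameters so equivalent queries share one lookup key."""
    return urlencode(sorted(parse_qsl(query)))


# The reverse direction: functional query -> pretty URL (for the 301s).
REDIRECTS = {
    canonical_query(target.split("?", 1)[1]): pretty
    for pretty, target in REWRITES.items()
}


def resolve(path, query=""):
    """Decide what to do with an external request.

    Returns (action, target): a 301 redirect to the pretty URL, an internal
    rewrite to the functional URL, or a pass-through of the original request.
    """
    if path == "/search.action":
        pretty = REDIRECTS.get(canonical_query(query))
        if pretty is not None:
            return ("redirect 301", pretty)
    target = REWRITES.get(path)
    if target is not None:
        return ("rewrite", target)
    return ("pass", path + ("?" + query if query else ""))
```

Note that every pretty URL costs two entries, one per direction, which is why the number of URLs the webserver needs to know about doubles.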

So, how can we provide millions of pretty URLs to showcase all facets of our product search results?

The problem with file based, nginx rewrites: memory & reload times

With 800K rewrites and redirects (or R&Rs for short) in over 12 country rewrite.conf files, our “next level” initially means about 8 million R&R URLs. But we could barely cope with our current requirements.

File based R&Rs are loaded into memory for all 16 nginx workers. Besides using 3GB of RAM, it took almost 5 seconds just to reload or restart nginx! As a quick test, I doubled the amount of rewrites for one country. 20 seconds later nginx was successfully reloaded and running with 3.5GB of memory. Talk about “scale fail”.

What are the alternatives?

Google searching for nginx with millions of rewrites or redirects didn’t give a whole lot of insight, but digging through what I found eventually led me to OpenResty. Not being a full-time sysadmin, I don’t care to build and maintain custom binaries.

My next search for OpenResty on Ubuntu Trusty led me to lua-nginx-redis – perhaps not the most performant solution, but I’d take the compromise for community supported patches. A sudo apt-get install lua-nginx-redis gave us the basis for our new architecture.

As an initial test, I copied our largest country’s rewrites into redis, made a quick lua script for handling the rewrites and made my first head-to-head test:

I included network round trip times in my test to get an idea of the complete performance improvement we hoped to realize with this re-architecture. Interestingly, quite a few URLs (those towards the bottom of the rewrite file) caused significant spikes in response times. From these initial results, we decided to make the investment and completely overhaul our rewrite and redirect infrastructure.

The 301 redirects lived exclusively on the frontend load balancers while the internal rewrites were handled by our app servers. First order of business would be to combine these, leaving the application to concentrate on just serving requests. Next, we set up a cronjob to incrementally update R&Rs every 5 minutes. I gave the R&Rs a TTL of one month to keep the redis db tidy. Weekly, we run a full insert which resets the TTL. And, yes, we monitor the TTLs of our R&Rs – don’t want all of them disappearing overnight!
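The update scheme is easier to see in code. The sketch below uses a tiny in-memory stand-in for redis so that it runs on its own; in redis itself the same idea maps to writing keys with an expiry and reading the remaining time with TTL. The class, the function names and the 30-day month are assumptions for illustration, not our production scripts.

```python
import time

MONTH = 30 * 24 * 3600  # "one month" taken as 30 days (assumption)


class TTLStore:
    """Tiny in-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl=MONTH):
        # Writing a key always (re)starts its TTL.
        self._data[key] = (value, self._clock() + ttl)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries vanish on access
            return None
        return value

    def ttl(self, key):
        """Seconds left before the key expires, or None if it is missing."""
        item = self._data.get(key)
        if item is None:
            return None
        return max(0, item[1] - self._clock())


def incremental_update(store, changed):
    """Every few minutes: write only the R&Rs that changed."""
    for key, value in changed.items():
        store.set(key, value)


def full_insert(store, everything):
    """Weekly: write every R&R again, which resets every TTL."""
    for key, value in everything.items():
        store.set(key, value)


def expiring_soon(store, keys, threshold):
    """Monitoring hook: keys that are missing or about to expire."""
    return [k for k in keys if (store.ttl(k) or 0) < threshold]
```

Stale R&Rs that no longer appear in the weekly full insert simply age out after a month, while expiring_soon() is the kind of check that warns you if the full insert stops running.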

The performance of Lua and Redis

We launched the new solution in the middle of July this year – just over three months ago.
And our average response time during the same period:

As you can see, despite rapidly growing traffic, we saw the first significant improvements to our site’s response time just by moving the R&Rs out of files and into redis. Reload times for nginx are instant – there are no more rewrites for it to load and distribute per worker – and memory usage has dropped below 900MB.

Since the launch, we’ve doubled our number of R&Rs (check out how the memory scales):

Soon we’ll be able to serve all our URLs like http://www.stylight.com/Dark-Green/Long-Sleeve/T-Shirts/Gap/Men/ by default. No, we’re not quite there yet, but if you need that kinda shirt

We’ve got a lot of SEO work ahead of us which will require millions more rewrites. And now we have a performant architecture which will support it. If you have any questions or would like to know more details, don’t hesitate to contact me @danackerson.

Continuous integration for Puppet modules

I just patched puppet-gluster and puppet-ipa to bring their infrastructure up to date with the current state of affairs…

What’s new?

  • Better README’s
  • Rake syntax checking (fewer oopsies)
  • CI (testing) with travis on git push (automatic testing for everyone)
  • Use of .pmtignore to ignore files from puppet module packages (finally)
  • Pushing modules to the forge with blacksmith (sweet!)

This last point deserves another mention. Puppetlabs created the “forge” to try to provide some sort of added value to their stewardship. Personally, I like to look for code on github instead, but nevertheless, some do use the forge. The problem is that to upload new releases, you need to click your mouse like a windows user! Someone has finally solved that problem! If you use blacksmith, a new build is just a rake push away!

Have a look at this example commit if you’re interested in seeing the plumbing.

Better documentation and FAQ answering:

I’ve answered a lot of questions by email, but this only helps out individuals. From now on, I’d appreciate if you asked your question in the form of a patch to my FAQ. (puppet-gluster, puppet-ipa)

I’ll review and merge your patch, including a follow-up patch with the answer! This way you’ll get more familiar with git and sending small patches, everyone will benefit from the response, and I’ll be able to point you to the docs (and even a specific commit) to avoid responding to already answered questions. You’ll also have the commit information of someone else who already had this problem. Cool, right?

Happy hacking,


A real whisper-to-InfluxDB program.

The whisper-to-influxdb migration script I posted earlier is pretty bad. A shell script, without concurrency, and an undiagnosed performance issue. I hinted that one could write a Go program using the unofficial whisper-go bindings and the influxdb Go client library. That's what I did now, it's at github.com/vimeo/whisper-to-influxdb. It uses configurable amounts of workers for both whisper fetches and InfluxDB commits, but it's still a bit naive in the sense that it commits to InfluxDB one series at a time, irrespective of how many records are in it. My series, and hence my commits, have at most 60k records, and presumably InfluxDB could handle a lot more per commit, so we might leverage better batching later. Either way, this way I can consistently commit about 100k series every 2.5 hours (or roughly 10/s), where each series has a few thousand points on average, with peaks up to 60k points. I usually play with 1 to 30 InfluxDB workers. Even though I've hit a few InfluxDB issues, this tool has enabled me to fill in gaps after outages and to do a restore from whisper after a complete database wipe.
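The shape of the tool (one pool of workers fetching from whisper, another committing to InfluxDB, one series per commit) can be sketched as below. The real program is written in Go; this is a Python illustration of the pattern only, and migrate(), fetch and commit are hypothetical names standing in for the whisper read and the InfluxDB write.

```python
import queue
import threading


def migrate(series_names, fetch, commit, fetch_workers=4, commit_workers=4):
    """Fetch each series with one pool of workers, commit it with another."""
    to_fetch = queue.Queue()
    to_commit = queue.Queue()
    committed = []
    lock = threading.Lock()

    def fetcher():
        while True:
            name = to_fetch.get()
            if name is None:  # sentinel: no more work
                break
            to_commit.put((name, fetch(name)))

    def committer():
        while True:
            item = to_commit.get()
            if item is None:  # sentinel: no more work
                break
            name, points = item
            # One series per commit, however many points it holds.
            commit(name, points)
            with lock:
                committed.append(name)

    fetchers = [threading.Thread(target=fetcher) for _ in range(fetch_workers)]
    committers = [threading.Thread(target=committer) for _ in range(commit_workers)]
    for t in fetchers + committers:
        t.start()
    for name in series_names:
        to_fetch.put(name)
    # Stop the fetchers first, then the committers once all data is queued.
    for _ in fetchers:
        to_fetch.put(None)
    for t in fetchers:
        t.join()
    for _ in committers:
        to_commit.put(None)
    for t in committers:
        t.join()
    return committed
```

Better batching would mean letting a committer combine several small series into one commit instead of always sending one at a time.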

Fixing dropbox “conflicted copy” problems

I usually avoid proprietary cloud services because of freedom, privacy and vendor lock-in concerns. In addition, there are some excellent libre (and hosted) services such as WordPress, Wikipedia and OpenShift which don’t have the above problems. Thirdly, there are everyday Free Software tools such as Fedora GNU/Linux, Libreoffice, and git-annex-assistant which make my computing much more powerful. Finally, there are some hosted services that I use that don’t lock me in because I use them as push-only mirrors, and I only interact with them using Free Software tools. The two examples are GitHub and Dropbox.

Today, Dropbox bit me. Here’s how I saved my data.

Dropbox integrates with GNOME’s nautilus to sync your data to their proprietary cloud hosting. I periodically run the dropbox client to sync any changes to my public files up to their servers. Today, the client decided that some of my newer files were older than the stored server-side versions, and promptly over-wrote my newer versions.

Thankfully I have real backups, and, to be fair, Dropbox actually renamed my newer files instead of blatantly clobbering them. My filesystem now looks like this:

$ tree files/
|-- bar
|-- baz
|   |-- file1
|   |-- file1 (james's conflicted copy 2014-09-29)
|   |-- file2 (james's conflicted copy 2014-09-29).sh
|   `-- file2.sh
`-- foo
    `-- magic.sh

You’ll note that my previously clean file system now has the “conflicted copy” versions everywhere. These are the good versions, whereas in the example above file1 and file2.sh are the older unwanted versions.
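The naming pattern is regular enough to reverse mechanically. As an illustration only, here is a small Python sketch that maps a “conflicted copy” name back to the file it should replace. The regular expression is my reading of the names in the tree above, and original_name() is a hypothetical helper, so check it against your own file names before trusting it.

```python
import re

# " (<name>'s conflicted copy YYYY-MM-DD)" as it appears in the tree above.
CONFLICT = re.compile(
    r" \((?P<name>.+?)'s conflicted copy (?P<date>\d{4}-\d{2}-\d{2})\)"
)


def original_name(path):
    """Return the path a conflicted copy should replace, or None if it is not one."""
    if CONFLICT.search(path) is None:
        return None
    # Remove only the inserted marker; the extension (if any) stays in place.
    return CONFLICT.sub("", path, count=1)
```

Note how the marker sits before the extension, so file2 (...).sh maps back to file2.sh.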

I spent some time with find and diff convincing myself that this was true, and eventually I wrote a script. The script looks through the current working directory for “conflicted copy” matches, saves the unwanted versions (just in case) and then clobbers them with the good “conflicted” version.

Please look through, edit, and understand this script before running it. It might not be what you want, and it was designed to only work for me. It is available as a gist, and below in the body of this article.

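$ cat fix-dropbox.sh
#!/bin/bash
# XXX: use at your own risk - do not run without understanding this first!
exit 1

# safety directory: the unwanted versions are saved here before being clobbered
# (placeholder path, pick your own; it must end with a slash)
BACKUP='/tmp/fix-dropbox/'

# TODO: detect or pick manually...
# (placeholder values, matching the example tree above)
NAME='james'
DATE='2014-09-29'

mkdir -p "$BACKUP"
find . -path "*(*'s conflicted copy [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]*" -print0 | while IFS= read -d '' -r file; do
    printf 'Found: %s\n' "$file"

    # TODO: detect or pick manually... (NAME and DATE are set once above)

    # the exact marker that dropbox inserted into the file name
    STRING=" (${NAME}'s conflicted copy ${DATE})"
    #echo $STRING
    # strip the marker (as a literal string) to get the path of the unwanted version
    RESULT="${file/"$STRING"/}"
    #echo $RESULT
    # skip matches with a different name or date, otherwise we would mv a file onto itself
    [ "$RESULT" = "$file" ] && continue

    # save the unwanted version (just in case), mirroring the directory layout
    SAVE="$BACKUP"`dirname "$RESULT"`
    #echo $SAVE
    mkdir -p "$SAVE"
    cp "$RESULT" "$SAVE"
    # clobber the unwanted version with the good "conflicted" version
    mv "$file" "$RESULT"
done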


You can thank bash for saving your data. Stop bashing it and read this article instead.

Happy hacking,



InfluxDB as a graphite backend, part 2

The Graphite + InfluxDB series continues.

  • In part 1, "On Graphite, Whisper and InfluxDB" I described the problems of Graphite's whisper and ceres, why I disagree with common graphite clustering advice as being the right path forward, what a great timeseries storage system would mean to me, why InfluxDB - despite being the youngest project - is my main interest right now, and introduced my approach for combining both and leveraging their respective strengths: InfluxDB as an ingestion and storage backend (and at some point, realtime processing and pub-sub) and graphite for its renowned data processing-on-retrieval functionality. Furthermore, I introduced some tooling: carbon-relay-ng to easily route streams of carbon data (metrics datapoints) to storage backends, allowing me to send production data to Carbon+whisper as well as InfluxDB in parallel, graphite-api, the simpler Graphite API server, with graphite-influxdb to fetch data from InfluxDB.
  • Not Graphite related, but I wrote influx-cli which I introduced here. It allows you to easily interface with InfluxDB and measure the duration of operations, which will become useful for this article.
  • In the Graphite & Influxdb intermezzo I shared a script to import whisper data into InfluxDB and noted some write performance issues I was seeing, but the better part of the article described the various improvements done to carbon-relay-ng, which is becoming an increasingly versatile and useful tool.
  • In part 2, which you are reading now, I'm going to describe recent progress, share more info about my setup, testing results, state of affairs, and ideas for future work.

Progress made

  • InfluxDB saw two major releases:
    • 0.7 (and followups), which was mostly about some needed features and bug fixes
    • 0.8 was all about bringing some major refactorings into the hands of early adopters/testers: support for multiple storage engines, configurable shard spaces, rollups and retention schemes. There was some other useful stuff like speed and robustness improvements for the graphite input plugin (by yours truly) and various things like regex filtering for 'list series'. Note that a bunch of older bugs remained open throughout this release (most notably the broken derivative aggregator), and a bunch of new ones appeared. Maybe this is why the release was mostly in the dark. In this context, it's not so bad, because we let graphite-api do all the processing, but if you want to query InfluxDB directly you might hit some roadblocks.
    • An older fix, but worth mentioning: series names can now also contain any character, which means you can easily use metrics2.0 identifiers. This is a welcome relief after having struggled with Graphite's restrictions on metric keys.
  • graphite-api received various bug fixes and support for templating, statsd instrumentation and caching.
    Much of this was driven by graphite-influxdb: the caching allows us to cache metadata and the statsd integration gives us insights into the performance of the steps it goes through of building a graph (getting metadata from InfluxDB, querying InfluxDB, interacting with cache, post processing data, etc).
  • The progress on InfluxDB and graphite-api in turn enabled graphite-influxdb to become faster and simpler (note: graphite-influxdb requires InfluxDB 0.8). Furthermore you can now configure series resolutions (different retentions per series are still on the roadmap, see State of affairs and what's coming), and of course it also got a bunch of bugfixes.
Because of all these improvements, all involved components are now ready for serious use.

Putting it all together, with docker

Docker probably needs no introduction: it's a nifty tool to build an environment with given software installed, and it lets you easily deploy it and run it in isolation. graphite-api-influxdb-docker is a very creatively named project that generates the - also very creatively named - docker image graphite-api-influxdb, which contains graphite-api and graphite-influxdb, making it easy to hook in a customized configuration and get it up and running quickly. This is the recommended way to set this up, and this is what we run in production.

The setup

  • a server running InfluxDB and graphite-api with graphite-influxdb via the docker approach described above:
    dell PowerEdge R610
    24 x Intel(R) Xeon(R) X5660  @ 2.80GHz
    96GB RAM
    perc raid h700
    6x600GB seagate 10k rpm drives in raid10 = 1.6 TB, Adaptive Read Ahead, Write Back, 64 kB blocks, no read caching
    no sharding/shard spaces, compiled from git just before 0.8, using LevelDB (not rocksdb, which is now the default)
    LevelDB max-open-files = 10000 (lsof shows about 30k open files total for the InfluxDB process), LRU 4096m, everything else is default I think.
  • a server running graphite-web, carbon, and whisper:
    dell PowerEdge R710
    16 x Intel(R) Xeon(R) E5640  @ 2.67GHz
    96GB RAM
    perc raid h700
    8x150GB seagate 15k rpm in raid5 = 952 GB, Read Ahead, Write Back, 64 kB blocks, no read caching
    MAX_UPDATES_PER_SECOND = 1000  # to sequentialize writes
  • a relay server running carbon-relay-ng that sends the same production load into both. (about 2500 metrics/s, or 150k minutely)
As you can tell, on both machines RAM is vastly overprovisioned, and they have lots of cpu available (the difference in cores should be negligible), but the difference in RAID level is important to note: RAID 5 comes with a write penalty. Even though the whisper machine has more, and faster disks, it probably has a disadvantage for writes. Maybe. I haven't done raid stuff in a long time, and I haven't measured it.
Clearly you'll need to take the results with a grain of salt, as unfortunately I do not have 2 systems available with the same configuration and their baseline (raw) performance is unknown.
Note: no InfluxDB clustering, see State of affairs and what's coming.
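To illustrate what "sends the same production load into both" means, here is a minimal Python sketch of the fan-out idea. carbon-relay-ng itself is a Go program with routing, buffering and reconnection logic; relay() and the sink callables below are hypothetical and only show that every datapoint line, in carbon's plaintext "path value timestamp" format, goes to every backend.

```python
def format_datapoint(path, value, timestamp):
    """Render one datapoint in carbon's plaintext line format."""
    return "%s %s %d\n" % (path, value, timestamp)


def relay(lines, sinks):
    """Send every datapoint line to every sink; return how many lines were relayed."""
    count = 0
    for line in lines:
        for sink in sinks:
            sink(line)  # e.g. a socket's sendall, wrapped to accept text
        count += 1
    return count


# The load quoted above, as plain arithmetic: 2500 metrics/s is 150k per minute.
METRICS_PER_MINUTE = 2500 * 60
```

Because both backends receive identical input, any difference seen on the dashboards comes from the storage and API layers rather than from the data.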

The empirical validation & migration

Once everything was set up and I could confidently send 100% of traffic to InfluxDB via carbon-relay-ng, it was trivial to run our dashboards with a flag deciding which server to go to. This way I have literally been running our graphite dashboards next to each other, allowing us to compare both stacks on:
  • visual differences: after a bunch of work and bug fixing, we got to a point where both dashboards looked almost exactly the same. (note that graphite-api's implementation of certain functions can behave slightly differently, see for example this divideSeries bug)
  • speed differences by simply refreshing both pages and watching the PNGs load, with some assistance from firebug's network requests profiler. The difference here was big: graphs served up by graphite-api + InfluxDB loaded considerably faster. A page with 40 graphs or so would load in a few seconds instead of 20-30 seconds (on first as well as subsequent hits). This is for our default, 6-hour timeframe views. When cranking the timeframes up to a couple of weeks, graphite-api + InfluxDB was still faster.
Soon enough my colleagues started asking to make graphite-api + InfluxDB the default, as it was much faster in all common cases. I flipped the switch and everybody has been happy.

When loading a page with many dashboards, the InfluxDB machine will occasionally spike up to 500% cpu, though I rarely get to see any iowait (!), even after syncing the block cache (I just realized it'll probably still use the cache for reads after sync?)
The carbon/whisper machine, on the other hand, is always fighting iowait, which could be caused by the raid 5 write amplification but the random io due to the whisper format probably has more to do with it. Via the MAX_UPDATES_PER_SECOND I've tried to linearize writes, with mixed success. But I've never gone too deep into it. So basically comparing write performance would be unfair in these circumstances, and I am only comparing reads in these tests. Despite the different storage setups, the Linux block cache should make things fair for reads. Whisper's iowait will handicap the reads, but I always did successive runs with fully loaded PNGs to make sure the block cache was warm for reads.

A "slightly more professional" benchmark

I could have stopped here, but the validation above was not very scientific. I wanted to do a somewhat more formal benchmark, to measure read speeds (though I did not have much time so it had to be quick and easy).
I wanted to compare InfluxDB vs whisper, and specifically how performance scales as you play with parameters such as number of series, points per series, and time range fetched (i.e. amount of points). I posted the benchmark on the InfluxDB mailing list. Look there for all information. I just want to reiterate the conclusion here: I was surprised. Because of the results above, I had assumed that InfluxDB would perform reads noticeably quicker than whisper but this is not the case. (maybe because whisper reads are nicely sequential - it's mostly writes that suffer from the whisper format)
This very much contrasts with my earlier findings where the graphite-api+InfluxDB powered dashboards clearly take the lead. I have yet to figure out why this is. Maybe something to do with the performance of graphite-web vs graphite-api itself, gunicorn vs apache, worker configuration, or maybe InfluxDB only starts outperforming whisper as concurrency increases. Some more investigation is definitely needed!

Future benchmarks

The simple benchmark above was very easy to execute, as it only requires influx-cli and whisper-fetch (so you can easily check for yourself), but clearly there is a need to test more realistic scenarios with concurrent reads, and doing some write benchmarks would be nice too.
We should also look into cpu and memory usage. I have had the luxury of being able to completely ignore memory usage, but others seem to notice excessive InfluxDB memory usage.
I would also like to see storage efficiency tests. Last time I checked, using LevelDB I was pretty close to 24B per record (which makes sense because time, seq_no and value are all 64bit values, and each record has those 3 fields). This was with snappy enabled, so it didn't seem to give much benefit. With whisper, I have files where the file size in bytes divided by total records comes down to 114, for others 31. I haven't looked much into it but it looks like InfluxDB is at least more storage efficient. Also, whisper explicitly encodes None values of course, whereas with InfluxDB those are implied (and require no space).
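The arithmetic behind those numbers is simple enough to write down. The Python sketch below only restates it; the daily estimate is my own back-of-the-envelope extrapolation from the ingest rate mentioned earlier, ignoring compression, indexes and any other overhead, and the helper names are made up for the example.

```python
FIELD_BYTES = 8          # time, seq_no and value are each 64-bit
FIELDS_PER_RECORD = 3
RECORD_BYTES = FIELD_BYTES * FIELDS_PER_RECORD  # 24 bytes per record


def bytes_per_record(file_size_bytes, total_records):
    """The ratio used above for whisper files: file size divided by records."""
    return file_size_bytes / total_records


def daily_storage_bytes(points_per_second, record_bytes=RECORD_BYTES):
    """Rough raw storage per day at a steady ingest rate (no compression assumed)."""
    return points_per_second * 86400 * record_bytes
```

At 2500 points per second and 24 bytes per record this comes to 5,184,000,000 bytes, so a bit over 5 GB of raw records per day.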

Conclusion: many tests and benchmarks should happen, but I don't really have time to conduct them. Hopefully other people in the community will take this on.

State of affairs and what's coming

  • InfluxDB typically performs pretty well, but not in all cases. More validation is needed. It wouldn't surprise me at this point if tools like hbase/Cassandra/riak clearly outperform InfluxDB, as long as we keep in mind that InfluxDB is a young project. A year, or two, from now, it'll probably perform much better. (and then again, it's not all about raw performance. InfluxDB has other strengths)
  • A long time goal which is now a reality: You can use any Graphite dashboard on top of InfluxDB, as long as the data is stored in a graphite-compatible format. Again, the easiest to get running is via graphite-api-influxdb-docker. There are two issues to be mentioned, though:
  • With the 0.8 release out the door, the shard spaces/rollups/retention intervals feature will start stabilizing, so we can start supporting multiple retention intervals per metric
  • Because InfluxDB clustering is undergoing major changes, and because clustering is not a high priority for me, I haven't needed to worry about this. I'll probably only start looking at clustering somewhere in 2015 because I have more pressing issues.
  • Once the new clustering system and the storage subsystem have matured (sounds like a v1.0 ~ v1.2 to me) we'll get more speed improvements and robustness. Most of the integration work is done, it's just a matter of doing smaller improvements, bug fixes and waiting for InfluxDB to become better. Maintaining this stack aside, I personally will start focusing more on:
    • per-second resolution in our data feeds, and potentially storage
    • realtime (but basic) anomaly detection, realtime graphs for some key timeseries. Adrian Cockcroft had an inspirational piece in his Monitorama keynote about how alerts from timeseries should trigger within seconds.
    • Mozilla's awesome heka project (this heka video is great), which should help a lot with the above. Also looking at Etsy's kale stack for anomaly detection
    • metrics 2.0 and making sure metrics 2.0 works well with InfluxDB. Up to now I find the series / columns as a data model too limiting and arbitrary, it could be so much more powerful, ditto for the query language.
  • Can we do anything else to make InfluxDB (+graphite) faster? Yes!
    • Long term, of course, InfluxDB should have powerful enough processing functions and query syntax, so that we don't even need a graphite layer anymore.
    • A storage engine optimized for fixed intervals would probably help, we could have the timestamps implicit instead of explicit. And maybe making the sequence number field optional. Each of these fields currently consumes 1/3 of the record... The sequence number field is not only useless in the Graphite use case, I've also rarely seen people make use of this in other use cases. Not storing the values as 64bit floats would help too. Finally we could have InfluxDB fill in None values without it doing "group by" (timeframe consolidation)
    • Then of course, there are projects to replace graphite-web/graphite-api with a Go codebase: graphite-ng and carbonapi. The latter is more production ready, but depends on some custom tooling and io using protobufs. But it performs an order of magnitude better than the python api server! I haven't touched graphite-ng in a while, but hopefully at some point I can take it up again.
  • Another thing to keep in mind when switching to graphite-api + InfluxDB: you lose the graphite composer. I have a few people relying on this, so I can either patch it to talk to graphite-api (meh), separate it out (meh) or replace it with a nicer dashboard like tessera, grafana or descartes. (or Graph-Explorer, but it can be a bit too much of a paradigm shift).
  • some more InfluxDB stuff I'm looking forward to:
    • binary protocol and result streaming (faster communication and responses!) (the latter might not get implemented though)
    • "list series" speed improvements (if metadata querying gets fast enough, we won't need ES anymore for metrics2.0 index)
    • InfluxDB instrumentation so we actually start getting an idea of what's going on in the system, a lot of the testing and troubleshooting is still in the dark.
  • Tracking exceptions in graphite-api is much harder than it should be. Currently there's no way to display exceptions to the user (in the http response) or to even log them. So sometimes you'll get http 500 responses and don't know why. You can use the sentry integration which works all right, but is clunky. Hopefully this will be addressed soon.
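
To make the fixed-interval storage idea from the list above a bit more concrete, here is a minimal Python sketch. It compares a record layout with an explicit timestamp and sequence number per point against a layout that stores the start time and step once and then only the values. The byte layouts, the header format and the function names are assumptions made up for this illustration; this is not InfluxDB's actual storage format.

```python
import struct

# Hypothetical layouts, for illustration only (not InfluxDB's on-disk format).
EXPLICIT = struct.Struct("<qqd")  # timestamp, sequence number, 64-bit float value
HEADER = struct.Struct("<qqI")    # start timestamp, step in seconds, point count


def pack_explicit(start, step, values):
    """Store every point as a full (timestamp, sequence, value) record."""
    return b"".join(
        EXPLICIT.pack(start + i * step, i, v) for i, v in enumerate(values)
    )


def pack_implicit(start, step, values):
    """Store start and step once, then only the values as 32-bit floats."""
    return HEADER.pack(start, step, len(values)) + struct.pack(
        "<%df" % len(values), *values
    )


def unpack_implicit(blob):
    """Rebuild (timestamp, value) pairs; timestamps are derived, not stored."""
    start, step, count = HEADER.unpack_from(blob, 0)
    values = struct.unpack_from("<%df" % count, blob, HEADER.size)
    return [(start + i * step, v) for i, v in enumerate(values)]


values = [0.5 * i for i in range(360)]  # 360 datapoints at a 10 second interval
explicit = pack_explicit(1410933600, 10, values)
implicit = pack_implicit(1410933600, 10, values)
print(len(explicit), len(implicit))  # prints: 8640 1460
```

In this toy layout the explicit form spends two thirds of every record on the timestamp and sequence number, which is the overhead the fixed-interval idea would remove. The price is that irregularly spaced points no longer fit, and 32-bit floats lose precision.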


The graphite-influxdb stack works and is ready for general consumption. It's easy to install and operate, and performs well. It is expected that InfluxDB will over time mature and ultimately meet all my requirements of the ideal backend. It definitely has a long way to go. More benchmarks and tests are needed. Keep in mind that we're not doing large volumes of metrics. For small/medium shops this solution should work well, but on larger scales you will definitely run into issues. You might conclude that InfluxDB is not for you (yet) (there are alternative projects, after all).

Finally, a closing thought:
Having graphs and dashboards that look nice and load fast is a good thing to have, but keep in mind that graphs and dashboards should be a last resort. It's a solution if all else fails. The fewer graphs you need, the better you're doing.
How can you avoid needing graphs? Automatic alerting on your data.

I see graphs as a temporary measure: they provide headroom while you develop an understanding of the operational behavior of your infrastructure, conceive a model of it, and implement the alerting you need to do troubleshooting and capacity planning. Of course, this process consumes more resources (time and otherwise), and these expenses are not always justifiable, but I think this is the ideal case we should be working towards.

Either way, good luck and have fun!

Graphite & Influxdb intermezzo: migrating old data and a more powerful carbon relay

Migrating data from whisper into InfluxDB

"How do i migrate whisper data to influxdb" is a question that comes up regularly, and I've always replied it should be easy to write a tool to do this. I personally had no need for this, until a recent small influxdb outage where I wanted to sync data from our backup server (running graphite + whisper) to influxdb, so I wrote a script:
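#!/bin/bash
# NOTE: wsp_dir and db were not given in the listing; the values below are
# placeholders, adjust them for your own environment.
# whisper dir without trailing slash.
wsp_dir=/opt/graphite/storage/whisper
db=graphite
start=$(date -d 'sep 17 6am' +%s)
end=$(date -d 'sep 17 12pm' +%s)
# named pipe that feeds one long-running influx-cli process
pipe_path=$(mktemp -u)
mkfifo "$pipe_path"
function influx_updater() {
    influx-cli -db "$db" -async < "$pipe_path"
}
influx_updater &
while read -r wsp; do
  # turn the relative path into a dotted series name (a/b/c -> a.b.c)
  series=$(basename "${wsp//\//.}" .wsp)
  echo "updating $series ..."
  # drop empty datapoints; appending 000 turns seconds into milliseconds
  whisper-fetch.py --from="$start" --until="$end" "$wsp_dir/$wsp.wsp" | grep -v 'None$' | awk -v series="$series" '{print "insert into \"" series "\" values (" $1 "000,1," $2 ")"}' > "$pipe_path"
done < <(find "$wsp_dir" -name '*.wsp' | sed -e "s#$wsp_dir/##" -e 's/\.wsp$//')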

It relies on the recently introduced asynchronous inserts feature of influx-cli - which commits inserts in batches to improve the speed - and the whisper-fetch tool.
You could probably also write a Go program using the unofficial whisper-go bindings and the influxdb Go client library. But I wanted to keep it simple. Especially when I found out that whisper-fetch is not a bottleneck: starting whisper-fetch, and reading out - in my case - 360 datapoints of a file always takes about 50ms, whereas InfluxDB at first only needed a few ms to flush hundreds of records, but that soon increased to seconds.
Maybe it's a bug in my code, I didn't test this much, because I didn't need to; but people keep asking for a tool so here you go. Try it out and maybe you can fix a bug somewhere. Something about the write performance here must be wrong.
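
For readers who find the shell pipeline hard to follow, here is a rough Python sketch of just the conversion step: turning a whisper file path into a series name, and turning whisper-fetch output into insert statements. The function names and the sample data are made up for illustration, and the statement format (including the constant 1) is simply copied from the script above. It does not talk to InfluxDB or read whisper files.

```python
def series_name(relative_path):
    """Map a path relative to the whisper dir to a dotted series name."""
    if relative_path.endswith(".wsp"):
        relative_path = relative_path[: -len(".wsp")]
    return relative_path.replace("/", ".")


def to_inserts(series, fetch_output):
    """Yield one insert statement per non-empty datapoint.

    fetch_output is text in the shape whisper-fetch.py prints:
    one "<unix timestamp> <value>" pair per line.
    """
    for line in fetch_output.splitlines():
        fields = line.split()
        if len(fields) != 2 or fields[1] == "None":
            continue  # same effect as grep -v 'None$'
        timestamp, value = fields
        # Appending "000" turns a timestamp in seconds into milliseconds.
        yield 'insert into "%s" values (%s000,1,%s)' % (series, timestamp, value)


# Made-up sample output for a hypothetical servers/web1/load.wsp file.
sample = "1410933600\t12.000000\n1410933660\tNone\n1410933720\t15.500000\n"
for statement in to_inserts(series_name("servers/web1/load.wsp"), sample):
    print(statement)
```

The middle datapoint is skipped because its value is None, so only two statements are printed.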

A more powerful carbon-relay-ng

carbon-relay-ng received a bunch of love and has been a great help in my graphite+influxdb experiments.

Here's what changed:
  • First I made it so that you can adjust routes at runtime while data is flowing through, via a telnet interface.
  • Then Paul O'Connor built an embedded web interface to manage your routes in an easier and prettier way (pictured above)
  • The relay now also emits performance metrics via statsd (I want to make this better by using go-metrics which will hopefully get expvar support at some point - any takers?).
  • Last but not least, I borrowed the diskqueue code from NSQ so now we can also spool to disk to bridge downtime of endpoints and re-fill them when they come back up
Besides our metrics storage, I also plan to put our anomaly detection (currently playing with heka and kale) and carbon-tagger behind the relay, centralizing all routing logic, making things more robust, and simplifying our system design. The spooling should also help to deploy to our metrics gateways at other datacenters, to bridge outages of datacenter interconnects.

I used to think of carbon-relay-ng as the python carbon-relay but on steroids. Now it reminds me more of something like nsqd but with an ability to make packet routing decisions by introspecting the carbon protocol, or perhaps Kafka but much simpler, single-node (no HA), and optimized for the domain of carbon streams.
I'd like the HA stuff though, which is why I spend some of my spare time figuring out the intricacies of the increasingly popular raft consensus algorithm. It seems opportune to have a simpler Kafka-like thing, in Go, using raft, for carbon streams. (note: InfluxDB might introduce such a component, so I'm also waiting a bit to see what they come up with)

Reminder: notably missing from carbon-relay-ng is round robin and sharding. I believe sharding/round robin/etc should be part of a broader HA design of the storage system, as I explained in On Graphite, Whisper and InfluxDB. That said, both should be fairly easy to implement in carbon-relay-ng, and I'm willing to assist those who want to contribute it.

Introducing: Oh My Vagrant!

If you’re a reader of my code or of this blog, it’s no secret that I hack on a lot of puppet and vagrant. Recently I’ve fooled around with a bit of docker, too. I realized that the vagrant environments I built for puppet-gluster and puppet-ipa needed to be generalized, and they needed new features too. Therefore…

Introducing: Oh My Vagrant!

Oh My Vagrant is an attempt to provide an easy to use development environment so that you can be up and hacking quickly, and focusing on the real devops problems. The README explains my choice of project name.


I use a Fedora 20 laptop with vagrant-libvirt. Efforts are underway to create an RPM of vagrant-libvirt, but in the meantime you’ll have to read: Vagrant on Fedora with libvirt (reprise). This should work with other distributions too, but I don’t test them very often. Please step up and help test :)

The bits:

First clone the oh-my-vagrant repository and look inside:

git clone --recursive https://github.com/purpleidea/oh-my-vagrant
cd oh-my-vagrant/vagrant/

The included Vagrantfile is the current heart of this project. You’re welcome to use it as a template and edit it directly, or you can use the facilities it provides. I’d recommend starting with the latter, which I’ll walk you through now.

Getting started:

Start by running vagrant status (vs) and taking a look at the vagrant.yaml file that appears.

james@computer:/oh-my-vagrant/vagrant$ ls
Dockerfile  puppet/  Vagrantfile
james@computer:/oh-my-vagrant/vagrant$ vs
Current machine states:

template1                 not created (libvirt)

The Libvirt domain is not created. Run `vagrant up` to create it.
james@computer:/oh-my-vagrant/vagrant$ cat vagrant.yaml 
:domain: example.com
:image: centos-7.0
:sync: rsync
:puppet: false
:docker: false
:cachier: false
:vms: []
:namespace: template
:count: 1
:username: ''
:password: ''
:poolid: []
:repos: []

Here you’ll see the list of resultant machines that vagrant thinks is defined (currently just template1), and a bunch of different settings in YAML format. The values of these settings help define the vagrant environment that you’ll be hacking in.

Changing settings:

The settings exist so that your vagrant environment is dynamic and can be changed quickly. You can change the settings by editing the vagrant.yaml file. They will be used by vagrant when it runs. You can also change them at runtime with --vagrant-foo flags. Running a vagrant status will show you how vagrant currently sees the environment. Let’s change the number of machines that are defined. Note the location of the --vagrant-count flag and how it doesn’t work when positioned incorrectly.

james@computer:/oh-my-vagrant/vagrant$ vagrant status --vagrant-count=4
An invalid option was specified. The help for this command
is available below.

Usage: vagrant status [name]
    -h, --help                       Print this help
james@computer:/oh-my-vagrant/vagrant$ vagrant --vagrant-count=4 status
Current machine states:

template1                 not created (libvirt)
template2                 not created (libvirt)
template3                 not created (libvirt)
template4                 not created (libvirt)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.
james@computer:/oh-my-vagrant/vagrant$ cat vagrant.yaml 
:domain: example.com
:image: centos-7.0
:sync: rsync
:puppet: false
:docker: false
:cachier: false
:vms: []
:namespace: template
:count: 4
:username: ''
:password: ''
:poolid: []
:repos: []

As you can see in the above example, changing the count variable to 4 causes vagrant to see a possible four machines in the vagrant environment. You can change as many of these parameters at a time as you like by using the --vagrant- flags, or you can edit the vagrant.yaml file. The latter is much easier and more expressive, in particular for expressing complex data types. The former is much more powerful when building one-liners, such as:

vagrant --vagrant-count=8 --vagrant-namespace=gluster up gluster{1..8}

which should bring up eight hosts in parallel, named gluster1 to gluster8.
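
If you're curious how this kind of override can work, here is a small Python sketch of the idea. It is only an illustration: the real logic lives in the Ruby Vagrantfile, and the function names, the defaults dictionary and the type handling below are my assumptions, not the project's code. Leading --vagrant-key=value flags are peeled off the argument list and applied on top of the stored settings, and whatever remains is left for the normal subcommand.

```python
# Hypothetical subset of the settings; only string and integer values are
# handled here. Booleans and lists would need real parsing.
DEFAULTS = {"domain": "example.com", "image": "centos-7.0",
            "namespace": "template", "count": 1}


def split_overrides(argv, prefix="--vagrant-"):
    """Separate leading --vagrant-key=value flags from the rest of argv."""
    overrides, rest = {}, list(argv)
    while rest and rest[0].startswith(prefix) and "=" in rest[0]:
        key, value = rest.pop(0)[len(prefix):].split("=", 1)
        overrides[key] = value
    return overrides, rest


def apply_overrides(settings, overrides):
    """Return a copy of settings with overrides applied, keeping value types."""
    merged = dict(settings)
    for key, value in overrides.items():
        if key not in merged:
            raise KeyError("unknown setting: %s" % key)
        merged[key] = type(merged[key])(value)  # e.g. "8" -> 8 for count
    return merged


def machine_names(settings):
    """Numbered machines are named <namespace>1 .. <namespace><count>."""
    return ["%s%d" % (settings["namespace"], i)
            for i in range(1, settings["count"] + 1)]


argv = ["--vagrant-count=8", "--vagrant-namespace=gluster", "up"]
overrides, rest = split_overrides(argv)
settings = apply_overrides(DEFAULTS, overrides)
print(rest, machine_names(settings))
```

Because only leading flags are consumed in this sketch, a flag placed after the subcommand stays in the remaining arguments, where the subcommand's own option parser would reject it.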

Other VM’s:

Since one often wants to be more expressive in machine naming and heterogeneity of machine type, you can specify a list of machines to define in the vagrant.yaml file vms array. If you’d rather define these machines in the Vagrantfile itself, you can also set them up in the vms array defined there. It is empty by default, but it is easy to uncomment one of the many examples. These will be used as the defaults if nothing else overrides the selection in the vagrant.yaml file. I’ve uncommented a few to show you this functionality:

james@computer:/oh-my-vagrant/vagrant$ grep example[124] Vagrantfile 
    {:name => 'example1', :docker => true, :puppet => true, },    # example1
    {:name => 'example2', :docker => ['centos', 'fedora'], },    # example2
    {:name => 'example4', :image => 'centos-6', :puppet => true, },    # example4
james@computer:/oh-my-vagrant/vagrant$ rm vagrant.yaml # note that I remove the old settings
james@computer:/oh-my-vagrant/vagrant$ vs
Current machine states:

template1                 not created (libvirt)
example1                  not created (libvirt)
example2                  not created (libvirt)
example4                  not created (libvirt)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.
james@computer:/oh-my-vagrant/vagrant$ cat vagrant.yaml 
:domain: example.com
:image: centos-7.0
:sync: rsync
:puppet: false
:docker: false
:cachier: false
:vms:
- :name: example1
  :docker: true
  :puppet: true
- :name: example2
  :docker:
  - centos
  - fedora
- :name: example4
  :image: centos-6
  :puppet: true
:namespace: template
:count: 1
:username: ''
:password: ''
:poolid: []
:repos: []
james@computer:/oh-my-vagrant/vagrant$ vim vagrant.yaml # edit vagrant.yaml file...
james@computer:/oh-my-vagrant/vagrant$ cat vagrant.yaml 
:domain: example.com
:image: centos-7.0
:sync: rsync
:puppet: false
:docker: false
:cachier: false
:vms:
- :name: example1
  :docker: true
  :puppet: true
- :name: example4
  :image: centos-7.0
  :puppet: true
:namespace: template
:count: 1
:username: ''
:password: ''
:poolid: []
:repos: []
james@computer:/oh-my-vagrant/vagrant$ vs
Current machine states:

template1                 not created (libvirt)
example1                  not created (libvirt)
example4                  not created (libvirt)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.

The above output might seem a little long, but if you try these steps out in your terminal, you should get the hang of it fairly quickly. If you poke around in the Vagrantfile, you should see the format of the vms array. Each element in the array should be a dictionary, where the keys correspond to the flags you wish to set. Look at the examples if you need help with the formatting.
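
To illustrate how a list of dictionaries like this can combine with the global settings, here is a short Python sketch. It is a guess at the general shape rather than the actual Ruby code in the Vagrantfile: the function name is made up, and I'm assuming that image, puppet and docker are the keys a machine inherits unless it sets them itself.

```python
def resolve_machines(settings):
    """Expand the settings into one fully specified dict per machine."""
    inherited = ("image", "puppet", "docker")  # assumed per-machine overridable keys
    numbered = [{"name": "%s%d" % (settings["namespace"], i)}
                for i in range(1, settings["count"] + 1)]
    machines = []
    for vm in numbered + settings["vms"]:
        machine = {key: settings[key] for key in inherited}  # global defaults
        machine.update(vm)  # per-machine keys win
        machines.append(machine)
    return machines


settings = {
    "image": "centos-7.0", "puppet": False, "docker": False,
    "namespace": "template", "count": 1,
    "vms": [
        {"name": "example1", "docker": True, "puppet": True},
        {"name": "example2", "docker": ["centos", "fedora"]},
        {"name": "example4", "image": "centos-6", "puppet": True},
    ],
}
for machine in resolve_machines(settings):
    print(machine["name"], machine["image"], machine["puppet"], machine["docker"])
```

The numbered template machine comes first, followed by the named ones, which matches the order in the status output above.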

Other settings:

As you saw, other settings are available. There are a few notable ones that are worth mentioning. This will also help explain some of the other features that this Vagrantfile provides.

  • domain: This sets the domain part of each vm’s FQDN. The default is example.com, which should work for most environments, but you’re welcome to change this as you see fit.
  • network: This sets the network that is used for the vm’s. You should pick a network/cidr that doesn’t conflict with any other networks on your machine. This is particularly useful when you have multiple vagrant environments hosted off of the same laptop.
  • image: This is the default base image to use for each machine. It can be overridden per-machine in the vm’s list of dictionaries.
  • sync: This is the sync type used for vagrant. rsync is the default and works in all environments. If you’d prefer to fight with the nfs mounts, or try out 9p, both those options are available too.
  • puppet: This option enables or disables integration with puppet. It is possible to override this per machine. This functionality will be expanded in a future version of Oh My Vagrant.
  • docker: This option enables and lists the docker images to set up per vm. It is possible to override this per machine. This functionality will be expanded in a future version of Oh My Vagrant.
  • namespace: This sets the namespace that your Vagrantfile operates in. This value is used as a prefix for the numbered vm’s, as the libvirt network name, and as the primary puppet module to execute.

More on the docker option:

For now, if you specify a list of docker images, they will be automatically pulled into your vm environment. It is recommended that you pre-cache them in an existing base image to save bandwidth. Custom base vagrant images can easily be built with vagrant-builder, but this process is currently undocumented.

I’ll try to write up a post on this process if there are enough requests. To keep you busy in the meantime, I’ve published a CentOS 7 vagrant base image that includes docker images for CentOS and Fedora. It is being graciously hosted by the GlusterFS community.

What other magic does this all do?

There is a certain amount of magic glue that happens behind the scenes. Here’s a list of some of it:

  • Idempotent /etc/hosts based DNS
  • Easy docker base image installation
  • IP address calculations and assignment with ipaddr
  • Clever cleanup on ‘vagrant destroy’
  • Vagrant docker base image detection
  • Integration with Puppet

If you don’t understand what all of those mean, and you don’t want to go source diving, don’t worry about it! I will explain them in greater detail when it’s important, and hopefully for now everything “just works” and stays out of your way.
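
For the curious, here is a rough Python sketch of two of those items: calculating addresses inside a network/cidr, and writing /etc/hosts entries in a way that is safe to repeat. The Vagrantfile does this in Ruby with ipaddr; the network, the starting offset, the marker comment and the function names below are all assumptions invented for the example.

```python
import ipaddress

MARKER = "# managed (hypothetical marker)"


def assign_addresses(network, names, first_offset=100):
    """Give each machine the next address inside the given network/cidr."""
    net = ipaddress.ip_network(network)
    addresses = {}
    for i, name in enumerate(names):
        address = net.network_address + first_offset + i
        if address not in net:
            raise ValueError("network %s is too small" % network)
        addresses[name] = str(address)
    return addresses


def update_hosts(hosts_text, addresses, domain):
    """Replace the managed entries in hosts_text; running it twice changes nothing."""
    kept = [line for line in hosts_text.splitlines() if not line.endswith(MARKER)]
    managed = ["%s %s.%s %s %s" % (ip, name, domain, name, MARKER)
               for name, ip in sorted(addresses.items())]
    return "\n".join(kept + managed) + "\n"


addresses = assign_addresses("192.168.123.0/24", ["example1", "example4"])
once = update_hosts("127.0.0.1 localhost\n", addresses, "example.com")
twice = update_hosts(once, addresses, "example.com")
print(once)
print(once == twice)
```

Tagging the managed lines is one simple way to get idempotence: each run removes exactly the lines it wrote before and adds the current set, so anything else in the file is left alone.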

Future work:

There’s still a lot more that I have planned, and some parts of the Vagrantfile need clean up, but I figured I’d try and release this early so that you can get hacking right away. If it’s useful to you, please leave a comment and let me know.

Happy hacking,



Translations Between Domains: David Woods

One of the reasons I’ve continued to be more and more interested in Human Factors and Safety Science is that I found myself without many answers to the questions I have had in my career. Questions surrounding how organizations work, how people think and work with computers, how decisions get made under uncertainty, and how people cope with increasing amounts of complexity.

As a result, my journey took me deep into a world where I immediately saw connections — between concepts found in other high-tempo, high-consequence domains and my own world of software engineering and operations. One of the first connections was in Richard Cook’s How Complex Systems Fail, and it struck me so deeply I insisted that it get reprinted (with additions by Richard) into O’Reilly’s Web Operations book.

I simply cannot un-see these connections now, and the field of study keeps me going deeper. So deep that I felt I needed to get a degree. My goal with getting a degree in the topic is not just to satisfy my own curiosity, but also to explore these topics in sufficient depth to feel credible in thinking about them critically.

In software, the concept and sometimes inadvertent practice of “cargo cult engineering” is well known. I’m hoping to avoid that in my own translation(s) of what’s been found in human factors, safety science, and cognitive systems engineering, as they looked into domains like aviation, patient safety, or power plant operations. Instead, I’m looking to truly understand that work in order to know what to focus on in my own research as well as to understand how my domain is either similar (and in what ways?) or different (and in what ways?)

For example, just a hint of what sorts of questions I have been mulling over:

  • How does the concept of “normalization of deviance” manifest in web engineering? How does it relate to our concept of ‘technical debt’?
  • What organizational dynamics might be in play when it comes to learning from “successes” and “failures”?
  • What methods of inquiry can we use to better design interfaces that have functionality and safety and diagnosis support as their core? Or, are those goals in conflict? If so, how?
  • How can we design alerts to reduce noise and increase signal in a way that takes into account the context of the intended receiver of the alert? In other words, how can we teach alerts to know about us, instead of the other way around?
  • The Internet (including its technical, political, and cultural structures) has non-zero amounts of diversity, interdependence, connectedness, and adaptation, which by many measures constitutes a complex system.
  • How do successful organizations navigate trade-offs when it comes to decisions that may have unexpected consequences?

I’ve done my best to point my domain at some of these connections as I understand them, and the Velocity Conference has been one of the ways I’ve hoped to bring people “over the bridge” from Safety Science, Human Factors, and Cognitive Systems Engineering into software engineering and operations as it exists as a practice on Internet-connected resources. If you haven’t seen Dr. Richard Cook’s 2012 and 2013 keynotes, or Dr. Johan Bergstrom’s keynote, stop what you’re doing right now and watch them.

I’m willing to bet you’ll see connections immediately…

David Woods is one of the pioneers in these fields, and continues to be a huge influence on the way that I think about our domain and my own research (my thesis project relies heavily on some of his previous work). I can’t be happier that he’s speaking at Velocity in New York, which is coming up soon. (Pssst: if you register for it here, you can use the code “JOHN20” for a 20% discount.)

I have posted before (and likely will again) about a paper Woods contributed to, Common Ground and Coordination in Joint Activity (Klein, Feltovich, Bradshaw, & Woods, 2005) which in my mind might as well be considered the best explanation on what “devops” means to me, and what makes successful teams work. If you haven’t read it, do it now.


Dynamic Fault Management and Anomaly Response

I thought about listing all of Woods’ work that I’ve seen connections in thus far, but then I realized that if I wasn’t careful, I’d be writing a literature review and not a blog post. :) Also, I have thesis work to do. So for now, I’d like to point only at two concepts that struck me as absolutely critical to the day-to-day of many readers of this blog, dynamic fault management and anomaly response.

Woods sheds some light on these topics in Joint Cognitive Systems: Patterns in Cognitive Systems Engineering. Pay particular attention to the characteristics of these phenomena:

“In anomaly response, there is some underlying process, an engineered or physiological process which will be referred to as the monitored process, whose state changes over time. Faults disturb the functions that go on in the monitored process and generate the demand for practitioners to act to compensate for these disturbances in order to maintain process integrity—what is sometimes referred to as “safing” activities. In parallel, practitioners carry out diagnostic activities to determine the source of the disturbances in order to correct the underlying problem.

Anomaly response situations frequently involve time pressure, multiple interacting goals, high consequences of failure, and multiple interleaved tasks (Woods, 1988; 1994). Typical examples of fields of practice where dynamic fault management occurs include flight deck operations in commercial aviation (Abbott, 1990), control of space systems (Patterson et al., 1999; Mark, 2002), anesthetic management under surgery (Gaba et al., 1987), terrestrial process control (Roth, Woods & Pople, 1992), and response to natural disasters.” (Woods & Hollnagel, 2006, p.71)

Now look down at the distributed systems you’re designing and operating.

Look at the “runbooks” and postmortem notes that you have written in the hopes that they can help guide teams as they try to untangle the sometimes very confusing scenarios that outages can bring.

Does “safing” ring familiar to you?

Do you recognize managing “multiple interleaved tasks” under “time pressure” and “high consequences of failure”?

I think it’s safe to say that almost every Velocity Conference attendee would see connections here.

In How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands (Woods & Patterson, 1999), he introduces the concept of escalation, in terms of anomaly response:

The concept of escalation captures a dynamic relationship between the cascade of effects that follows from an event and the demands for cognitive and collaborative work that escalate in response (Woods, 1994). An event triggers the evolution of multiple interrelated dynamics.

  • There is a cascade of effects in the monitored process. A fault produces a time series of disturbances along lines of functional and physical coupling in the process (e.g., Abbott, 1990). These disturbances produce a cascade of multiple changes in the data available about the state of the underlying process, for example, the avalanche of alarms following a fault in process control applications (Reiersen, Marshall, & Baker, 1988).
  • Demands for cognitive activity increase as the problem cascades. More knowledge potentially needs to be brought to bear. There is more to monitor. There is a changing set of data to integrate into a coherent assessment. Candidate hypotheses need to be generated and evaluated. Assessments may need to be revised as new data come in. Actions to protect the integrity and safety of systems need to be identified, carried out, and monitored for success. Existing plans need to be modified or new plans formulated to cope with the consequences of anomalies. Contingencies need to be considered in this process. All these multiple threads challenge control of attention and require practitioners to juggle more tasks.
  • Demands for coordination increase as the problem cascades. As the cognitive activities escalate, the demand for coordination across people and across people and machines rises. Knowledge may reside in different people or different parts of the operational system. Specialized knowledge and expertise from other parties may need to be brought into the problem-solving process. Multiple parties may have to coordinate to implement activities aimed at gaining information to aid diagnosis or to protect the monitored process. The trouble in the underlying process requires informing and updating others – those whose scope of responsibility may be affected by the anomaly, those who may be able to support recovery, or those who may be affected by the consequences the anomaly could or does produce.
  • The cascade and escalation is a dynamic process. A variety of complicating factors can occur, which move situations beyond canonical, textbook forms. The concept of escalation captures this movement from canonical to nonroutine to exceptional. The tempo of operations increases following the recognition of a triggering event and is synchronized by temporal landmarks that represent irreversible decision points.

When I read…

“These disturbances produce a cascade of multiple changes in the data available about the state of the underlying process, for example, the avalanche of alarms following a fault in process control applications” 

I think of many large-scale outages and multi-day recovery activities, like this one that you all might remember (AWS EBS/RDS outage, 2011).

When I read…

“Existing plans need to be modified or new plans formulated to cope with the consequences of anomalies. Contingencies need to be considered in this process. All these multiple threads challenge control of attention and require practitioners to juggle more tasks.” 

I think of many outage response scenarios I have been in with multiple teams (network, storage, database, security, etc.) gathering data from places they

When I read…

“Multiple parties may have to coordinate to implement activities aimed at gaining information to aid diagnosis or to protect the monitored process.”

I think of these two particular outages, and how in the fog of ambiguous signals coming in during diagnosis of an issue, there is a “divide and conquer” effort distributed throughout differing domain expertise (database, network, various software layers, hardware, etc.) that aims to split the search space of diagnosis, while at the same time keeping each other up-to-date on what pathologies have been eliminated as possibilities, what new data can be used to form hypotheses about what’s going on, etc.

I will post more on the topic of anomaly response in detail (and more of Woods’ work) in another post.

In the meantime, I urge you to take a look at David Woods’ writings, and look for connections in your own work. Below is a talk David gave at IBM’s Almaden Research Center, called “Creating Safety By Engineering Resilience”:

David D. Woods, Creating Safety by Engineering Resilience from jspaw on Vimeo.


Hollnagel, E., & Woods, D. D. (1983). Cognitive systems engineering: New wine in new bottles. International Journal of Man-Machine Studies, 18(6), 583–600.

Klein, G., Feltovich, P. J., Bradshaw, J. M., & Woods, D. D. (2005). Common ground and coordination in joint activity. Organizational Simulation, 139–184.

Woods, D. D. (1995). The alarm problem and directed attention in dynamic fault management. Ergonomics. doi:10.1080/00140139508925274

Woods, D. D., & Hollnagel, E. (2006). Joint cognitive systems: Patterns in cognitive systems engineering. Boca Raton: CRC/Taylor & Francis.

Woods, D. D., & Patterson, E. S. (1999). How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands. Stress, 1–13.

Woods, D. D., Patterson, E. S., & Roth, E. M. (2002). Can We Ever Escape from Data Overload? A Cognitive Systems Diagnosis. Cognition, Technology & Work, 4(1), 22–36. doi:10.1007/s101110200002