Hacking out an Openshift app

I had an itch to scratch, and I wanted to get a bit more familiar with Openshift. I had used it in the past, but it was time to have another go. The app and the code are now available. Feel free to check out:

https://pdfdoc-purpleidea.rhcloud.com/

This is a simple app that takes the URL of a markdown file on GitHub, and outputs a pandoc converted PDF. I wanted to use pandoc specifically, because it produces PDFs that are beautifully typeset with LaTeX. To embed a link in your upstream documentation that points to a PDF, just append the file’s URL to this app’s URL, under a /pdf/ path. For example:

https://pdfdoc-purpleidea.rhcloud.com/pdf/https://github.com/purpleidea/puppet-gluster/blob/master/DOCUMENTATION.md

will send you to a PDF of the puppet-gluster documentation. This will make it easier to accept questions as FAQ patches, without needing to have the git embedded binary PDF be constantly updated.
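
If you generate these links from a script, the rule above is a one-liner. Here is a minimal Python sketch; `pdf_link` is a hypothetical helper name, and the only thing it assumes is the `<app>/pdf/<file URL>` layout described above:

```python
# Sketch: build a PDF link by appending a file's URL to the app's /pdf/ path.
APP_URL = "https://pdfdoc-purpleidea.rhcloud.com"  # the app instance from this post


def pdf_link(markdown_url, app_url=APP_URL):
    """Return the app URL that serves a PDF of the given markdown file."""
    # pdf_link is a hypothetical helper; the layout is <app>/pdf/<file URL>.
    return app_url.rstrip("/") + "/pdf/" + markdown_url


if __name__ == "__main__":
    print(pdf_link("https://github.com/purpleidea/puppet-gluster/blob/master/DOCUMENTATION.md"))
```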

If you want to hear more about what I did, read on…

The setup:

Start by getting a free Openshift account. You’ll also want to install the client tools. Nothing is worse than having to interact with your app via a web interface. Hackers use terminals. Luckily, the Openshift team knows this, and they’ve created a great command line tool called rhc to make it all possible.

I started by following their instructions:

$ sudo yum install rubygem-rhc
$ sudo gem update rhc

Unfortunately, this left me with a problem:

$ rhc
/usr/share/rubygems/rubygems/dependency.rb:298:in `to_specs': Could not find 'rhc' (>= 0) among 37 total gem(s) (Gem::LoadError)
    from /usr/share/rubygems/rubygems/dependency.rb:309:in `to_spec'
    from /usr/share/rubygems/rubygems/core_ext/kernel_gem.rb:47:in `gem'
    from /usr/local/bin/rhc:22:in `<main>'

I solved this by running:

$ gem install rhc

This makes my user-installed rhc take precedence over the system one. Then run:

$ rhc setup

and the rhc client will take you through some setup steps such as uploading your public ssh key to the Openshift infrastructure. The beauty of this tool is that it will work with the Red Hat hosted infrastructure, or you can use it with your own infrastructure if you want to host your own Openshift servers. This alone means you’ll never get locked in to a third-party provider’s terms or pricing.

Create a new app:

To get a fresh python 3.3 app going, you can run:

$ rhc create-app <appname> python-3.3

From this point on, it’s fairly straightforward, and you can now hack your way through the app in python. To push a new version of your app into production, it’s just a git commit away:

$ git add -p && git commit -m 'Awesome new commit...' && git push && rhc tail
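
For reference, the deploy log further down shows the cartridge picking wsgi.py as the WSGI entry point. A minimal wsgi.py could look like the sketch below. This is not the pdfdoc source: the response bodies are placeholders, and the /pdf/ handling only illustrates the URL scheme described earlier, with no actual conversion.

```python
# wsgi.py -- minimal sketch, not the actual pdfdoc source.
def application(environ, start_response):
    """WSGI entry point; mod_wsgi looks for a callable named `application`."""
    path = environ.get("PATH_INFO", "")
    if path.startswith("/pdf/"):
        # Everything after /pdf/ is treated as the upstream markdown URL.
        upstream = path[len("/pdf/"):]
        body = ("would convert: " + upstream).encode("utf-8")  # placeholder
        status = "200 OK"
    else:
        body = b"usage: /pdf/<url of a markdown file>"
        status = "404 Not Found"
    start_response(status, [("Content-Type", "text/plain; charset=utf-8"),
                            ("Content-Length", str(len(body)))])
    return [body]
```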

Creating a new app from existing code:

If you want to push a new app from an existing code base, it’s as easy as:

$ rhc create-app awesomesauce python-3.3 --from-code https://github.com/purpleidea/pdfdoc
Application Options
-------------------
Domain:      purpleidea
Cartridges:  python-3.3
Source Code: https://github.com/purpleidea/pdfdoc
Gear Size:   default
Scaling:     no

Creating application 'awesomesauce' ... done


Waiting for your DNS name to be available ... done

Cloning into 'awesomesauce'...
The authenticity of host 'awesomesauce-purpleidea.rhcloud.com (203.0.113.13)' can't be established.
RSA key fingerprint is 00:11:22:33:44:55:66:77:88:99:aa:bb:cc:dd:ee:ff.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'awesomesauce-purpleidea.rhcloud.com,203.0.113.13' (RSA) to the list of known hosts.

Your application 'awesomesauce' is now available.

  URL:        http://awesomesauce-purpleidea.rhcloud.com/
  SSH to:     00112233445566778899aabb@awesomesauce-purpleidea.rhcloud.com
  Git remote: ssh://00112233445566778899aabb@awesomesauce-purpleidea.rhcloud.com/~/git/awesomesauce.git/
  Cloned to:  /home/james/code/awesomesauce

Run 'rhc show-app awesomesauce' for more details about your app.

In my case, my app also needs some binaries installed. I haven’t yet automated this process, but I think it can be done by creating a custom cartridge. Help to do this would be appreciated!

Updating your app:

In the case of an app that I already deployed with this method, updating it from the upstream source is quite easy. You just pull down any relevant commits, and then push them up to your app’s git repo:

$ git pull upstream master 
From https://github.com/purpleidea/pdfdoc
 * branch            master     -> FETCH_HEAD
Updating 5ac5577..bdf9601
Fast-forward
 wsgi.py | 2 --
 1 file changed, 2 deletions(-)
$ git push origin master 
Counting objects: 7, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 312 bytes | 0 bytes/s, done.
Total 3 (delta 2), reused 0 (delta 0)
remote: Stopping Python 3.3 cartridge
remote: Waiting for stop to finish
remote: Waiting for stop to finish
remote: Building git ref 'master', commit bdf9601
remote: Activating virtenv
remote: Checking for pip dependency listed in requirements.txt file..
remote: You must give at least one requirement to install (see "pip help install")
remote: Running setup.py script..
remote: running develop
remote: running egg_info
remote: creating pdfdoc.egg-info
remote: writing pdfdoc.egg-info/PKG-INFO
remote: writing dependency_links to pdfdoc.egg-info/dependency_links.txt
remote: writing top-level names to pdfdoc.egg-info/top_level.txt
remote: writing manifest file 'pdfdoc.egg-info/SOURCES.txt'
remote: reading manifest file 'pdfdoc.egg-info/SOURCES.txt'
remote: writing manifest file 'pdfdoc.egg-info/SOURCES.txt'
remote: running build_ext
remote: Creating /var/lib/openshift/00112233445566778899aabb/app-root/runtime/dependencies/python/virtenv/venv/lib/python3.3/site-packages/pdfdoc.egg-link (link to .)
remote: pdfdoc 0.0.1 is already the active version in easy-install.pth
remote: 
remote: Installed /var/lib/openshift/00112233445566778899aabb/app-root/runtime/repo
remote: Processing dependencies for pdfdoc==0.0.1
remote: Finished processing dependencies for pdfdoc==0.0.1
remote: Preparing build for deployment
remote: Deployment id is 9c2ee03c
remote: Activating deployment
remote: Starting Python 3.3 cartridge (Apache+mod_wsgi)
remote: Application directory "/" selected as DocumentRoot
remote: Application "wsgi.py" selected as default WSGI entry point
remote: -------------------------
remote: Git Post-Receive Result: success
remote: Activation status: success
remote: Deployment completed with status: success
To ssh://00112233445566778899aabb@awesomesauce-purpleidea.rhcloud.com/~/git/awesomesauce.git/
   5ac5577..bdf9601  master -> master
$

Final thoughts:

I hope this helped you get going with Openshift. Feel free to send me patches!

Happy hacking!

James


Supporting Millions of Pretty URL Rewrites in Nginx with Lua and Redis

About a year ago, I was tasked with greatly expanding our URL rewrite capabilities. Our file-based nginx rewrites were becoming a performance bottleneck and we needed to make an architectural leap that would take us to the next level of SEO wizardry.

In comparison to the total number of product categories in our database, Stylight supports a handful of “pretty URLs” – those understandable by a human being. Take http://www.stylight.com/Sandals/Women/ – pretty obvious what’s going to be on that page, right?

Our web application, however, only understands http://www.stylight.com/search.action?gender=women&tag=10580&tag=10630. So, nginx needs to translate pretty URLs into something our app can find, fetch and return to your page. And this needs to happen as fast as computationally possible. Click on that link and you’ll notice we redirect you to the pretty URL. This is because we’ve found out women really love sandals so we want to give them a page they’d like to bookmark.

We import and update millions of products a day, so the vast majority of our links start out as “?tag=10580”. Googlebot knows how dynamic our site is, so it’s constantly crawling and indexing these functional links to feed its search results. As we learn from our users and ad campaigns which products are really interesting, we dynamically assign pretty URLs and inform Google with 301 redirects.

This creates 2 layers of redirection and doubles the urls our webserver needs to know about:

  • 301 redirects for the user (and search engines): ?gender=women&tag=10580&tag=10630 -> /Sandals/Women/
  • internal rewrites for our app: /Sandals/Women/ -> ?gender=women&tag=10580&tag=10630
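
To make the two layers concrete, here is a minimal Python sketch of the lookup logic, for illustration only. A plain dict stands in for the lookup table, and the `redirect:` / `rewrite:` key prefixes and the `resolve` function are assumptions made for this example, not the actual implementation:

```python
# A plain dict stands in for the lookup table; the key prefixes are hypothetical.
STORE = {
    "redirect:/search.action?gender=women&tag=10580&tag=10630": "/Sandals/Women/",
    "rewrite:/Sandals/Women/": "/search.action?gender=women&tag=10580&tag=10630",
}


def resolve(request_uri, store=STORE):
    """Return (action, target) for a request URI."""
    pretty = store.get("redirect:" + request_uri)
    if pretty is not None:
        return ("301", pretty)        # send users and crawlers to the pretty URL
    internal = store.get("rewrite:" + request_uri)
    if internal is not None:
        return ("rewrite", internal)  # serve the app URL, address bar unchanged
    return ("pass", request_uri)      # no rule: hand the request over as-is
```

Each request costs at most two key lookups, and every pretty URL needs two entries, which is where the doubling comes from.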

So, how can we provide millions of pretty URLs to showcase all facets of our product search results?

The problem with file based, nginx rewrites: memory & reload times

With 800K rewrites and redirects (or R&Rs for short) in over 12 country rewrite.conf files, our “next level” initially means about 8 million R&R URLs. But we could barely cope with our current requirements.

File-based R&Rs are loaded into memory for all 16 nginx workers. Besides 3GB of RAM, it took almost 5 seconds just to reload or restart nginx! As a quick test, I doubled the amount of rewrites for one country. 20 seconds later nginx was successfully reloaded and running with 3.5GB of memory. Talk about “scale fail”.

What are the alternatives?

Google searching for nginx with millions of rewrites or redirects didn’t give a whole lot of insight, but digging through what I found eventually led me to OpenResty. Not being a full-time sysadmin, I don’t care to build and maintain custom binaries.

My next search for OpenResty on Ubuntu Trusty led me to lua-nginx-redis – perhaps not the most performant solution, but I’d take the compromise for community supported patches. A sudo apt-get install lua-nginx-redis gave us the basis for our new architecture.

As an initial test, I copied our largest country’s rewrites into redis, made a quick lua script for handling the rewrites and made my first head-to-head test:
[Figure: redis-vs-nginx]

I included network round trip times in my test to get an idea of the complete performance improvement we hoped to realize with this re-architecture. Interesting how quite a few URLs (those towards the bottom of the rewrite file) caused significant spikes in response times. From these initial results, we decided to make the investment and completely overhaul our rewrite and redirect infrastructure.

The 301 redirects lived exclusively on the frontend load balancers while the internal rewrites were handled by our app servers. First order of business would be to combine these, leaving the application to concentrate on just serving requests. Next, we set up a cronjob to incrementally update R&Rs every 5 minutes. I gave the R&Rs a TTL of one month to keep the redis db tidy. Weekly, we run a full insert which resets the TTL. And, yes, we monitor the TTLs of our R&Rs – don’t want all of them disappearing overnight!
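
The TTL scheme can be sketched in a few lines of Python. This is only an illustration: `TTLStore` is a hypothetical in-memory stand-in for redis, a month is assumed to be 30 days, and `full_insert` and `min_ttl` are made-up names for the weekly job and the value worth monitoring.

```python
import time

MONTH = 30 * 24 * 3600  # TTL of one month, in seconds (30 days assumed)


class TTLStore:
    """Tiny in-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self):
        self._data = {}  # key -> (value, absolute expiry time)

    def set(self, key, value, ttl, now=None):
        now = time.time() if now is None else now
        self._data[key] = (value, now + ttl)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            return None  # missing or expired
        return entry[0]

    def min_ttl(self, now=None):
        """Smallest remaining TTL across live keys, or None if empty."""
        now = time.time() if now is None else now
        remaining = [exp - now for _, exp in self._data.values() if exp > now]
        return min(remaining) if remaining else None


def full_insert(store, rules, now=None):
    """Weekly job: write every rule again, which resets its TTL."""
    for key, value in rules.items():
        store.set(key, value, MONTH, now=now)
```

If the weekly job stops running, `min_ttl` keeps shrinking, so alerting when it drops well below three weeks gives plenty of warning before rules start to expire.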

The performance of Lua and Redis

We launched the new solution in the middle of July this year – just over three months ago.
And our average response time during the same period:
[Figure: pingdom_response_time]

As you can see, despite rapidly growing traffic, we saw the first significant improvements to our site’s response time just by moving the R&Rs out of files and into redis. Reload times for nginx are instant – there are no more rewrites for it to load and distribute per worker – and memory usage has dropped below 900MB.

Since the launch, we’ve doubled our number of R&Rs (check out how the memory scales):
[Figure: redis_keys_memory_growth]

Soon we’ll be able to serve all our URLs like http://www.stylight.com/Dark-Green/Long-Sleeve/T-Shirts/Gap/Men/ by default. No, we’re not quite there yet, but if you need that kinda shirt

We’ve got a lot of SEO work ahead of us which will require millions more rewrites. And now we have a performant architecture which will support it. If you have any questions or would like to know more details, don’t hesitate to contact me @danackerson.

Continuous integration for Puppet modules

I just patched puppet-gluster and puppet-ipa to bring their infrastructure up to date with the current state of affairs…

What’s new?

  • Better README’s
  • Rake syntax checking (fewer oopsies)
  • CI (testing) with travis on git push (automatic testing for everyone)
  • Use of .pmtignore to ignore files from puppet module packages (finally)
  • Pushing modules to the forge with blacksmith (sweet!)

This last point deserves another mention. Puppetlabs created the “forge” to try to provide some sort of added value to their stewardship. Personally, I like to look for code on github instead, but nevertheless, some do use the forge. The problem is that to upload new releases, you need to click your mouse like a windows user! Someone has finally solved that problem! If you use blacksmith, a new build is just a rake push away!

Have a look at this example commit if you’re interested in seeing the plumbing.

Better documentation and FAQ answering:

I’ve answered a lot of questions by email, but this only helps out individuals. From now on, I’d appreciate if you asked your question in the form of a patch to my FAQ. (puppet-gluster, puppet-ipa)

I’ll review and merge your patch, including a follow-up patch with the answer! This way you’ll get more familiar with git and sending small patches, everyone will benefit from the response, and I’ll be able to point you to the docs (and even a specific commit) to avoid responding to already answered questions. You’ll also have the commit information of someone else who already had this problem. Cool, right?

Happy hacking,

James


A real whisper-to-InfluxDB program.

The whisper-to-influxdb migration script I posted earlier is pretty bad. A shell script, without concurrency, and an undiagnosed performance issue. I hinted that one could write a Go program using the unofficial whisper-go bindings and the influxdb Go client library. That's what I've done now; it's at github.com/vimeo/whisper-to-influxdb. It uses a configurable number of workers for both whisper fetches and InfluxDB commits, but it's still a bit naive in the sense that it commits to InfluxDB one series at a time, irrespective of how many records are in it. My series, and hence my commits, have at most 60k records, and presumably InfluxDB could handle a lot more per commit, so we might leverage better batching later. Either way, this way I can consistently commit about 100k series every 2.5 hours (or roughly 10/s), where each series has a few thousand points on average, with peaks up to 60k points. I usually play with 1 to 30 InfluxDB workers. Even though I've hit a few InfluxDB issues, this tool has enabled me to fill in gaps after outages and to do a restore from whisper after a complete database wipe.
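
The shape of the tool (one pool of workers fetching, another committing, one commit per series) can be sketched in Python. This is not the Go program: `migrate` is a made-up name, and `fetch_series` and `commit_series` are hypothetical callables standing in for the whisper read and the InfluxDB write.

```python
from concurrent.futures import ThreadPoolExecutor


def migrate(series_names, fetch_series, commit_series,
            fetch_workers=10, commit_workers=10):
    """Fetch every series with one pool and commit each with another.

    fetch_series(name) -> list of points, and commit_series(name, points)
    are hypothetical stand-ins for the whisper read and the InfluxDB write.
    Returns the total number of points handed to commit_series.
    """
    names = list(series_names)
    total = 0
    with ThreadPoolExecutor(max_workers=fetch_workers) as fetchers, \
            ThreadPoolExecutor(max_workers=commit_workers) as committers:
        commits = []
        # map() yields the fetched series in input order as they are ready.
        for name, points in zip(names, fetchers.map(fetch_series, names)):
            # One commit per series, however many points it holds.
            commits.append(committers.submit(commit_series, name, points))
            total += len(points)
        for future in commits:
            future.result()  # re-raise any commit error
    return total
```

Better batching would mean grouping several small series into a single commit before submitting it, which this sketch deliberately does not do.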

Fixing dropbox “conflicted copy” problems

I usually avoid proprietary cloud services because of freedom, privacy and vendor lock-in concerns. In addition, there are some excellent libre (and hosted) services such as WordPress, Wikipedia and OpenShift which don’t have the above problems. Thirdly, there are everyday Free Software tools such as Fedora GNU/Linux, LibreOffice, and git-annex-assistant which make my computing much more powerful. Finally, there are some hosted services that I use that don’t lock me in because I use them as push-only mirrors, and I only interact with them using Free Software tools. The two examples are GitHub and Dropbox.

Today, Dropbox bit me. Here’s how I saved my data.

Dropbox integrates with GNOME‘s nautilus to sync your data to their proprietary cloud hosting. I periodically run the dropbox client to sync any changes to my public files up to their servers. Today, the client decided that some of my newer files were older than the stored server-side versions, and promptly overwrote my newer versions.

Thankfully I have real backups, and, to be fair, Dropbox actually renamed my newer files instead of blatantly clobbering them. My filesystem now looks like this:

$ tree files/
files/
|-- bar
|-- baz
|   |-- file1
|   |-- file1 (james's conflicted copy 2014-09-29)
|   |-- file2 (james's conflicted copy 2014-09-29).sh
|   `-- file2.sh
`-- foo
    `-- magic.sh

You’ll note that my previously clean file system now has the “conflicted copy” versions everywhere. These are the good versions, whereas in the example above file1 and file2.sh are the older unwanted versions.

I spent some time with find and diff convincing myself that this was true, and eventually I wrote a script. The script looks through the current working directory for “conflicted copy” matches, saves the unwanted versions (just in case) and then clobbers them with the good “conflicted” version.

Please look through, edit, and understand this script before running it. It might not be what you want, and it was designed to only work for me. It is available as a gist, and below in the body of this article.

$ cat fix-dropbox.sh 
#!/bin/bash

# XXX: use at your own risk - do not run without understanding this first!
exit 1

# safety directory
BACKUP='/tmp/fix-dropbox/'

# TODO: detect or pick manually...
NAME=`hostname`
#NAME='myhostname'
DATE='2014-09-29'

mkdir -p "$BACKUP"
find . -path "*(*'s conflicted copy [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]*" -print0 | while read -d $'' -r file; do
    printf 'Found: %s\n' "$file"

    # TODO: detect or pick manually...
    #NAME='XXX'
    #DATE='2014-09-29'

    STRING=" (${NAME}'s conflicted copy ${DATE})"
    #echo $STRING
    RESULT=`echo "$file" | sed "s/$STRING//"`
    #echo $RESULT

    SAVE="$BACKUP"`dirname "$RESULT"`
    #echo $SAVE
    mkdir -p "$SAVE"
    cp "$RESULT" "$SAVE"
    mv "$file" "$RESULT"

done
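
The script above hard-codes NAME and DATE (see the TODOs). As a possible way to detect them instead, here is a small Python sketch that strips the marker with a regular expression. It is an alternative illustration, not part of the script: `CONFLICT_RE` and `original_path` are hypothetical names, and the pattern only assumes the marker format shown in the tree listing.

```python
import re

# " (<name>'s conflicted copy YYYY-MM-DD)" as it appears in the listing above.
CONFLICT_RE = re.compile(r" \([^()/]*'s conflicted copy \d{4}-\d{2}-\d{2}\)")


def original_path(conflicted_path):
    """Return the path the conflicted copy should replace, or None."""
    cleaned, count = CONFLICT_RE.subn("", conflicted_path, count=1)
    return cleaned if count else None


if __name__ == "__main__":
    print(original_path("./baz/file2 (james's conflicted copy 2014-09-29).sh"))
```

As with the shell version, back up the unwanted file before moving the conflicted copy over it.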

You can thank bash for saving your data. Stop bashing it and read this article instead.

Happy hacking,

James

 


InfluxDB as a graphite backend, part 2

The Graphite + InfluxDB series continues.

  • In part 1, "On Graphite, Whisper and InfluxDB" I described the problems of Graphite's whisper and ceres, why I disagree with common graphite clustering advice as being the right path forward, what a great timeseries storage system would mean to me, why InfluxDB - despite being the youngest project - is my main interest right now, and introduced my approach for combining both and leveraging their respective strengths: InfluxDB as an ingestion and storage backend (and at some point, realtime processing and pub-sub) and graphite for its renowned data processing-on-retrieval functionality. Furthermore, I introduced some tooling: carbon-relay-ng to easily route streams of carbon data (metrics datapoints) to storage backends, allowing me to send production data to Carbon+whisper as well as InfluxDB in parallel, graphite-api, the simpler Graphite API server, with graphite-influxdb to fetch data from InfluxDB.
  • Not Graphite related, but I wrote influx-cli which I introduced here. It makes it easy to interface with InfluxDB and measure the duration of operations, which will become useful for this article.
  • In the Graphite & Influxdb intermezzo I shared a script to import whisper data into InfluxDB and noted some write performance issues I was seeing, but the better part of the article described the various improvements done to carbon-relay-ng, which is becoming an increasingly versatile and useful tool.
  • In part 2, which you are reading now, I'm going to describe recent progress, share more info about my setup, testing results, state of affairs, and ideas for future work

Progress made

  • InfluxDB saw two major releases:
    • 0.7 (and followups), which was mostly about some needed features and bug fixes
    • 0.8 was all about bringing some major refactorings into the hands of early adopters/testers: support for multiple storage engines, configurable shard spaces, rollups and retention schemes. There was some other useful stuff like speed and robustness improvements for the graphite input plugin (by yours truly) and various things like regex filtering for 'list series'. Note that a bunch of older bugs remained open throughout this release (most notably the broken derivative aggregator), and a bunch of new ones appeared. Maybe this is why the release was mostly in the dark. In this context, it's not so bad, because we let graphite-api do all the processing, but if you want to query InfluxDB directly you might hit some roadblocks.
    • An older fix, but worth mentioning: series names can now also contain any character, which means you can easily use metrics2.0 identifiers. This is a welcome relief after having struggled with Graphite's restrictions on metric keys.
  • graphite-api received various bug fixes and support for templating, statsd instrumentation and caching.
    Much of this was driven by graphite-influxdb: the caching allows us to cache metadata and the statsd integration gives us insights into the performance of the steps it goes through of building a graph (getting metadata from InfluxDB, querying InfluxDB, interacting with cache, post processing data, etc).
  • the progress on InfluxDB and graphite-api in turn enabled graphite-influxdb to become faster and simpler (note: graphite-influxdb requires InfluxDB 0.8). Furthermore you can now configure series resolutions (but different retentions per series are on the roadmap, see State of affairs and what's coming), and of course it also got a bunch of bugfixes.
Because of all these improvements, all involved components are now ready for serious use.

Putting it all together, with docker

Docker probably needs no introduction: it's a nifty tool to build an environment with given software installed, and it lets you easily deploy and run it in isolation. graphite-api-influxdb-docker is a very creatively named project that generates the - also very creatively named - docker image graphite-api-influxdb, which contains graphite-api and graphite-influxdb, making it easy to hook in a customized configuration and get it up and running quickly. This is the recommended way to set this up, and this is what we run in production.

The setup

  • a server running InfluxDB and graphite-api with graphite-influxdb via the docker approach described above:
    dell PowerEdge R610
    24 x Intel(R) Xeon(R) X5660  @ 2.80GHz
    96GB RAM
    perc raid h700
    6x600GB seagate 10k rpm drives in raid10 = 1.6 TB, Adaptive Read Ahead, Write Back, 64 kB blocks, no read caching
    no sharding/shard spaces, compiled from git just before 0.8, using LevelDB (not rocksdb, which is now the default)
    LevelDB max-open-files = 10000 (lsof shows about 30k open files total for the InfluxDB process), LRU 4096m, everything else is default I think.
    
  • a server running graphite-web, carbon, and whisper:
    dell PowerEdge R710
    16 x Intel(R) Xeon(R) E5640  @ 2.67GHz
    96GB RAM
    perc raid h700
    8x150GB seagate 15k rpm in raid5 = 952 GB, Read Ahead, Write Back, 64 kB blocks, no read caching
    MAX_UPDATES_PER_SECOND = 1000  # to sequentialize writes
    
  • a relay server running carbon-relay-ng that sends the same production load into both. (about 2500 metrics/s, or 150k minutely)
As you can tell, on both machines RAM is vastly overprovisioned, and they have lots of cpu available (the difference in cores should be negligible), but the difference in RAID level is important to note: RAID 5 comes with a write penalty. Even though the whisper machine has more, and faster disks, it probably has a disadvantage for writes. Maybe. I haven't done raid stuff in a long time, and I haven't measured it.
Clearly you'll need to take the results with a grain of salt, as unfortunately I do not have 2 systems available with the same configuration and their baseline (raw) performance is unknown.
Note: no InfluxDB clustering, see State of affairs and what's coming.

The empirical validation & migration

Once everything was set up and I could confidently send 100% of traffic to InfluxDB via carbon-relay-ng, it was trivial to run our dashboards with a flag deciding which server to go to. This way I have literally been running our graphite dashboards next to each other, allowing us to compare both stacks on:
  • visual differences: after a bunch of work and bug fixing, we got to a point where both dashboards looked almost exactly the same. (note that graphite-api's implementation of certain functions can behave slightly differently, see for example this divideSeries bug)
  • speed differences by simply refreshing both pages and watching the PNGs load, with some assistance from firebug's network requests profiler. The difference here was big: graphs served up by graphite-api + InfluxDB loaded considerably faster. A page with 40 graphs or so would load in a few seconds instead of 20-30 seconds (on both first, as well as subsequent hits). This is for our default, 6-hour timeframe views. When cranking the timeframes up to a couple of weeks, graphite-api + InfluxDB was still faster.
Soon enough my colleagues started asking to make graphite-api + InfluxDB the default, as it was much faster in all common cases. I flipped the switch and everybody has been happy.

When loading a page with many dashboards, the InfluxDB machine will occasionally spike up to 500% cpu, though I rarely get to see any iowait (!), even after syncing the block cache (I just realized it'll probably still use the cache for reads after sync?)
The carbon/whisper machine, on the other hand, is always fighting iowait, which could be caused by the raid 5 write amplification but the random io due to the whisper format probably has more to do with it. Via the MAX_UPDATES_PER_SECOND I've tried to linearize writes, with mixed success. But I've never gone too deep into it. So basically comparing write performance would be unfair in these circumstances, and I am only comparing reads in these tests. Despite the different storage setups, the Linux block cache should make things fair for reads. Whisper's iowait will handicap the reads, but I always did successive runs with fully loaded PNGs to make sure the block cache was warm for reads.

A "slightly more professional" benchmark

I could have stopped here, but the validation above was not very scientific. I wanted to do a somewhat more formal benchmark, to measure read speeds (though I did not have much time so it had to be quick and easy).
I wanted to compare InfluxDB vs whisper, and specifically how performance scales as you play with parameters such as number of series, points per series, and time range fetched (i.e. amount of points). I posted the benchmark on the InfluxDB mailing list. Look there for all information. I just want to reiterate the conclusion here: I was surprised. Because of the results above, I had assumed that InfluxDB would perform reads noticeably quicker than whisper but this is not the case. (maybe because whisper reads are nicely sequential - it's mostly writes that suffer from the whisper format)
This very much contrasts my earlier findings where the graphite-api+InfluxDB powered dashboards clearly take the lead. I have yet to figure out why this is. Maybe something to do with the performance of graphite-web vs graphite-api itself, gunicorn vs apache, worker configuration, or maybe InfluxDB only starts outperforming whisper as concurrency increases. Some more investigation is definitely needed!
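
One way to probe the concurrency hypothesis is to time the same set of reads at increasing levels of parallelism. A rough Python sketch follows; it is not the benchmark I ran. `time_reads` and `sweep` are made-up names, and `fetch` is a hypothetical callable wrapping whatever issues a single read against the backend under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_reads(fetch, queries, concurrency):
    """Run fetch(query) for every query on `concurrency` threads.

    Returns the elapsed wall-clock time in seconds.
    """
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(fetch, queries))  # drain so every read completes
    return time.perf_counter() - started


def sweep(fetch, queries, levels=(1, 2, 4, 8)):
    """Map each concurrency level to the time taken for the full query set."""
    return {n: time_reads(fetch, queries, n) for n in levels}
```

Running `sweep` against both backends with identical queries would show whether one of them only pulls ahead once several reads are in flight.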

Future benchmarks

The simple benchmark above was very simple to execute, as it only requires influx-cli and whisper-fetch (so you can easily check for yourself), but clearly there is a need to test more realistic scenarios with concurrent reads, and doing some write benchmarks would be nice too.
We should also look into cpu and memory usage. I have had the luxury of being able to completely ignore memory usage, but others seem to notice excessive InfluxDB memory usage.
I would also like to see storage efficiency tests. Last time I checked, using LevelDB I was pretty close to 24B per record (which makes sense because time, seq_no and value are all 64bit values, and each record has those 3 fields). (this was with snappy enabled, so it didn't seem to give much benefit). With whisper, I have files where the file size in Bytes divided by total records comes down to 114, for others 31. I haven't looked much into it but it looks like at least InfluxDB is more storage efficient. Also, whisper explicitly encodes None values of course, with InfluxDB those are implied (and require no space)
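
The 24B figure is easy to check, and it gives a back-of-the-envelope daily volume for the load mentioned in the setup section. The sketch below assumes a signed 64-bit time, an unsigned 64-bit sequence number and a 64-bit float, packed without padding; the real on-disk encoding may differ, and `bytes_per_day` is a made-up helper that ignores compression and index overhead.

```python
import struct

# Assumed layout: 64-bit time, 64-bit sequence number, 64-bit float value,
# packed without padding. The real on-disk encoding may differ.
RECORD = struct.Struct("<qQd")

METRICS_PER_SECOND = 2500  # production load quoted in the setup section


def bytes_per_day(points_per_second, record_size=RECORD.size):
    """Raw record bytes written per day, ignoring compression and indexes."""
    return points_per_second * record_size * 86400


if __name__ == "__main__":
    print(RECORD.size, bytes_per_day(METRICS_PER_SECOND))
```

At 2500 points per second that comes to about 5.2 GB of raw records per day.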

Conclusion: many tests and benchmarks should happen, but I don't really have time to conduct them. Hopefully other people in the community will take this on.

State of affairs and what's coming

  • InfluxDB typically performs pretty well, but not in all cases. More validation is needed. It wouldn't surprise me at this point if tools like hbase/Cassandra/riak clearly outperform InfluxDB, as long as we keep in mind that InfluxDB is a young project. A year, or two, from now, it'll probably perform much better. (and then again, it's not all about raw performance. InfluxDB has other strengths)
  • A long time goal which is now a reality: You can use any Graphite dashboard on top of InfluxDB, as long as the data is stored in a graphite-compatible format. Again, the easiest to get running is via graphite-api-influxdb-docker. There are two issues to be mentioned, though:
  • With the 0.8 release out the door, the shard spaces/rollups/retention intervals feature will start stabilizing, so we can start supporting multiple retention intervals per metric
  • Because InfluxDB clustering is undergoing major changes, and because clustering is not a high priority for me, I haven't needed to worry about this. I'll probably only start looking at clustering somewhere in 2015 because I have more pressing issues.
  • Once the new clustering system and the storage subsystem have matured (sounds like a v1.0 ~ v1.2 to me) we'll get more speed improvements and robustness. Most of the integration work is done, it's just a matter of doing smaller improvements, bug fixes and waiting for InfluxDB to become better. Maintaining this stack aside, I personally will start focusing more on:
    • per-second resolution in our data feeds, and potentially storage
    • realtime (but basic) anomaly detection, realtime graphs for some key timeseries. Adrian Cockcroft had an inspirational piece in his Monitorama keynote about how alerts from timeseries should trigger within seconds.
    • Mozilla's awesome heka project (this heka video is great), which should help a lot with the above. Also looking at Etsy's kale stack for anomaly detection
    • metrics 2.0 and making sure metrics 2.0 works well with InfluxDB. Up to now I find the series / columns as a data model too limiting and arbitrary, it could be so much more powerful, ditto for the query language.
  • Can we do anything else to make InfluxDB (+graphite) faster? Yes!
    • Long term, of course, InfluxDB should have powerful enough processing functions and query syntax, so that we don't even need a graphite layer anymore.
    • A storage engine optimized for fixed intervals would probably help, we could have the timestamps implicit instead of explicit. And maybe making the sequence number field optional. Each of these fields currently consumes 1/3 of the record... The sequence number field is not only useless in the Graphite use case, I've also rarely seen people make use of this in other use cases. Not storing the values as 64bit floats would help too. Finally we could have InfluxDB fill in None values without it doing "group by" (timeframe consolidation)
    • Then of course, there are projects to replace graphite-web/graphite-api with a Go codebase: graphite-ng and carbonapi. The latter is more production ready, but depends on some custom tooling and io using protobufs. But it performs an order of magnitude better than the python api server! I haven't touched graphite-ng in a while, but hopefully at some point I can take it up again
  • Another thing to keep in mind when switching to graphite-api + InfluxDB: you lose the graphite composer. I have a few people relying on this, so I can either patch it to talk to graphite-api (meh), separate it out (meh) or replace it with a nicer dashboard like tessera, grafana or descartes. (or Graph-Explorer, but it can be a bit too much of a paradigm shift).
  • some more InfluxDB stuff I'm looking forward to:
    • binary protocol and result streaming (faster communication and responses!) (the latter might not get implemented though)
    • "list series" speed improvements (if metadata querying gets fast enough, we won't need ES anymore for metrics2.0 index)
    • InfluxDB instrumentation so we actually start getting an idea of what's going on in the system, a lot of the testing and troubleshooting is still in the dark.
  • Tracking exceptions in graphite-api is much harder than it should be. Currently there's no way to display exceptions to the user (in the http response) or to even log them. So sometimes you'll get http 500 responses and don't know why. You can use the sentry integration which works all right, but is clunky. Hopefully this will be addressed soon.

Conclusion

The graphite-influxdb stack works and is ready for general consumption. It's easy to install and operate, and performs well. It is expected that InfluxDB will over time mature and ultimately meet all my requirements of the ideal backend. It definitely has a long way to go. More benchmarks and tests are needed. Keep in mind that we're not doing large volumes of metrics. For small/medium shops this solution should work well, but on larger scales you will definitely run into issues. You might conclude that InfluxDB is not for you (yet) (there are alternative projects, after all).

Finally, a closing thought:
Having graphs and dashboards that look nice and load fast is a good thing to have, but keep in mind that graphs and dashboards should be a last resort. It's a solution if all else fails. The fewer graphs you need, the better you're doing.
How can you avoid needing graphs? Automatic alerting on your data.

I see graphs as a temporary measure: they provide headroom while you develop an understanding of the operational behavior of your infrastructure, conceive a model of it, and implement the alerting you need to do troubleshooting and capacity planning. Of course, this process consumes more resources (time and otherwise), and these expenses are not always justifiable, but I think this is the ideal case we should be working towards.


Either way, good luck and have fun!

Graphite & Influxdb intermezzo: migrating old data and a more powerful carbon relay

Migrating data from whisper into InfluxDB

"How do i migrate whisper data to influxdb" is a question that comes up regularly, and I've always replied it should be easy to write a tool to do this. I personally had no need for this, until a recent small influxdb outage where I wanted to sync data from our backup server (running graphite + whisper) to influxdb, so I wrote a script:
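#!/bin/bash
# whisper dir without trailing slash.
wsp_dir=/opt/graphite/storage/whisper
start=$(date -d 'sep 17 6am' +%s)
end=$(date -d 'sep 17 12pm' +%s)
db=graphite
pipe_path=$(mktemp -u)
mkfifo "$pipe_path"
function influx_updater() {
    influx-cli -db "$db" -async < "$pipe_path"
}
influx_updater &
# keep one writer open on the fifo for the whole run, so that influx-cli
# does not see EOF as soon as the first series has been written.
exec 3> "$pipe_path"
while read -r wsp; do
  # turn the relative path a/b/c into the series name a.b.c
  series=$(basename "${wsp////.}" .wsp)
  echo "updating $series ..."
  # seconds -> milliseconds by appending 000; the sequence number is always 1
  whisper-fetch.py --from="$start" --until="$end" "$wsp_dir/$wsp.wsp" | grep -v 'None$' | awk -v series="$series" '{print "insert into \"" series "\" values (" $1 "000,1," $2 ")"}' >&3
done < <(find "$wsp_dir" -name '*.wsp' | sed -e "s#$wsp_dir/##" -e 's/\.wsp$//')
# close the fifo, wait for influx-cli to finish, then clean up
exec 3>&-
wait
rm -f "$pipe_path"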

It relies on the recently introduced asynchronous inserts feature of influx-cli - which commits inserts in batches to improve the speed - and the whisper-fetch tool.
You could probably also write a Go program using the unofficial whisper-go bindings and the influxdb Go client library. But I wanted to keep it simple. Especially when I found out that whisper-fetch is not a bottleneck: starting whisper-fetch, and reading out - in my case - 360 datapoints of a file always takes about 50ms, whereas InfluxDB at first only needed a few ms to flush hundreds of records, but that soon increased to seconds.
Maybe it's a bug in my code, I didn't test this much, because I didn't need to; but people keep asking for a tool so here you go. Try it out and maybe you can fix a bug somewhere. Something about the write performance here must be wrong.
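If you would rather prototype the conversion step outside of awk, here is a minimal Python sketch of the same transformation. It takes lines in the "timestamp value" form that whisper-fetch prints, skips the None datapoints, and builds the insert statements that get fed to influx-cli. The function name to_inserts and the sample series name are made up for illustration, and the double quotes around the series name are an assumption about what the insert statement expects.

```python
def to_inserts(series, lines):
    """Yield one insert statement per datapoint that has a value."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2 or fields[1] == "None":
            continue  # skip blank lines and missing datapoints
        timestamp, value = fields
        # whisper timestamps are in seconds; appending "000" gives milliseconds.
        # The constant 1 is the sequence number, as in the shell script.
        yield 'insert into "%s" values (%s000,1,%s)' % (series, timestamp, value)


# hypothetical sample of whisper-fetch output for one series
sample = ["1410933600\t0.500000", "1410933660\tNone", "1410933720\t1.250000"]
for statement in to_inserts("servers.web1.load", sample):
    print(statement)
```

This only covers the text munging; the batching and the actual writes are still left to influx-cli.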

A more powerful carbon-relay-ng

carbon-relay-ng received a bunch of love and has been a great help in my graphite+influxdb experiments.

Here's what changed:
  • First I made it so that you can adjust routes at runtime while data is flowing through, via a telnet interface.
  • Then Paul O'Connor built an embedded web interface to manage your routes in an easier and prettier way (pictured above)
  • The relay now also emits performance metrics via statsd (I want to make this better by using go-metrics which will hopefully get expvar support at some point - any takers?).
  • Last but not least, I borrowed the diskqueue code from NSQ so now we can also spool to disk to bridge downtime of endpoints and re-fill them when they come back up
Besides our metrics storage, I also plan to put our anomaly detection (currently playing with heka and kale) and carbon-tagger behind the relay, centralizing all routing logic, making things more robust, and simplifying our system design. The spooling should also help to deploy to our metrics gateways at other datacenters, to bridge outages of datacenter interconnects.

I used to think of carbon-relay-ng as the python carbon-relay but on steroids, now it reminds me more of something like nsqd but with an ability to make packet routing decisions by introspecting the carbon protocol, or perhaps Kafka but much simpler, single-node (no HA), and optimized for the domain of carbon streams.
I'd like the HA stuff though, which is why I spend some of my spare time figuring out the intricacies of the increasingly popular raft consensus algorithm. It seems opportune to have a simpler Kafka-like thing, in Go, using raft, for carbon streams. (note: InfluxDB might introduce such a component, so I'm also a bit waiting to see what they come up with)

Reminder: notably missing from carbon-relay-ng is round robin and sharding. I believe sharding/round robin/etc should be part of a broader HA design of the storage system, as I explained in On Graphite, Whisper and InfluxDB. That said, both should be fairly easy to implement in carbon-relay-ng, and I'm willing to assist those who want to contribute it.

Introducing: Oh My Vagrant!

If you’re a reader of my code or of this blog, it’s no secret that I hack on a lot of puppet and vagrant. Recently I’ve fooled around with a bit of docker, too. I realized that the vagrant environments I built for puppet-gluster and puppet-ipa needed to be generalized, and they needed new features too. Therefore…

Introducing: Oh My Vagrant!

Oh My Vagrant is an attempt to provide an easy to use development environment so that you can be up and hacking quickly, and focusing on the real devops problems. The README explains my choice of project name.

Prerequisites:

I use a Fedora 20 laptop with vagrant-libvirt. Efforts are underway to create an RPM of vagrant-libvirt, but in the meantime you’ll have to read: Vagrant on Fedora with libvirt (reprise). This should work with other distributions too, but I don’t test them very often. Please step up and help test :)

The bits:

First clone the oh-my-vagrant repository and look inside:

git clone --recursive https://github.com/purpleidea/oh-my-vagrant
cd oh-my-vagrant/vagrant/

The included Vagrantfile is the current heart of this project. You’re welcome to use it as a template and edit it directly, or you can use the facilities it provides. I’d recommend starting with the latter, which I’ll walk you through now.

Getting started:

Start by running vagrant status (vs) and taking a look at the vagrant.yaml file that appears.

james@computer:/oh-my-vagrant/vagrant$ ls
Dockerfile  puppet/  Vagrantfile
james@computer:/oh-my-vagrant/vagrant$ vs
Current machine states:

template1                 not created (libvirt)

The Libvirt domain is not created. Run `vagrant up` to create it.
james@computer:/oh-my-vagrant/vagrant$ cat vagrant.yaml 
---
:domain: example.com
:network: 192.168.123.0/24
:image: centos-7.0
:sync: rsync
:puppet: false
:docker: false
:cachier: false
:vms: []
:namespace: template
:count: 1
:username: ''
:password: ''
:poolid: []
:repos: []
james@computer:/oh-my-vagrant/vagrant$

Here you’ll see the list of resultant machines that vagrant thinks is defined (currently just template1), and a bunch of different settings in YAML format. The values of these settings help define the vagrant environment that you’ll be hacking in.

Changing settings:

The settings exist so that your vagrant environment is dynamic and can be changed quickly. You can change the settings by editing the vagrant.yaml file. They will be used by vagrant when it runs. You can also change them at runtime with --vagrant-foo flags. Running a vagrant status will show you how vagrant currently sees the environment. Let’s change the number of machines that are defined. Note the location of the --vagrant-count flag and how it doesn’t work when positioned incorrectly.

james@computer:/oh-my-vagrant/vagrant$ vagrant status --vagrant-count=4
An invalid option was specified. The help for this command
is available below.

Usage: vagrant status [name]
    -h, --help                       Print this help
james@computer:/oh-my-vagrant/vagrant$ vagrant --vagrant-count=4 status
Current machine states:

template1                 not created (libvirt)
template2                 not created (libvirt)
template3                 not created (libvirt)
template4                 not created (libvirt)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.
james@computer:/oh-my-vagrant/vagrant$ cat vagrant.yaml 
---
:domain: example.com
:network: 192.168.123.0/24
:image: centos-7.0
:sync: rsync
:puppet: false
:docker: false
:cachier: false
:vms: []
:namespace: template
:count: 4
:username: ''
:password: ''
:poolid: []
:repos: []
james@computer:/oh-my-vagrant/vagrant$

As you can see in the above example, changing the count variable to 4 causes vagrant to see a possible four machines in the vagrant environment. You can change as many of these parameters at a time as you like by using the --vagrant- flags, or you can edit the vagrant.yaml file. The latter is much easier and more expressive, in particular for expressing complex data types. The former is much more powerful when building one-liners, such as:

vagrant --vagrant-count=8 --vagrant-namespace=gluster up gluster{1..8}

which should bring up eight hosts in parallel, named gluster1 to gluster8.
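To make the naming rule concrete, here is a tiny sketch of how a namespace and a count expand into machine names. It is written in Python purely for illustration (the real logic lives in the Ruby Vagrantfile, which I am not reproducing here), and the function name machine_names is hypothetical.

```python
def machine_names(namespace="template", count=1):
    """Return the numbered machine names for a namespace, e.g. template1."""
    # numbering starts at 1, matching template1..template4 in the status output
    return ["%s%d" % (namespace, i) for i in range(1, count + 1)]


print(machine_names())              # the default vagrant.yaml settings
print(machine_names("gluster", 8))  # the one-liner above
```

The second call lists the same names that the gluster{1..8} brace expansion produces in the shell.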

Other VM’s:

Since one often wants to be more expressive in machine naming and heterogeneity of machine type, you can specify a list of machines to define in the vagrant.yaml file vms array. If you’d rather define these machines in the Vagrantfile itself, you can also set them up in the vms array defined there. It is empty by default, but it is easy to uncomment one of the many examples. These will be used as the defaults if nothing else overrides the selection in the vagrant.yaml file. I’ve uncommented a few to show you this functionality:

james@computer:/oh-my-vagrant/vagrant$ grep example[124] Vagrantfile 
    {:name => 'example1', :docker => true, :puppet => true, },    # example1
    {:name => 'example2', :docker => ['centos', 'fedora'], },    # example2
    {:name => 'example4', :image => 'centos-6', :puppet => true, },    # example4
james@computer:/oh-my-vagrant/vagrant$ rm vagrant.yaml # note that I remove the old settings
james@computer:/oh-my-vagrant/vagrant$ vs
Current machine states:

template1                 not created (libvirt)
example1                  not created (libvirt)
example2                  not created (libvirt)
example4                  not created (libvirt)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.
james@computer:/oh-my-vagrant/vagrant$ cat vagrant.yaml 
---
:domain: example.com
:network: 192.168.123.0/24
:image: centos-7.0
:sync: rsync
:puppet: false
:docker: false
:cachier: false
:vms:
- :name: example1
  :docker: true
  :puppet: true
- :name: example2
  :docker:
  - centos
  - fedora
- :name: example4
  :image: centos-6
  :puppet: true
:namespace: template
:count: 1
:username: ''
:password: ''
:poolid: []
:repos: []
james@computer:/oh-my-vagrant/vagrant$ vim vagrant.yaml # edit vagrant.yaml file...
james@computer:/oh-my-vagrant/vagrant$ cat vagrant.yaml 
---
:domain: example.com
:network: 192.168.123.0/24
:image: centos-7.0
:sync: rsync
:puppet: false
:docker: false
:cachier: false
:vms:
- :name: example1
  :docker: true
  :puppet: true
- :name: example4
  :image: centos-7.0
  :puppet: true
:namespace: template
:count: 1
:username: ''
:password: ''
:poolid: []
:repos: []
james@computer:/oh-my-vagrant/vagrant$ vs
Current machine states:

template1                 not created (libvirt)
example1                  not created (libvirt)
example4                  not created (libvirt)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.
james@computer:/oh-my-vagrant/vagrant$

The above output might seem a little long, but if you try these steps out in your terminal, you should get the hang of it fairly quickly. If you poke around in the Vagrantfile, you should see the format of the vms array. Each element in the array should be a dictionary, where the keys correspond to the flags you wish to set. Look at the examples if you need help with the formatting.
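One way to think about the per-machine dictionaries is as overrides layered on top of the global settings. The sketch below shows that idea in Python, assuming a simple "per-machine key wins" merge; the function name effective_settings is hypothetical, and the actual Vagrantfile may resolve these values differently.

```python
# global defaults, taken from the vagrant.yaml shown above
DEFAULTS = {"image": "centos-7.0", "puppet": False, "docker": False}


def effective_settings(vm, defaults=DEFAULTS):
    """Return the settings for one machine: defaults, then per-vm overrides."""
    merged = dict(defaults)  # copy, so the global defaults stay untouched
    merged.update(vm)        # keys set on the machine take precedence
    return merged


vms = [
    {"name": "example1", "docker": True, "puppet": True},
    {"name": "example4", "image": "centos-6", "puppet": True},
]
for vm in vms:
    print(effective_settings(vm))
```

Here example1 inherits the default image, while example4 keeps its own.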

Other settings:

As you saw, other settings are available. There are a few notable ones that are worth mentioning. This will also help explain some of the other features that this Vagrantfile provides.

  • domain: This sets the domain part of each vm’s FQDN. The default is example.com, which should work for most environments, but you’re welcome to change this as you see fit.
  • network: This sets the network that is used for the vm’s. You should pick a network/cidr that doesn’t conflict with any other networks on your machine. This is particularly useful when you have multiple vagrant environments hosted off of the same laptop.
  • image: This is the default base image to use for each machine. It can be overridden per-machine in the vm’s list of dictionaries.
  • sync: This is the sync type used for vagrant. rsync is the default and works in all environments. If you’d prefer to fight with the nfs mounts, or try out 9p, both those options are available too.
  • puppet: This option enables or disables integration with puppet. It is possible to override this per machine. This functionality will be expanded in a future version of Oh My Vagrant.
  • docker: This option enables and lists the docker images to set up per vm. It is possible to override this per machine. This functionality will be expanded in a future version of Oh My Vagrant.
  • namespace: This sets the namespace that your Vagrantfile operates in. This value is used as a prefix for the numbered vm’s, as the libvirt network name, and as the primary puppet module to execute.

More on the docker option:

For now, if you specify a list of docker images, they will be automatically pulled into your vm environment. It is recommended that you pre-cache them in an existing base image to save bandwidth. Custom base vagrant images can easily be built with vagrant-builder, but this process is currently undocumented.

I’ll try to write-up a post on this process if there are enough requests. To keep you busy in the meantime, I’ve published a CentOS 7 vagrant base image that includes docker images for CentOS and Fedora. It is being graciously hosted by the GlusterFS community.

What other magic does this all do?

There is a certain amount of magic glue that happens behind the scenes. Here’s a list of some of it:

  • Idempotent /etc/hosts based DNS
  • Easy docker base image installation
  • IP address calculations and assignment with ipaddr
  • Clever cleanup on ‘vagrant destroy’
  • Vagrant docker base image detection
  • Integration with Puppet

If you don’t understand what all of those mean, and you don’t want to go source diving, don’t worry about it! I will explain them in greater detail when it’s important, and hopefully for now everything “just works” and stays out of your way.
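For the curious, here is a rough idea of what the IP address calculation and the /etc/hosts entries could look like. This is a Python sketch using the standard ipaddress module rather than the Ruby ipaddr library the Vagrantfile uses, and the starting offset of 100, the line layout, and both function names are my assumptions for illustration, not what the Vagrantfile actually does.

```python
import ipaddress


def assign_addresses(network, names, first_host=100):
    """Map each machine name to a consecutive host address in the network."""
    net = ipaddress.ip_network(network)
    assigned = {}
    for offset, name in enumerate(names):
        address = net.network_address + first_host + offset
        # refuse addresses that fall outside the network or on its broadcast
        if address not in net or address == net.broadcast_address:
            raise ValueError("network %s is too small for %s" % (network, name))
        assigned[name] = str(address)
    return assigned


def hosts_lines(assigned, domain="example.com"):
    """Render /etc/hosts style lines: address, FQDN, short name."""
    return ["%s %s.%s %s" % (ip, name, domain, name) for name, ip in assigned.items()]


addresses = assign_addresses("192.168.123.0/24", ["template1", "template2"])
for line in hosts_lines(addresses):
    print(line)
```

Because the lines are derived purely from the settings, regenerating them always gives the same result, which is the property you want for idempotent hosts entries.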

Future work:

There’s still a lot more that I have planned, and some parts of the Vagrantfile need clean up, but I figured I’d try and release this early so that you can get hacking right away. If it’s useful to you, please leave a comment and let me know.

Happy hacking,

James

 


Translations Between Domains: David Woods

One of the reasons I’ve continued to be more and more interested in Human Factors and Safety Science is that I found myself without many answers to the questions I have had in my career. Questions surrounding how organizations work, how people think and work with computers, how decisions get made under uncertainty, and how people cope with increasing amounts of complexity.

As a result, my journey took me deep into a world where I immediately saw connections — between concepts found in other high-tempo, high-consequence domains and my own world of software engineering and operations. One of the first connections was in Richard Cook’s How Complex Systems Fail, and it struck me so deeply I insisted that it get reprinted (with additions by Richard) into O’Reilly’s Web Operations book.

I simply cannot un-see these connections now, and the field of study keeps me going deeper. So deep that I felt I needed to get a degree. My goal with getting a degree in the topic is not just to satisfy my own curiosity, but also to explore these topics in sufficient depth to feel credible in thinking about them critically.

In software, the concept and sometimes inadvertent practice of “cargo cult engineering” is well known. I’m hoping to avoid that in my own translation(s) of what’s been found in human factors, safety science, and cognitive systems engineering, as they looked into domains like aviation, patient safety, or power plant operations. Instead, I’m looking to truly understand that work in order to know what to focus on in my own research as well as to understand how my domain is either similar (and in what ways?) or different (and in what ways?)

For example, just a hint of what sorts of questions I have been mulling over:

  • How does the concept of “normalization of deviance” manifest in web engineering? How does it relate to our concept of ‘technical debt’?
  • What organizational dynamics might be in play when it comes to learning from “successes” and “failures”?
  • What methods of inquiry can we use to better design interfaces that have functionality and safety and diagnosis support as their core? Or, are those goals in conflict? If so, how?
  • How can we design alerts to reduce noise and increase signal in a way that takes into account the context of the intended receiver of the alert? In other words, how can we teach alerts to know about us, instead of the other way around?
  • The Internet (including its technical, political, and cultural structures) has non-zero amounts of diversity, interdependence, connectedness, and adaptation, which by many measures constitutes a complex system.
  • How do successful organizations navigate trade-offs when it comes to decisions that may have unexpected consequences?

I’ve done my best to point my domain at some of these connections as I understand them, and the Velocity Conference has been one of the ways I’ve hoped to bring people “over the bridge” from Safety Science, Human Factors, and Cognitive Systems Engineering into software engineering and operations as it exists as a practice on Internet-connected resources. If you haven’t seen Dr. Richard Cook’s 2012 and 2013 keynotes, or Dr. Johan Bergstrom’s keynote, stop what you’re doing right now and watch them.

I’m willing to bet you’ll see connections immediately…



David Woods is one of the pioneers in these fields, and continues to be a huge influence on the way that I think about our domain and my own research (my thesis project relies heavily on some of his previous work) and I can’t be happier that he’s speaking at Velocity in New York, which is coming up soon. (Pssst: if you register for it here, you can use the code “JOHN20” for a 20% discount)

I have posted before (and likely will again) about a paper Woods contributed to, Common Ground and Coordination in Joint Activity (Klein, Feltovich, Bradshaw, & Woods, 2005) which in my mind might as well be considered the best explanation on what “devops” means to me, and what makes successful teams work. If you haven’t read it, do it now.

 

Dynamic Fault Management and Anomaly Response

I thought about listing all of Woods’ work that I’ve seen connections in thus far, but then I realized that if I wasn’t careful, I’d be writing a literature review and not a blog post. :) Also, I have thesis work to do. So for now, I’d like to point only at two concepts that struck me as absolutely critical to the day-to-day of many readers of this blog, dynamic fault management and anomaly response.

Woods sheds some light on these topics in Joint Cognitive Systems: Patterns in Cognitive Systems Engineering. Pay particular attention to the characteristics of these phenomena:

“In anomaly response, there is some underlying process, an engineered or physiological process which will be referred to as the monitored process, whose state changes over time. Faults disturb the functions that go on in the monitored process and generate the demand for practitioners to act to compensate for these disturbances in order to maintain process integrity—what is sometimes referred to as “safing” activities. In parallel, practitioners carry out diagnostic activities to determine the source of the disturbances in order to correct the underlying problem.

Anomaly response situations frequently involve time pressure, multiple interacting goals, high consequences of failure, and multiple interleaved tasks (Woods, 1988; 1994). Typical examples of fields of practice where dynamic fault management occurs include flight deck operations in commercial aviation (Abbott, 1990), control of space systems (Patterson et al., 1999; Mark, 2002), anesthetic management under surgery (Gaba et al., 1987), terrestrial process control (Roth, Woods & Pople, 1992), and response to natural disasters.” (Woods & Hollnagel, 2006, p.71)

Now look down at the distributed systems you’re designing and operating.

Look at the “runbooks” and postmortem notes that you have written in the hopes that they can help guide teams as they try to untangle the sometimes very confusing scenarios that outages can bring.

Does “safing” ring familiar to you?

Do you recognize managing “multiple interleaved tasks” under “time pressure” and “high consequences of failure”?

I think it’s safe to say that almost every Velocity Conference attendee would see connections here.

In How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands (Woods & Patterson, 1999), he introduces the concept of escalation, in terms of anomaly response:

The concept of escalation captures a dynamic relationship between the cascade of effects that follows from an event and the demands for cognitive and collaborative work that escalate in response (Woods, 1994). An event triggers the evolution of multiple interrelated dynamics.

  • There is a cascade of effects in the monitored process. A fault produces a time series of disturbances along lines of functional and physical coupling in the process (e.g., Abbott, 1990). These disturbances produce a cascade of multiple changes in the data available about the state of the underlying process, for example, the avalanche of alarms following a fault in process control applications (Reiersen, Marshall, & Baker, 1988).
  • Demands for cognitive activity increase as the problem cascades. More knowledge potentially needs to be brought to bear. There is more to monitor. There is a changing set of data to integrate into a coherent assessment. Candidate hypotheses need to be generated and evaluated. Assessments may need to be revised as new data come in. Actions to protect the integrity and safety of systems need to be identified, carried out, and monitored for success. Existing plans need to be modified or new plans formulated to cope with the consequences of anomalies. Contingencies need to be considered in this process. All these multiple threads challenge control of attention and require practitioners to juggle more tasks.
  • Demands for coordination increase as the problem cascades. As the cognitive activities escalate, the demand for coordination across people and across people and machines rises. Knowledge may reside in different people or different parts of the operational system. Specialized knowledge and expertise from other parties may need to be brought into the problem-solving process. Multiple parties may have to coordinate to implement activities aimed at gaining information to aid diagnosis or to protect the monitored process. The trouble in the underlying process requires informing and updating others – those whose scope of responsibility may be affected by the anomaly, those who may be able to support recovery, or those who may be affected by the consequences the anomaly could or does produce.
  • The cascade and escalation is a dynamic process. A variety of complicating factors can occur, which move situations beyond canonical, textbook forms. The concept of escalation captures this movement from canonical to nonroutine to exceptional. The tempo of operations increases following the recognition of a triggering event and is synchronized by temporal landmarks that represent irreversible decision points.

When I read…

“These disturbances produce a cascade of multiple changes in the data available about the state of the underlying process, for example, the avalanche of alarms following a fault in process control applications” 

I think of many large-scale outages and multi-day recovery activities, like this one that you all might remember (AWS EBS/RDS outage, 2011).

When I read…

“Existing plans need to be modified or new plans formulated to cope with the consequences of anomalies. Contingencies need to be considered in this process. All these multiple threads challenge control of attention and require practitioners to juggle more tasks.” 

I think of many outage response scenarios I have been in with multiple teams (network, storage, database, security, etc.) gathering data from places they

When I read…

“Multiple parties may have to coordinate to implement activities aimed at gaining information to aid diagnosis or to protect the monitored process.”

I think of these two particular outages, and how in the fog of ambiguous signals coming in during diagnosis of an issue, there is a “divide and conquer” effort distributed throughout differing domain expertise (database, network, various software layers, hardware, etc.) that aims to split the search space of diagnosis, while at the same time keeping each other up-to-date on what pathologies have been eliminated as possibilities, what new data can be used to form hypotheses about what’s going on, etc.

I will post more on the topic of anomaly response in detail (and more of Woods’ work) in another post.

In the meantime, I urge you to take a look at David Woods’ writings, and look for connections in your own work. Below is a talk David gave at IBM’s Almaden Research Center, called “Creating Safety By Engineering Resilience”:

David D. Woods, Creating Safety by Engineering Resilience from jspaw on Vimeo.

References

Hollnagel, E., & Woods, D. D. (1983). Cognitive systems engineering: New wine in new bottles. International Journal of Man-Machine Studies, 18(6), 583–600.

Klein, G., Feltovich, P. J., Bradshaw, J. M., & Woods, D. D. (2005). Common ground and coordination in joint activity. Organizational Simulation, 139–184.

Woods, D. D. (1995). The alarm problem and directed attention in dynamic fault management. Ergonomics. doi:10.1080/00140139508925274

Woods, D. D., & Hollnagel, E. (2006). Joint cognitive systems: Patterns in cognitive systems engineering. Boca Raton: CRC/Taylor & Francis.

Woods, D. D., & Patterson, E. S. (1999). How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands. Stress, 1–13.

Woods, D. D., Patterson, E. S., & Roth, E. M. (2002). Can We Ever Escape from Data Overload? A Cognitive Systems Diagnosis. Cognition, Technology & Work, 4(1), 22–36. doi:10.1007/s101110200002

Teaching Engineering As A Social Science

Below is a piece written by Edward Wenk, Jr., which originally appeared in PRISM, the magazine for the American Society for Engineering Education (Publication Volume 6. No. 4. December 1996.)

While I think that there’s much more than what Wenk points to as ‘social science’ – I agree wholeheartedly with his ideas. I might even say that he didn’t go far enough in his recommendations.

Enjoy. :)

 

Edward Wenk, Jr.

Teaching Engineering as a Social Science

Today’s public engages in a love affair with technology, yet it consistently ignores the engineering at technology’s core. This paradox is reinforced by the relatively few engineers in leadership positions. Corporations, which used to have many engineers on their boards of directors, today are composed mainly of M.B.A.s and lawyers. Few engineers hold public office or even run for office. Engineers seldom break into headlines except when serious accidents are attributed to faulty design.

While there are many theories on this lack of visibility, from inadequate public relations to inadequate public schools, we may have overlooked the real problem: Perhaps people aren’t looking at engineers because engineers aren’t looking at people.

If engineering is to be practiced as a profession, and not just a technical craft, engineers must learn to harmonize natural sciences with human values and social organization. To do this we must begin to look at engineering as a social science and to teach, practice, and present engineering in this context.

To many in the profession, looking at teaching engineering as a social science is anathema. But consider the multiple and profound connections of engineering to people.

Technology in Everyday Life

The work of engineers touches almost everyone every day through food production, housing, transportation, communications, military security, energy supply, water supply, waste disposal, environmental management, health care, even education and entertainment. Technology is more than hardware and silicon chips.

In propelling change and altering our belief systems and culture, technology has joined religion, tradition, and family in the scope of its influence. Its enhancements of human muscle and human mind are self-evident. But technology is also a social amplifier. It stretches the range, volume, and speed of communications. It inflates appetites for consumer goods and creature comforts. It tends to concentrate wealth and power, and to increase the disparity of rich and poor. In the competition for scarce resources, it breeds conflicts.

In social psychological terms, it alters our perceptions of space. Events anywhere on the globe now have immediate repercussions everywhere, with a portfolio of tragedies that ignite feelings of helplessness. Technology has also skewed our perception of time, nourishing a desire for speed and instant gratification and ignoring longer-term impacts.

Engineering and Government

All technologies generate unintended consequences. Many are dangerous enough to life, health, property, and environment that the public has demanded protection by the government.

Although legitimate debates erupt on the size of government, its cardinal role is demonstrated in an election year when every faction seeks control. No wonder vested interests lobby aggressively and make political campaign contributions.

Whatever that struggle, engineers have generally opted out. Engineers tend to believe that the best government is the least government, which is consistent with goals of economy and efficiency that steer many engineering decisions without regard for social issues and consequences.

Problems at the Undergraduate Level

By both inclination and preparation, many engineers approach the real world as though it were uninhabited. Undergraduates who choose an engineering career often see it as an escape from blue-collar family legacies by obtaining the social prestige that comes with belonging to a profession. Others love machines. Few, however, are attracted to engineering because of an interest in people or a commitment to public service. On the contrary, most are uncomfortable with the ambiguities of human behavior, its absence of predictable cause and effect, its lack of control, and with the demands for direct encounters with the public.

Part of this discomfort originates in engineering departments, which are often isolated from arts, humanities, and social sciences classrooms by campus geography as well as by disparate bodies of scholarly knowledge and cultures. Although most engineering departments require students to take some nontechnical courses, students often select these on the basis of hearsay, academic ease, or course instruction, not in terms of preparation for life or for citizenship.

Faculty attitudes don’t help. Many faculty members enter teaching immediately after obtaining their doctorates, their intellect sharply honed by a research specialty. Then they continue in that groove because of standard academic reward systems for tenure and promotion. Many never enter a professional practice that entails the human equation.

We can’t expect instant changes in engineering education. A start, however, would be to recognize that engineering is more than manipulation of intricate signs and symbols. The social context is not someone else’s business. Adopting this mindset requires a change in attitudes. Consider these axioms:

  • Technology is not just hardware; it is a social process.
  • All technologies generate side effects that engineers should try to anticipate and to protect against.
  • The most strenuous challenge lies in synthesis of technical, social, economic, environmental, political, and legal processes.
  • For engineers to fulfill a noblesse oblige to society, their objectivity must not be defined by conditions of employment, as, for example, when an employer trades off safety for cost.

In a complex, interdependent, and sometimes chaotic world, engineering practice must continue to excel in problem solving and creative synthesis. But today we should also emphasize social responsibility and commitment to social progress. With so many initiatives having potentially unintended consequences, engineers need to examine how to serve as counselors to the public in answering questions of “What if?” They would thus add sensitive, future-oriented guidance to the extraordinary power of technology to serve important social purposes.

In academic preparation, most engineering students miss exposure to the principles of social and economic justice and human rights, and to the importance of biological, emotional, and spiritual needs. They miss Shakespeare’s illumination of human nature – the lust for power and wealth and its corrosive effects on the psyche, and the role of character in shaping ethics that influence professional practice. And they miss models of moral vision to face future temptations.

Engineering’s social detachment is also marked by a lack of teaching about the safety margins that accommodate uncertainties in engineering theories, design assumptions, product use and abuse, and so on. These safety margins shape practice with social responsibility to minimize potential harm to people or property. Our students can learn important lessons from the history of safety margins, especially of failures, yet most use safety protocols without knowledge of that history and without an understanding of risk and its abatement. Can we expect a railroad systems designer obsessed with safety signals to understand that sleep deprivation is even more likely to cause accidents? No, not if the systems designer lacks knowledge of this relatively common problem.

Safety margins are a protection against some unintended consequences. Unless engineers appreciate human participation in technology and the role of human character in performance, they are unable to deal with demons that undermine the intended benefits.

Case Studies in Socio-Technology

Working for the legislative and executive branches of the U.S. government since the 1950s, I have had a ringside seat from which to view many of the events and trends that come from the connections between engineering and people. Following are a few of those cases.

Submarine Design

The first nuclear submarine, USS Nautilus, was taken on its deep submergence trial February 28, 1955. The sub’s power plant had been successfully tested in a full-scale mock-up and in a shallow dive, but the hull had not been subjected to the intense hydrostatic pressure at operating depth. The hull was unprecedented in diameter, in materials, and in special joints connecting cylinders of different diameter. Although it was designed with complex shell theory and confirmed by laboratory tests of scale models, proof of performance was still necessary at sea.

During the trial, the sub was taken stepwise to its operating depth while evaluating strains. I had been responsible for the design equations, for the model tests, and for supervising the test at sea, so it was gratifying to find the hull performed as predicted.

While the nuclear power plant and novel hull were significant engineering achievements, the most important development occurred much earlier on the floor of the U.S. Congress. That was where the concept of nuclear propulsion was sold to a Congressional committee by Admiral Hyman Rickover, an electrical engineer. The proposal had previously been rejected by a conservative Navy; its passage took an electrical engineer who understood how Constitutional power was shared and how to exercise the right of petition. By this initiative, Rickover opened the door to civilian nuclear power that accounts for 20 percent of our electrical generation, perhaps 50 percent in France. If he had failed, and if the Nautilus pressure hull had failed, nuclear power would have been set back by a decade.

Space Telecommunications

Immediately after the 1957 Soviet surprise of Sputnik, engineers and scientists recognized that global orbits required all nations to reserve special radio channels for telecommunications with spacecraft. Implementation required the sanctity of a treaty, preparation of which demanded more than the talents of radio specialists; it engaged politicians, space lawyers, and foreign policy analysts. As science and technology advisor to Congress, I evaluated the treaty draft for technical validity and for consistency with U.S. foreign policy.

The treaty recognized that the airwaves were a common property resource, and that the virtuosity of communications engineering was limited without an administrative protocol to safeguard integrity of transmissions. This case demonstrated that all technological systems have three major components — hardware or communications equipment; software or operating instructions (in terms of frequency assignments); and peopleware, the organizations that write and implement the instructions.

National Policy for the Oceans

Another case concerned a national priority to explore the oceans and to identify U.S. rights and responsibilities in the exploitation and conservation of ocean resources. This issue, surfacing in 1966, was driven by new technological capabilities for fishing, offshore oil development, mining of mineral nodules on the ocean floor, and maritime shipment of oil in supertankers that if spilled could contaminate valuable inshore waters. Also at issue was the safety of those who sailed and fished.

This issue had a significant history. During the late 1950s, the U.S. government was downsizing oceanographic research that initially had been sponsored during World War II. This was done without strong objection, partly because marine issues lacked coherent policy or high-level policy leadership and strong constituent advocacy.

Oceanographers, however, wanting to sustain levels of research funding, prompted a study by the National Academy of Sciences (NAS). Using the report’s findings, which documented the importance of oceanographic research, NAS lobbied Congress with great success, triggering a flurry of bills dramatized by such titles as “National Oceanographic Program.”

But what was overlooked was the ultimate purpose of such research to serve human needs and wants, to synchronize independent activities of major agencies, to encourage public/private partnerships, and to provide political leadership. During the 1960s, in the role of Congressional advisor, I proposed a broad “strategy and coordination machinery” centered in the Office of the President, the nation’s systems manager. The result was the Marine Resources and Engineering Development Act, passed by Congress and signed into law by President Johnson in 1966.

The shift in bill title reveals the transformation from ocean sciences to socially relevant technology, with engineering playing a key role. The legislation thus embraced the potential of marine resources and the steps for both development and protection. By emphasizing policy, ocean activities were elevated to a higher national priority.

Exxon Valdez

Just after midnight on March 24, 1989, the tanker Exxon Valdez, loaded with 50 million gallons of Alaska crude oil, fetched up on Bligh Reef in Prince William Sound and spilled its guts. For five hours, oil surged from the torn bottom at an incredible rate of 1,000 gallons per second. Attention quickly focused on the enormity of environmental damage and on blunders of the ship operators. The captain had a history of alcohol abuse, but was in his cabin at impact. There was much finger-pointing as people questioned how the accident could happen during a routine run on a clear night. Answers were sought by the National Transportation Safety Board and by a state of Alaska commission to which I was appointed. That blame game still continues in the courts.

The commission was instructed to clarify what happened, why, and how to keep it from happening again. But even the commission was not immune to the political blame game. While I wanted to look beyond the ship’s bridge and search for other, perhaps more systemic problems, the commission chair blocked me from raising those issues. Despite my repeated requests for time at the regularly scheduled sessions, I was not allowed to speak. The chair, a former official having tanker safety responsibilities in Alaska, had a different agenda and would only let the commission focus largely on cleanup rather than prevention. Fortunately, I did get to have my say by signing up as a witness and using that forum to express my views and concerns.

The Exxon Valdez proved to be an archetype of avoidable risk. Whatever the weakness in the engineered hardware, the accident was largely due to internal cultures of large corporations obsessed with the bottom line and determined to get their way, a U.S. Coast Guard vulnerable to political tampering and unable to realize its own ethic, a shipping system infected with a virus of tradition, and a cast of characters lulled into complacency that defeated efforts at prevention.

Lessons

These examples of technological delivery systems have unexpected commonalities. Space telecommunications and sea preservation and exploitation were well beyond the purview of just those engineers and scientists working on the projects; they involved national policy and required interaction between engineers, scientists, users, and policymakers. The Exxon Valdez disaster showed what happens when these groups do not work together. No matter how conscientious a ship designer is about safety, it is necessary to anticipate the weaknesses of fallibility and the darker side of self-centered, short-term ambition.

Recommendations

Many will argue that the engineering curriculum is so overloaded that the only source of socio-technical enrichment is a fifth year. Assuming that step is unrealistic, what can we do?

  • The hodgepodge of nonengineering courses could be structured to provide an integrated foundation in liberal arts.
  • Teaching at the upper division could be problem- rather than discipline-oriented, with examples from practice that integrate nontechnical parameters.
  • Teaching could employ the case method often used in law, architecture, and business.
  • Students could be encouraged to learn about the world around them by reading good newspapers and nonengineering journals.
  • Engineering students could be encouraged to join such extracurricular activities as debating or political clubs that engage students from across the campus.

As we strengthen engineering’s potential to contribute to society, we can market this attribute to women and minority students who often seek socially minded careers and believe that engineering is exclusively a technical pursuit.

For practitioners of the future, something radically new needs to be offered in schools of engineering. Otherwise, engineers will continue to be left out.