DevOps and the Iteration Showcase
Look down. Look up again. You’re on the agile team your team could be like.
It’s the end of the iteration, and there’s a showcase this afternoon (sprint demo if you prefer) demonstrating all the new functionality the team has built in the last two weeks. In the room are members of the project team, the product owner, and various stakeholders and interested parties from the marketing and customer service teams who use the product every day. Everyone’s very excited about the new features, and provides some great feedback on the spot.
This sounds great! But something is missing. Where are the ‘ops’ features?
Very few agile projects I’ve been on will demonstrate the ‘cross-functional’* or ops features they’ve completed in the same showcase, but they SHOULD. Features like monitoring, failover testing, deployment automation, performance improvements – these are all very important to our business. If you’ve truly got a DevOps culture, then they should be showcased and celebrated alongside the new whizzy UI features.
How do we make these achievements relevant to a wider audience? Start by describing the work in a different way – talk about the work that’s being done in terms of its benefit to our business.
A technical story would look like:
Enable monitoring of JVM heap allocation.
To make it more understandable to the business, highlight the business benefit in this way:
In order to reduce the risk of an outage as site traffic grows
The operations team need to
Monitor the JVM heap memory allocation
Putting the business benefit up front (and keeping it always present) should help make the story more interesting to showcase.
The regular showcase presentation is also an opportunity to report to the stakeholder group on the current state of the system in production. This can take the form of presenting some selected metrics plotted over time. For a website you might include metrics on site traffic, response times, performance and stability over time. The presentation should support the prioritisation of appropriate cross-functional work to improve those metrics over time.
Getting to the point where cross-functional work is celebrated by a wider stakeholder group requires some creativity and effort. When it works I’ve observed it makes the conversations around proper prioritisation and collaboration on DevOps work so much easier.
* I’ve taken to using the term ‘cross-functional requirements’ (thanks to Sarah) to describe requirements that are cross-cutting and not-directly-functional – for example performance, availability, volume, maintainability. I think the term NFR has become a weasel-word, treated as ‘someone else’s problem’ rather than an important priority. It might just be a word game, but I think it’s useful.
Effective adhoc commands in clusters
Last night I had a bit of a mental dump on Twitter about structured data and non-structured data when communicating with a cluster of servers – Twitter fails at this kind of stuff so I figured I’d follow up with a blog post.
I started off asking for a list of tools in the cluster admin space and got some great pointers which I am reproducing here:
fabric, cap, func, clusterssh, sshpt, pssh, massh, clustershell, controltier, rash (related), dsh, chef knife ssh, pdsh+dshbak and of course mcollective. I was also sent a list of ssh related tools which is awesome.
The point I feel needs to be made is that in general these tools just run commands on remote servers. They are not aware of the command’s output structure, what denotes pass or fail in the context of the command etc. Basically the commands people run were designed ages ago to be looked at by human eyes and then parsed by a human mind. Yes they are easy to pipe and grep and chop up, but ultimately they were always designed to be run on one server at a time.
The parallel ssh’ers run these commands in parallel and you tend to get a mash of output. The output is mixed STDOUT and STDERR, and often output from different machines is multiplexed together, so you get a stream of text that is hard to decipher even on 2 machines, not to mention 200 at once.
Take as an example a simple yum command to install a package:
% yum install zsh
Loaded plugins: fastestmirror, priorities, protectbase, security
Loading mirror speeds from cached hostfile
372 packages excluded due to repository priority protections
0 packages excluded due to repository protections
Setting up Install Process
Package zsh-4.2.6-3.el5.i386 already installed and latest version
Nothing to do
When run on one machine you pretty much immediately know what’s going on: the package was already there so nothing got done. Now let’s see cap invoke:
# cap invoke COMMAND="yum -y install zsh"
* executing `invoke'
* executing "yum -y install zsh"
servers: ["web1", "web2", "web3"]
[web2] executing command
[web1] executing command
[web3] executing command
** [out :: web2] Loaded plugins: fastestmirror, priorities, protectbase, security
** [out :: web2] Loading mirror speeds from cached hostfile
** [out :: web3] Loaded plugins: fastestmirror, priorities, protectbase
** [out :: web3] Loading mirror speeds from cached hostfile
** [out :: web3] 495 packages excluded due to repository priority protections
** [out :: web2] 495 packages excluded due to repository priority protections
** [out :: web3] 0 packages excluded due to repository protections
** [out :: web3] Setting up Install Process
** [out :: web2] 0 packages excluded due to repository protections
** [out :: web2] Setting up Install Process
** [out :: web1] Loaded plugins: fastestmirror, priorities, protectbase
** [out :: web3] Package zsh-4.2.6-3.el5.x86_64 already installed and latest version
** [out :: web3] Nothing to do
** [out :: web1] Loading mirror speeds from cached hostfile
** [out :: web1] Install 1 Package(s)
** [out :: web2] Package zsh-4.2.6-3.el5.x86_64 already installed and latest version
** [out :: web2] Nothing to do
** [out :: web1] 548 packages excluded due to repository priority protections
** [out :: web1] 0 packages excluded due to repository protections
** [out :: web1] Setting up Install Process
** [out :: web1] Resolving Dependencies
** [out :: web1] --> Running transaction check
** [out :: web1] ---> Package zsh.x86_64 0:4.2.6-3.el5 set to be updated
** [out :: web1] --> Finished Dependency Resolution
** [out :: web1]
** [out :: web1] Dependencies Resolved
** [out :: web1]
** [out :: web1] ================================================================================
** [out :: web1] Package Arch Version Repository Size
** [out :: web1] ================================================================================
** [out :: web1] Installing:
** [out :: web1] zsh x86_64 4.2.6-3.el5 centos-base 1.7 M
** [out :: web1]
** [out :: web1] Transaction Summary
** [out :: web1] ================================================================================
** [out :: web1] Install 1 Package(s)
** [out :: web1] Upgrade 0 Package(s)
** [out :: web1]
** [out :: web1] Total download size: 1.7 M
** [out :: web1] Downloading Packages:
** [out :: web1] Running rpm_check_debug
** [out :: web1] Running Transaction Test
** [out :: web1] Finished Transaction Test
** [out :: web1] Transaction Test Succeeded
** [out :: web1] Running Transaction
** [out :: web1] Installing : zsh 1/1
** [out :: web1]
** [out :: web1]
** [out :: web1] Installed:
** [out :: web1] zsh.x86_64 0:4.2.6-3.el5
** [out :: web1]
** [out :: web1] Complete!
command finished
zlib(finalizer): the stream was freed prematurely.
zlib(finalizer): the stream was freed prematurely.
zlib(finalizer): the stream was freed prematurely.
Most of this stuff scrolled off my screen and at the end all I had was the last bit of output. I could scroll up and still figure out OK what was going on – 2 of the 3 already had it installed, one got it. Now imagine the output of 100 or 500 of these machines all mixed in. Just parsing this output would be prone to human error and you’re likely to miss that something failed.
So here is my point: your cluster management tool needs to provide an API around the everyday commands like packages, process listing etc. It should return structured data, and you could use the structured data to create tools more fit for the purpose of using on large numbers of machines. Because the output is standardized, it should provide generic tools that just do the right thing out of the box for you.
With the package example above knowing that all 500 machines had spewed out a bunch of stuff while installing isn’t important, you just want to know the result in a nice way. Here’s what mcollective does:
$ mc-package install zsh
* [ ============================================================> ] 3 / 3
web2.my.net version = zsh-4.2.6-3.el5
web3.my.net version = zsh-4.2.6-3.el5
web1.my.net version = zsh-4.2.6-3.el5
---- package agent summary ----
Nodes: 3 / 3
Versions: 3 * 4.2.6-3.el5
Elapsed Time: 16.33 s
In the case of a package you just want to know the version after the event and a summary of status. Just by looking at the stats I know the desired result was achieved; if I had different versions listed I could very quickly identify the problem ones.
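To illustrate why structured replies make that kind of summary trivial, here is a minimal Ruby sketch. The reply hashes and the summarize_versions helper are hypothetical, invented for this example; this is not MCollective's actual client code.

```ruby
# Hypothetical replies, shaped loosely like the per-host results above.
replies = [
  { sender: "web1.my.net", version: "4.2.6-3.el5" },
  { sender: "web2.my.net", version: "4.2.6-3.el5" },
  { sender: "web3.my.net", version: "4.2.6-3.el5" }
]

# Group hosts by the version they report so outliers stand out immediately.
def summarize_versions(replies)
  replies.group_by { |r| r[:version] }
         .transform_values { |rs| rs.map { |r| r[:sender] }.sort }
end

summary = summarize_versions(replies)
summary.each do |version, hosts|
  puts "#{hosts.size} * #{version}"
end
```

If one machine reported a different version it would show up as its own line with its hostname attached, rather than being buried in pages of installer output.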
Here’s another example – NRPE this time:
% mc-rpc nrpe runcommand command=check_disks
* [ ============================================================> ] 47 / 47
dev1.my.net Request Aborted
CRITICAL
Exit Code: 2
Performance Data: /=4111MB;3706;3924;0;4361 /boot=26MB;83;88;0;98 /dev/shm=0MB;217;230;0;256
Output: DISK CRITICAL - free space: / 24 MB (0% inode=86%);
Finished processing 47 / 47 hosts in 766.11 ms
Notice here that I didn’t use an NRPE-specific mc- command; I just used the generic rpc caller, and the caller knows that I am only interested in seeing the results of machines that are in WARNING or CRITICAL state. If you run this on your console you’d see the ‘Request Aborted’ in red and the ‘CRITICAL’ in yellow, immediately pulling your eye to the important information. Also note how the result shows human friendly field names like ‘Performance Data’.
The formatting, highlighting, knowledge to only show failing resources and human friendly headings all happen automatically. No programming of client side UI is required; you get the ability to do this for free simply from the fact that mcollective focuses on putting structure around outputs.
Here’s the earlier package install example with the standard rpc caller not with a specialized package frontend:
% mc-rpc package install package=zsh
Determining the amount of hosts matching filter for 2 seconds .... 47
* [ ============================================================> ] 47 / 47
Finished processing 47 / 47 hosts in 2346.05 ms
Everything worked, all 47 machines have the package installed and your desired action was taken. So there is no point in spamming you with pages of junk; who cares to see all the Yum output? Had an install failed you’d have had a usable error message just for the host that failed. The output would be equally usable on one or a thousand hosts, with very little margin for human error in knowing the result of your request.
This happens because mcollective has a standard structure of responses. Each response has an absolute success value that tells you if the request failed or not, and by using this you can get generic CLI, Web, etc tools that display large amounts of data from a network of hosts in a way that is appropriate and context aware.
For reference here’s the response as received on the client:
{:sender=>"dev1.my.net",
 :statuscode=>1,
 :statusmsg=>"CRITICAL",
 :data=>
  {:perfdata=>
    " /=4111MB;3706;3924;0;4361 /boot=26MB;83;88;0;98 /dev/shm=0MB;217;230;0;256",
   :output=>"DISK CRITICAL - free space: / 24 MB (0% inode=86%);",
   :exitcode=>2}}
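A minimal sketch of how a generic client could use that structure to show only the hosts that need attention. It assumes a statuscode of 0 means success; the second reply and the failed helper are made up for illustration and are not taken from mcollective itself.

```ruby
# Replies shaped like the response above; dev2.my.net is a made-up healthy host.
replies = [
  { sender: "dev1.my.net", statuscode: 1, statusmsg: "CRITICAL",
    data: { output: "DISK CRITICAL - free space: / 24 MB (0% inode=86%);", exitcode: 2 } },
  { sender: "dev2.my.net", statuscode: 0, statusmsg: "OK",
    data: { output: "DISK OK", exitcode: 0 } }
]

# A generic display only needs the status code to decide what to show
# (assumption: 0 is success, anything else is a failure).
def failed(replies)
  replies.reject { |r| r[:statuscode].zero? }
end

failed(replies).each do |r|
  puts "#{r[:sender]}: #{r[:statusmsg]} - #{r[:data][:output]}"
end
puts "Finished processing #{replies.size} hosts, #{failed(replies).size} failed"
```

Nothing in the filter knows about NRPE or disks, which is the point: the same few lines would work for any agent that returns replies in this shape.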
Only by thinking about CLI and admin tasks in this way do I believe we can take the Unix utilities that we call on remote hosts and turn them into something appropriate for large scale parallel use that doesn’t overwhelm the human at the other end with information. Additionally, since this is an API that is computer friendly, it makes those tools usable in many other places like code deployers – for example to enable your continuous deployment through robust use of unix tools via such an API.
There are many other advantages to this approach. Requests are authorized on a very fine level, and requests are audited. API wrappers are code that’s versioned and that can be tested in development, which makes the margin for error much smaller than just running random unix commands ad hoc. Finally, whether you’re using the code on a CLI ad hoc as above or in your continuous deployer, you share the same code that you’ve already tested and trust.
Projects are evil and must be destroyed
The majority of organisations I’ve worked with deliver new system functionality as development projects. These are funded with capex, and have a start and an end. Even projects that are ‘agile’ are still expected to finish at some date in the future, then once the system has been delivered it will undergo ‘handover’ to ‘BAU’. The project team usually moves on to new projects, developing remarkable cases of mass-amnesia along the way.
Projects deliver exactly what they promise. Project teams have little incentive to invest in the long term operation and maintenance of the systems that they create. I’m not saying that the team doesn’t care or is intentionally acting irresponsibly, but when delivery pressure is applied the first things to be dropped from the project schedule will be the cross-functional concerns that make the system reliable, monitorable, deployable, and maintainable over time.
The project effect:
- the project team do not have to live with the long term results of their own architectural and design decisions.
- BAU support/maintenance teams are generally under-resourced, have extremely limited opportunity for handover from project teams, and have to support many different systems. This usually leads to less than ideal development practices and deteriorating quality over time.
- the project team never have to be involved in problem analysis for production outages. They’re never forced to put the right kind of monitoring and logging in place to find root causes.
- the project team only do a limited number of releases to production, so have little incentive to invest in reliable automation or production-like test environments.
Therefore – I believe that many projects are the source of ‘instant legacy’, and a major cause of the development and operations divide.
What’s the alternative? Form long-lived teams around applications/products, or sets of features. A team works from a prioritised backlog of work that contains a mix of larger initiatives, minor enhancements, or BAU-style bug fixes and maintenance. Second-level support should be handled by people in the product team. Everyone in the team should work with common process and a clear understanding of technical design and business vision.
This approach is not easy – it introduces new challenges particularly around balancing priorities and budgeting. I’ve observed that the benefits in terms of long term system health definitely outweigh the drawbacks. Like everything – hire good people who care, and give them the right incentives, good things will happen.
Videos from DevOps Day 2010 panels!

InfoQ.com has posted the videos they recorded at DevOps Day USA 2010. You can watch six of the seven panels now on the InfoQ.com site. There was a production problem with the seventh panel ("DevOps outside of WebOps") that, if it can be fixed, will be posted as well. InfoQ decided that the lightning talks didn't fit into their format, so they have sent my co-organizer, Andrew Shafer, the raw video and he's going to look into posting them himself.
You can also download audio only versions (.mp3)
Here are the links to the 6 panels...
EDIT: The recording for seventh panel was rescued from technical oblivion and is now live!...
DevOps outside of Web Operations: Much of the public discussion about DevOps focuses on Web Operations. This panel is about taking the lessons of DevOps to other types of IT.
Adam Fletcher - ITA Software
Gene Kim – Tripwire
Michael Stahnke -
James Turnbull – Puppet Labs
moderator: Patrick Debois
http://www.infoq.com/presentations/DevOps-outside-Web-Operations
Marionette Collective version 0.4.8
I just released version 0.4.8 of mcollective. It’s a small maintenance release fixing a few bugs and adding a few features. I wasn’t planning on another 0.4.x release before the big 1.0.0 but I want to keep 1.0.0 as close as possible to something that’s been out there for a while.
The only major feature it introduces is custom reports of your infrastructure.
It supports two types of scriptlet for building reports. The first is a little DSL that uses printf style format strings:
inventory do
    format "%s:\t\t%s\t\t%s"

    fields { [ identity, facts["serialnumber"], facts["productname"] ] }
end
Which does something like this:
$ mc-inventory --script hardware.mc
web1: KKxxx1H IBM eServer BladeCenter HS20 -[8832M1X]-
rep1: KKxxx5Z IBM eServer BladeCenter HS20 -[8832M1X]-
db4: KDxxxZY IBM System x3655 -[794334G]-
man2: KDxxxR0 eserver xSeries 336 -[88372CY]-
db2: KDxxxGD IBM System x3655 -[79855AG]-
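For anyone unfamiliar with printf style format strings, this plain Ruby sketch shows the idea behind the first scriptlet type: each node's selected fields are substituted into the format string. The nodes array is hypothetical stand-in data, and this is not how mc-inventory is implemented internally.

```ruby
# Hypothetical node data standing in for what discovery would return.
nodes = [
  { identity: "web1",
    facts: { "serialnumber" => "KKxxx1H",
             "productname" => "IBM eServer BladeCenter HS20 -[8832M1X]-" } },
  { identity: "db4",
    facts: { "serialnumber" => "KDxxxZY",
             "productname" => "IBM System x3655 -[794334G]-" } }
]

format_string = "%s:\t\t%s\t\t%s"

# Apply the format string to the chosen fields of every node,
# much as the `fields` block in the DSL picks values per node.
lines = nodes.map do |node|
  fields = [node[:identity], node[:facts]["serialnumber"], node[:facts]["productname"]]
  format(format_string, *fields)
end

puts lines
```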
The other – perhaps more ugly – is using a Perl like format method. To use this you need the formatr gem installed, and a report might look like this:
formatted_inventory do
    page_length 20

    page_heading <<TOP

            Node Report @<<<<<<<<<<<<<<<<<<<<<<<<<
                        time

Hostname:         Customer:     Distribution:
-------------------------------------------------------------------------
TOP

    page_body <<BODY

@<<<<<<<<<<<<<<<< @<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
identity,    facts["customer"], facts["lsbdistdescription"]
                                @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
                                facts["processor0"]
BODY
end
And the resulting report is something like this:
$ mc-inventory --script hardware.mc
Node Report Fri Aug 20 21:49:39 +0100
Hostname: Customer: Distribution:
-------------------------------------------------------------------------
web1 rip CentOS release 5.5 (Final)
Intel(R) Xeon(R) CPU L5420
web2 xxxxxxx CentOS release 5.5 (Final)
Intel(R) Xeon(R) CPU X3430
The report will be paged 20 nodes per page. The result is very pleasing even if the report format is a bit grim, but it would be much worse to write yet another reporting DSL!
See the full release notes for details on bug fixes and other features.
Rapid Puppet runs with MCollective
The typical Puppet use case is to run the daemon every 30 minutes or so and just let it manage your machines. Sometimes though you want to be able to run it on all your machines as quickly as your puppet master can handle.
This is tricky as you generally do not have a way to cap the concurrency and it’s hard to orchestrate that. I’ve extended the MCollective Puppet Agent to do this for you so you can do a rapid run at roll out time and then go back to the more conservative slow pace once your window is over.
The basic logic I implemented is this:
- Discover all nodes, sort them alphabetically
- Count how many nodes are active now, wait till it’s below threshold
- Run a node by just starting a --onetime background run
- Sleep a second
This should churn through your nodes very quickly without overwhelming the resources of your master. You can see it in action here: it started 3 nodes, and by the time it got to the 4th, 3 were already running, so it waited for one of them to finish:
% mc-puppetd -W /dev_server/ runall 2
Thu Aug 05 17:47:21 +0100 2010> Running all machines with a concurrency of 2
Thu Aug 05 17:47:21 +0100 2010> Discovering hosts to run
Thu Aug 05 17:47:23 +0100 2010> Found 4 hosts
Thu Aug 05 17:47:24 +0100 2010> Running dev1.one.net, concurrency is 0
Thu Aug 05 17:47:26 +0100 2010> dev1.one.net schedule status: OK
Thu Aug 05 17:47:28 +0100 2010> Running dev1.two.net, concurrency is 1
Thu Aug 05 17:47:30 +0100 2010> dev1.two.net schedule status: OK
Thu Aug 05 17:47:32 +0100 2010> Running dev2.two.net, concurrency is 2
Thu Aug 05 17:47:34 +0100 2010> dev2.two.net schedule status: OK
Thu Aug 05 17:47:35 +0100 2010> Currently 3 nodes running, waiting
Thu Aug 05 17:48:00 +0100 2010> Running dev3.two.net, concurrency is 2
Thu Aug 05 17:48:05 +0100 2010> dev3.two.net schedule status: OK
This is integrated into the existing mc-puppetd client script, so you don’t need to roll out anything new to your servers, just the client side.
Using this to run each of 47 machines with a concurrency of just 4 I was able to complete a cycle in 8 minutes. That doesn’t sound too impressive, but my average run time is around 40 seconds on every node, with some being 90 to 150 seconds. My puppetmaster server, which usually sits at a steady 0.2mbit out, was serving a constant 2mbit/sec for the duration of this run.
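The basic logic listed earlier can be sketched in a few lines of Ruby. This is a simulation only, not the agent's real code: rapid_run, run_length and pause are hypothetical names, a "run" is just a counter that expires after a few polls, and the real client would ask the nodes themselves how many runs are active.

```ruby
# Simulated rapid run: start nodes in alphabetical order while keeping
# the number of active runs below `concurrency`.
# run_length: how many polls a simulated run lasts (assumption for the demo).
# pause: seconds to sleep between steps (0 keeps the demo instant).
def rapid_run(nodes, concurrency, run_length: 3, pause: 0)
  running = {}   # node => polls left before its simulated run finishes
  started = []   # [node, active runs seen when it was started]

  nodes.sort.each do |node|
    # Wait until the number of active runs drops below the threshold.
    while running.size >= concurrency
      sleep pause
      running.keys.each { |n| running[n] -= 1 }
      running.delete_if { |_, polls_left| polls_left <= 0 }
    end

    started << [node, running.size]
    running[node] = run_length   # stands in for starting a background run
    sleep pause
  end

  started
end

rapid_run(%w[dev1.two.net dev1.one.net dev3.two.net dev2.two.net], 2).each do |node, active|
  puts "Running #{node}, concurrency is #{active}"
end
```

The only tuning knob that matters in practice is the concurrency value: set it to what your master can comfortably serve and the loop takes care of the pacing.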
Puppet Camp – San Francisco 2010
Making machine metadata visible
I’m quite the fan of data, metadata and querying these to interact with my infrastructure rather than interacting by hostnames and wanted to show how far I am down this route.
This is more an iterative ongoing process than a fully baked idea at this point since the concept of hostnames is so heavily embedded in our Sysadmin culture. Today I can’t yet fully break away from it due to tools like nagios etc still relying heavily on the hostname as the index but these are things that will improve in time.
The background is that in the old days we attempted to capture a lot of metadata in hostnames, domain names and so forth. This was kind of OK since we had static networks with relatively small numbers of hosts. Today we do ever more complex work on our servers and we have more and more servers. The advent of cloud computing has also brought with it a whole new pain of unpredictable hostnames, rapidly changing infrastructures and a much bigger emphasis on role based computing.
My metadata about my machines comes from 3 main sources:
- My Puppet manifests – classes and modules that get put on a machine
- Facter facts with the ability to add many per machine easily
- MCollective stores the metadata in a MongoDB and lets me query the network in real time
Puppet manifests based on query
When setting up machines I keep some data like database master hostnames in extlookup, but in many cases I am now moving to a search based approach to finding resources. Here’s a sample manifest that will find the master database for a customer’s development machines:
$masterdb = search_nodes("{'facts.customer': '${customer}', 'facts.environment': '${environment}', classes: 'mysql::master'}")
This is a MongoDB query against my infrastructure database. It will find, for a given node, the name of a node that has the class mysql::master on it; by convention there should be only one per customer in my case. When using it in a template I can get back full objects with all the metadata for a node. Hopefully with Puppet 2.6 I can get full hashes into puppet too!
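To make the matching concrete without needing a database, here is a small in-memory Ruby sketch of the same kind of lookup. The NODES records and the find_nodes helper are hypothetical; the real search_nodes function queries MongoDB and its implementation is not shown here.

```ruby
# Hypothetical node documents holding facts and classes per machine.
NODES = [
  { "fqdn" => "db1.acme.net",
    "facts" => { "customer" => "acme", "environment" => "development" },
    "classes" => ["mysql::master"] },
  { "fqdn" => "db2.acme.net",
    "facts" => { "customer" => "acme", "environment" => "development" },
    "classes" => ["mysql::slave"] },
  { "fqdn" => "db1.other.net",
    "facts" => { "customer" => "other", "environment" => "production" },
    "classes" => ["mysql::master"] }
].freeze

# Return the nodes whose facts all match and which carry the given class.
def find_nodes(nodes, facts:, klass:)
  nodes.select do |node|
    facts.all? { |name, value| node["facts"][name] == value } &&
      node["classes"].include?(klass)
  end
end

masters = find_nodes(NODES,
                     facts: { "customer" => "acme", "environment" => "development" },
                     klass: "mysql::master")
puts masters.map { |n| n["fqdn"] }
```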
Making Metadata Visible
With machines doing a lot of work, filling a lot of roles etc and with more and more machines you need to be able to tell immediately what machine you are on.
I do this in several places, first my MOTD can look something like this:
Welcome to Synchronize Your Dogmas
hosted at Hetzner, Germany
Puppet Modules:
- apache
- iptables
- mcollective member
- xen dom0 skeleton
- mw1.xxx.net virtual machine
I build this up using snippets from my concat module; each important module like apache can just put something like this in:
motd::register{"Apache Web Server": }
Because this is managed by my snippet library, the MOTD will automatically update if you just remove the include line from the manifests.
With a big block of welcome done, I now also need to be able to show in my prompts what a machine does, who it’s for and, importantly, what environment it is in.

Above is a shot of 2 prompts in different environments; you can see the customer name, environment and major modules. As with the motd, I have a prompt::register define that modules use to register into the prompt.
SSH Based on Metadata
With all this meta data in place, mcollective rolled out and everything integrated it’s very easy to now find and access machines based on this.
MCollective does real time resource discovery, so keeping with the mysql example above from puppet:
$ mc-ssh -W "environment=development customer=acme mysql::master"
Running: ssh db1.acme.net
Last login: Thu Jul 29 00:22:58 2010 from xxxx
$
Here I am ssh’ing to a server based on a query. If it found more than one machine matching the query, a menu would be presented offering me a choice.
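A rough sketch of the two ideas in that example: splitting a filter string into fact matches and class names, and falling back to a menu when several hosts match. The parsing rule (terms containing = are facts, everything else is a class) and both helper names are assumptions made for illustration, not mc-ssh's actual code.

```ruby
# Split a filter string into fact matches (key=value) and class names.
# Assumption: any term containing "=" is a fact, anything else is a class.
def parse_filter(filter)
  facts = {}
  classes = []
  filter.split.each do |term|
    if term.include?("=")
      key, value = term.split("=", 2)
      facts[key] = value
    else
      classes << term
    end
  end
  { facts: facts, classes: classes }
end

# Connect straight away when exactly one host matches, otherwise list a menu.
def choose_host(hosts)
  return hosts.first if hosts.size == 1
  hosts.each_with_index { |host, i| puts "#{i + 1}) #{host}" }
  nil # a real tool would prompt for a choice here
end

p parse_filter("environment=development customer=acme mysql::master")
```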
Monitoring Based on Metadata
Finally setting up monitoring and keeping it in sync with reality can be a big challenge especially in dynamic cloud based environments, again I deal with this through discovery based on meta data:
$ check-mc-nrpe -W "environment=development customer=acme mysql::master" check_load
check_load: OK: 1 WARNING: 0 CRITICAL: 0 UNKNOWN: 0|total=1 ok=1 warn=0 crit=0 unknown=0 checktime=0.612054
Summary
This is really the tip of the iceberg. There is a lot more that I already do – like scheduling puppet runs on groups of machines based on metadata – but also a lot more to do; this really is early days down this route. I am very keen to get views from others who are struggling with shortcomings in hostname based approaches and to hear how they deal with it.
Monitoring ActiveMQ
I have a number of ActiveMQ servers, 7 in total: 3 in a network of brokers, the rest standalone. For MCollective I use topics extensively so I don’t really need to monitor them much other than for availability. I also do a lot of queued work though, where lots of machines put data in a queue and others process the data.
In the queue scenario you absolutely need to monitor queue sizes, memory usage and such. You also need to graph things like rates of messages, consumer counts and memory use. I am busy writing a number of Nagios and Cacti plugins to help with this; you can find them on GitHub.
To use these you need to have the ActiveMQ Statistics Plugin enabled.
First we need to monitor queue sizes:
$ check_activemq_queue.rb --host localhost --user nagios --password passw0rd --queue exim.stats --queue-warn 1000 --queue-crit 2000
OK: ActiveMQ exim.stats has 1 messages
This will connect to localhost monitoring a queue exim.stats warning you when it’s got 1000 messages and critical at 2000.
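The threshold logic behind a check like this is small enough to sketch. The following Ruby is an illustration only: check_queue is a hypothetical helper, the choice of >= for the comparisons is an assumption, and the real plugin fetches the queue size from ActiveMQ rather than taking it as an argument. The exit codes are the standard Nagios plugin ones.

```ruby
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Map a queue size to a status line and exit code using warn/crit thresholds.
# Assumption: a size equal to a threshold already triggers that state.
def check_queue(name, size, warn:, crit:)
  if size >= crit
    ["CRITICAL: ActiveMQ #{name} has #{size} messages", CRITICAL]
  elsif size >= warn
    ["WARNING: ActiveMQ #{name} has #{size} messages", WARNING]
  else
    ["OK: ActiveMQ #{name} has #{size} messages", OK]
  end
end

message, code = check_queue("exim.stats", 1, warn: 1000, crit: 2000)
puts message
# A real plugin would finish with `exit code` so Nagios sees the state.
```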
I need to add to this the ability to monitor memory usage, this will come over the next few days.
I also have a plugin for Cacti. It can output stats for the broker as a whole and also for a specific queue. First the whole broker:
$ activemq-cacti-plugin.rb --host localhost --user nagios --password passw0rd --report broker
stomp+ssl:stomp+ssl storePercentUsage:81 size:5597 ssl:ssl vm:vm://web3 dataDirectory:/var/log/activemq/activemq-data dispatchCount:169533 brokerName:web3 openwire:tcp://web3:6166 storeUsage:869933776 memoryUsage:1564 tempUsage:0 averageEnqueueTime:1623.90502285799 enqueueCount:174080 minEnqueueTime:0.0 producerCount:0 memoryPercentUsage:0 tempLimit:104857600 messagesCached:0 consumerCount:2 memoryLimit:20971520 storeLimit:1073741824 inflightCount:9 dequeueCount:169525 brokerId:ID:web3-44651-1280002111036-0:0 tempPercentUsage:0 stomp:stomp://web3:6163 maxEnqueueTime:328585.0 expiredCount:0
Now a specific queue:
$ activemq-cacti-plugin.rb --host localhost --user nagios --password passw0rd --report exim.stats
size:0 dispatchCount:168951 memoryUsage:0 averageEnqueueTime:1629.42897052992 enqueueCount:168951 minEnqueueTime:0.0 consumerCount:1 producerCount:0 memoryPercentUsage:0 destinationName:queue://exim.stats messagesCached:0 memoryLimit:20971520 inflightCount:0 dequeueCount:168951 expiredCount:0 maxEnqueueTime:328585.0
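If you want to consume that output from your own scripts, the name:value pairs are easy to pick apart. A minimal Ruby sketch follows; parse_stats is a hypothetical helper, and the sample line is a shortened copy of the queue output above.

```ruby
# Parse "name:value name:value ..." into a hash, splitting on the first
# colon only so values such as queue://exim.stats keep their own colons.
def parse_stats(line)
  line.split.map { |pair| pair.split(":", 2) }.to_h
end

stats = parse_stats("size:0 consumerCount:1 destinationName:queue://exim.stats memoryLimit:20971520")
puts stats["destinationName"]
puts stats["memoryLimit"].to_i / (1024 * 1024)   # memory limit in MB
```

Values stay as strings until you convert them, which keeps the parser generic across the broker and queue reports.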
Grab the code on GitHub and follow there, I expect a few updates in the next few weeks.
DevOps (live) at OSCON
Early reports from OSCON are that DevOps is a topic of much discussion. My fellow dev2ops.org contributor Alex Honor and I are headed to Portland this morning to give DevOps related talks at OSCON. If you are there Wednesday or Thursday, please come by and say hello!
Wednesday (7/21) 1:40pm in room Portland 251 is Alex's presentation...
Open Source Tool Chains for Cloud Computing
Thursday (7/22) 10:40am in room D135 is Damon's presentation...
The IT Philharmonic: How Out of Tune Are Your Operations?
Both talks feature lots of new content (even though the titles and outdated descriptions on the OSCON site are similar to our Velocity talks).
