Command-line cookbook dependency solving with knife exec

Note: This article was originally published in 2011. In response to demand, I've updated it for 2014! Enjoy! SNS

Imagine you have a fairly complicated infrastructure with a large number of nodes and roles. Suppose you have a requirement to take one of the nodes and rebuild it in an entirely new network, perhaps even for a completely different organization. This should be easy, right? We have our infrastructure in the form of code. However, our current infrastructure has hundreds of uploaded cookbooks - how do we know the minimum ones to download and move over? We need to find out from a node exactly what cookbooks are needed for that node to be built.

The obvious place to start is with the node itself:

$ knife node show controller
Node Name:   controller
Environment: _default
FQDN:        controller
IP:          182.13.194.41
Run List:    role[base], recipe[apt::cacher], role[pxe_server]
Roles:       pxe_server, base
Recipes:     apt::cacher, pxe_dust::server, dhcp, dhcp::config
Platform:    ubuntu 10.04

OK, this tells us we need the apt, pxe_dust and dhcp cookbooks. But what about them - do they have any dependencies? How could we find out? Well, dependencies are specified in two places - in the cookbook metadata, and in the individual recipes.
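
For reference, here are the two forms in question - a cookbook-level dependency declared in metadata.rb, and a recipe-level dependency pulled in with include_recipe. The version constraint below is purely illustrative:

# metadata.rb - declares a cookbook-level dependency
depends "apache2", ">= 0.99.0"

# recipes/default.rb - pulls in another recipe at converge time
include_recipe "tftp::server"

Here's a primitive way to hunt for both forms across our cookbooks: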

bash-3.2$ for c in apt pxe_dust dhcp
> do
> grep -iER 'include_recipe|^depends' $c/* | cut -d '"' -f 2 | sort | uniq
> done
apt::cacher-client
apache2
pxe_dust::server
tftp
tftp::server
utils

As I said - primitive. However, the problem doesn't end here. To be sure, we now need to repeat this for each dependency, recursively. And of course it would be nice to present the results more attractively. Thinking about it, it would be rather useful to know what cookbook versions are in use too. This is definitely not a job for a shell one-liner - is there a better way?

As it happens, there is. Think about it - the Chef server already needs to solve these dependencies to know what cookbooks to push to API clients. Can we access this logic? Of course we can - clients carry out all their interactions with the Chef server via the API. This means we can let the server solve the dependencies and query it via the API ourselves.

Chef provides two powerful ways to access the API without having to write a RESTful client. The first, Shef (renamed chef-shell in Chef 11), is an interactive REPL based on IRB which, when launched, gives access to the Chef server. This isn't trivial to use. The second, much simpler way is the knife exec subcommand. This allows you to write Ruby scripts or simple one-liners that are executed in the context of a fully configured Chef API client, using the knife configuration file.
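
As a quick taste of what the api object gives you inside knife exec, a one-liner along these lines lists the node names the server knows about (the exact output shape varies with your Chef version):

$ knife exec -E 'pp api.get("nodes").keys'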

Now, since I wrote this article back in summer 2011, the API has changed, which means that my original method no longer works. Additionally, we are now served by at least two local dependency solvers: Berkshelf (whose dependency solver, 'solve', is now available as an individual gem) and Librarian-Chef. In this updated version, I'll show how to use the new Chef server API to perform the same function. Berkshelf and Librarian solve a slightly different problem - here we're trying to solve dependencies for a specific node - so for the purposes of this article I'll consider them out of scope.

For historical purposes, here's the original solution:

knife exec -E '(api.get "nodes/controller/cookbooks").each { |cb| pp cb[0] => cb[1].version }'

The /nodes/NODE_NAME/cookbooks endpoint returns the cookbook attributes, definitions, libraries and recipes that are required for this node. The response is a hash of cookbook name and Chef::CookbookVersion object. We simply iterate over each one, and pretty print the cookbook name and the version.

Let's give it a try:

$ knife exec -E '(api.get "nodes/controller/cookbooks").each { |cb| pp cb[0] => cb[1].version }'
{"apt"=>"1.1.1"}
{"tftp"=>"0.1.0"}
{"apache2"=>"0.99.3"}
{"dhcp"=>"0.1.0"}
{"utils"=>"0.9.5"}
{"pxe_dust"=>"1.1.0"}

The current way to solve dependencies using the Chef server API resides under the environments endpoint. This makes sense, if you think of environments as a way to define and constrain version numbers for a given set of nodes. Constructing the API call and handling the results is slightly more than can easily be comprehended in a one-liner, which gives us the opportunity to demonstrate the use of knife exec with a script on the filesystem.
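
Before scripting it, you can poke this endpoint directly. Assuming a reasonably recent knife, something like the following should work. Note that the run list posted here must already be expanded to plain recipes - the server won't expand roles for you at this endpoint - and the recipes are just those from our earlier example:

$ echo '{"run_list": ["apt::cacher", "pxe_dust::server", "dhcp"]}' > run_list.json
$ knife raw -m POST -i run_list.json /environments/_default/cookbook_versions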

First let's create the script:

USAGE = "knife exec script.rb NODE_NAME"

def usage_and_exit
  STDERR.puts USAGE
  exit 1
end

# ARGV for a knife exec script includes the knife subcommand itself,
# so ARGV[2] is the first argument after the script name.
node_name = ARGV[2]

usage_and_exit unless node_name

# Fetch the node object, then expand its run list - resolving roles,
# recursively, into recipes - as the server would.
node = api.get("nodes/#{node_name}")
run_list_expansion = node.expand!("server")

# Ask the server to solve cookbook versions for the expanded run list
# in the node's current environment.
cookbook_solution = api.post("environments/#{node.chef_environment}/cookbook_versions",
                             :run_list => run_list_expansion.recipes)

cookbook_solution.each do |name, cb|
  puts name + " => " + cb.version
end

exit

The way knife exec scripts work is to pass the arguments following knife to Ruby as the ARGV special variable, which is an array of each space-separated argument. This allows us to produce a slightly more general solution, to which we can pass the name of the node for which we want to solve. The usage handling is obvious - we print the usage to stderr if the command is called without a node name. The meat of the script is the API calls. First we get the node object (from ARGV[2], i.e. the node we passed to the script) from the Chef server. Next we expand the run list - this means checking for and expanding any run lists in roles. Then we call the API to provide us with cookbook versions for the specified node in the environment in which the node currently resides, passing in the recipes from the expanded run list. Finally we iterate over the cookbooks we get back, and print the name and version. Note that this script could easily be modified to solve for a different environment, which would be handy if we wanted to confirm what versions we'd get were we to move the node elsewhere - there's a sketch of this after the example output. Let's give it a whirl:

$ knife exec src/knife-cookbook-solve/solve.rb asl-dev-1
chef_handler => 1.1.4
minitest-handler => 0.1.3
base => 0.0.2
hosts => 0.0.1
yum => 2.3.0
tmux => 1.1.1
ssh => 0.0.6
fail2ban => 1.2.2
users => 2.0.6
security => 0.1.0
sudo => 2.0.4
atalanta-users => 0.0.2
community_users => 1.5.1
sudoersd => 0.0.2
build-essential => 1.4.2
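
As mentioned above, solving for a different environment is a small change. Here's a minimal sketch - the optional ENVIRONMENT argument is my own addition, not part of the original script:

# Usage: knife exec solve.rb NODE_NAME [ENVIRONMENT]
node_name = ARGV[2]

node = api.get("nodes/#{node_name}")
run_list_expansion = node.expand!("server")

# Solve against the named environment if one was given (my addition),
# otherwise against the node's current environment.
environment = ARGV[3] || node.chef_environment

cookbook_solution = api.post("environments/#{environment}/cookbook_versions",
                             :run_list => run_list_expansion.recipes)

cookbook_solution.each { |name, cb| puts "#{name} => #{cb.version}" }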

To conclude as the original article did... Nifty! :)

Why Monitoring Sucks

Why Monitoring Sucks (and what we're doing about it)

About two weeks ago someone made a tweet. At this point, I don't remember who said it, but the gist was that "monitoring sucks". I happened to be knee-deep in frustrating bullshit around that topic, and was evaluating the same effing tools I'd evaluated at every other company over the past 10 years or so. So I did what seems to be S.O.P. for me these days. I started something.

But does monitoring REALLY suck?

Heck no! Monitoring is AWESOME. Metrics are AWESOME. I love it. Here's what I don't love:

  • Having my hands tied with the model of host and service bindings
  • Having to set up "fake" hosts just to group arbitrary metrics together
  • Having to collect metrics twice - once for alerting and another for trending
  • Only being able to see my metrics in 5 minute intervals
  • Having to choose between shitty interface but great monitoring, or shitty monitoring but great interface
  • Dealing with a monitoring system that thinks IT is the system of truth for my environment
  • Perl (I kid...sort of)
  • Not actually having any real choices

Yes, yes I know:

You can just combine Nagios + collectd + graphite + cacti + pnp4nagios and you have everything you need!

Seriously? Kiss my ass. I'm a huge fan of the Unix pipeline philosophy but, christ, have you ever heard the phrase "antipattern"?

So what the hell are you going to do about it?

I'm going to let smart people be smart and do smart things.

Step one was getting everyone who had similar complaints together on IRC. That went pretty damn well. Step two was creating a github repo. Seriously. Step two should ALWAYS be "create a github repo". Step three? Hell if I know.

Here's what I do know. There are plenty of frustrated system administrators, developers, engineers, "devops" and everything under the sun who don't want much. All they really want is for shit to work. When shit breaks, they want to be notified. They want pretty graphs. They want to see business metrics alongside operational ones. They want to have a 52-inch monitor in the office that everyone can look at and say:

See that red dot? That's bad. Here's what was going on when we got that red dot. Let's fix that shit and go get beers

About the "repo"

So the plan I have in place for the repository is this. We don't really need code. What we need is an easy way for people to contribute ideas. The plan is partially underway. There's now a monitoringsucks organization on Github. Pretty much anyone who is willing to contribute can get added to the team. The idea is that, as smart people think of smart shit, we can create a new repository under some unifying idea and put blog posts, submodules, reviews, ideas... whatever into that repository, so people have an easy place to go get information. I'd like to assign someone per repository to be the owner. We're all busy, but this is something we're all highly interested in. If we spread the work out and allow easy contribution, then we can get some real content up there.

I also want to keep the repos as light and cacheable as possible. The organization is under the github "free" plan right now and I'd like to keep it that way.

Blog Posts Repo

This repo serves as a place to collect general information about blog posts people come across. Think of it as hyper-local delicious in a DVCS.

Currently, by virtue of the first commit, Michael Conigliaro is the "owner". You can follow him on twitter and github as @mconigliaro

IRC Logs Repo

This repo is a log of any "scheduled" irc sessions. Personally, I don't think we need a distinct #monitoringsucks channel but people want to keep it around. The logs in this repo are not full logs. Just those from when someone says "Hey smart people. Let's think of smart shit at this date/time" on twitter.

Currently I appear to be the owner of this repo. I would love for someone who can actually make the logs look good to take this over.

Tools Repo

This repo is really more of a "curation" repo. The plan is that each directory is the name of some tool, with two things in it (see the sketch after this list):

  • A README.md as a review of the tool
  • A submodule link to the tool's repo (where appropriate)
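
Purely as an illustration - the tool names here are invented - a checkout of the tools repo might look like:

tools/
  nagios/
    README.md    (review of the tool)
    nagios/      (submodule pointing at the tool's own repo)
  collectd/
    README.md
    collectd/    (submodule)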

Again, I think I'm running point on this one. Please note that the submodule links APPEAR to have some sort of UI issue on github. Every submodule appears to point to Dan DeLeo's 'critical' project.

Metrics Catalog Repo

This is our latest member and it already has an official manager! Jason Dixon (@obfuscurity on github/twitter - jdixon on irc) suggested it, so he gets to run it ;) The idea here is that this will serve as a set of best practices around what metrics you might want to collect and why. I'm leaving the organization up to Jason, but I suggested a per-app/service/protocol directory.

Wrap Up

So that's where we are. Where it goes, I have no idea. I just want to help wherever I can. If you have any ideas, hit me up on twitter/irc/github/email and let me know. It might help to know that if you suggest something, you'll probably be made the person responsible for it ;)

Update!

It was our good friend Sean Porter (@portertech on twitter) that we have to thank for all of this ;)

Update (again)

It was kindly pointed out that I never actually included a link to the repositories. Here they are:

https://github.com/monitoringsucks

Building a Devops team

This is a guest post by Brian Henerey, from Sony Computer Entertainment Europe.

Background

I've had 3 roles at Sony since joining in August 2008. Nearly a year ago I took over the management of the original engineering team I joined. This was a failing team by any definition, but I was excited about the opportunity to reshape it. I knew the remaining team was deeply unhappy and likely to quit at any moment, so I had a few immediate goals:

  • Hire!
  • Keep people from quitting.
  • Hire!

Side story: I stumbled on one important objective I didn't list, however: keep customers happy. It doesn't matter how awesome you think your team can be if no one wants to work with you based on past experiences. I didn't appreciate how much a demotivated employee could jeopardise customer relationships by virtue of not caring. It has taken me months to restore trust with one customer. I've heard a story about a manager offering employees £500 to quit on a regular basis. I think that probably has some practical problems, but it's a tempting idea to cull the unmotivated.

I come from a long background of small/medium size enterprises. It has been a challenge adapting to a large corporation, but I don't think there's much unique to Sony about the anti-Devops patterns I've encountered. I know several people in small companies who say they've been practicing Devops before there was such a word, and I completely agree. The trouble of silos, bureaucracy, organizational boundaries, politics, etc, seems pretty common in larger businesses though. I can't speak to how to create a Devops culture across a large organisation from the top down, but I've been working really hard to create one from the inside.

The beginning

A year ago I'd never heard of the term Devops. If you're in the same boat, it is easy to find a great deal to read about what Devops is - and what it is not.

However, I suspect some people will have trouble finding the read-worthy gems amongst all the chatter. Here's a good place to get started: getting started with devops. The gigantic list of Devops related bookmarks compiled by Patrick Debois shows why you may not want to try and read everything: devops bookmarks

If you're in the know already and Devops resonates with you, and you want to build a team around the concept, here's how I went about it.

Networking

The term Devops didn't really take shape for me until I started to talk about it with others. Fortunately, London has a really active Devops community, so I've had ample opportunity. The tireless Gareth Rushgrove organises many events, and The Guardian is a frequent host. I've been to sessions discussing Continuous Integration, Deployments, Google App Engine, Load Balancers, Chef, CloudFoundry, etc. I've found people to be incredibly open about technology, processes, culture, difficulties and successes they've had.

While Devops is of course about more than technology and tools, I personally have found Devops to be an excellent banner under which to have really interesting conversations. Having a forum which brings people from diverse backgrounds together has helped me shape my own internal understanding of what Devops should be about.

I felt a bit of an imposter going to the initial London Devops meetups because I was so keen on recruiting. However, the quality of the discussions has been so good I eagerly anticipate each upcoming meetup even though I'm no longer hiring. I've also discovered that half the attendees are also hiring. It's a Devopsee's market.

Result!: I met and subsequently hired Stephen Nelson-Smith from Atalanta-Systems. (He's @Lordcope on twitter, and the author of agilesysadmin.net.)

Working definition of Devops

If you're going to hire people with Devops in mind, it's good to have a working definition. I like the pillars of Devops (CAMS) put forth by John Willis: what devops means to me

  • Culture
  • Automation
  • Measurement
  • Sharing

SMAC might have been a better acronym, but I'll go with CAMS.

A Devops job spec

I don't think Devops is a role, though I've seen job postings for such a thing. I only mentioned that I was looking for someone 'Devops-savvy', and later changed it to 'Devops-minded' or something similar. The job posting expired and I'd have to dig it out, but R.I. Pienaar described it on Twitter as the 'perfect devops job posting'. I'm pretty keen on revising a job spec until the requirements are only things I actually require and can measure against. Saying that, how to write a job spec is way outside the scope of this post. To summarize, I was looking for:

  • problem solving skills
  • 'can do' attitude
  • good team fit (really hard to quantify)
  • a broad set of skills (LAMP, Java, C++, Ruby, Python, Oracle, Scaling/Capacity, High-Availability, etc, etc)

My team works on a ton of different technology stacks, and the landscape is constantly changing. It's a techie-dream job, but the interpersonal skills are the most important.

Recruiters

I strongly believe in giving recruiters a fair bit of my time. I've seen many people be rude to recruiters, ignore them, etc, and then wonder why they don't get good candidates through. I'm quite keen on engaging the recruiters, explaining the role I'm trying to fill thoroughly, and having the occasional coffee or beer with them. Feedback is of course vital to candidates, and I try to give it honestly and quickly, letting the recruiter worry about sugar coating things.

CV selection

This is tough. I regularly get CV blindness where everyone starts to look the same. And generally ill-suited. I try to remember there are human beings on the other end and force myself to have concrete reasons why I'm rejecting someone. Talking to a recruiter about this helps me be concrete.

First interview - remote technical test

This is where things get interesting! I don't know if this is unique to London, but I've had a LOT of candidates from other countries apply to join this team. For anyone with a good CV whose English language skills the recruiter vouches for, I developed a great screening test which can be conducted remotely. This saves a trip to London + hotel, and I can end it promptly if things aren't going well. Here's how it works:

  • I email the candidate/recruiter a url to an ec2 instance that I spin up on the day about 20 minutes before the interview.
  • The instance is running a web server which serves instructions for the test. These only state that the candidate will need a terminal such as Putty if they're on Windows.
  • At the arranged time I phone the candidate. I explain that there will be two tests. The first is a sys admin task which will be time bound to 20 minutes. The second is a programming task which they can use the remainder of the time to complete. The call will end after 1 hour.
  • I explain the rules: They are to perform all of their work on the ec2 instance. They have a test account/password, and sudo root access. They can use any resources they want to solve the problems. Google, man pages, libraries are not only fair game, but fully expected.
  • I explain what I want from them: They need to talk to me, tell me what they are thinking, and walk me through the problem solving process. I'm far more interested in that dialogue than whether they solve either problem I give them.
  • I also add that we're using Screen, and I can see everything they type.
  • I swap in the index.html containing the complete instructions, make note of the time, and let them begin.

The problems

1) It's really quite simple: install Wordpress and configure it to work properly. The catch is that we install mysql first, break it, and then watch as candidates wonder what the heck is going on. For an experienced sysadmin this is child's play, but I tended to interview people with stronger development backgrounds who were less familiar with installing applications. I could tell almost immediately how well someone knew their way around a Linux system. It was interesting to see what kinds of assumptions people made about the system itself (I never mentioned the OS that was running. Several just assumed Ubuntu.) Some people read instructions, some don't. I give people the mysqladmin password, but some people search on how to reset a lost password because they didn't read what I gave them. I had one guy spend 10 minutes trying to ssh to http://ec2....... I gave him a pass on nerves, but he continued to suck and I ended it soon thereafter. He blamed the language barrier (Eastern European), and said if only I had been more clear to him. If I can't communicate with him, I think that's a pretty big problem and it doesn't really matter whose fault it is.

2) We provide sanitized production Tomcat logs for a real application we support, and ask the candidate to write a log parsing script in a language of their choice. We want the output of the script to show method calls, call counts, frequencies, and average and 90th percentile latencies. Our preference is Ruby, but they can do it however they'd like. I had one candidate choose to implement this in Bash, writing some serious regex-fu that I had no idea how it worked. He got stuck, however, and since he claimed to be a Ruby developer I couldn't help but ask why he didn't do it in Ruby, which was my stated preference. He started over in Ruby and did okay. Depending on how much time was spent on problem 1, this part of the interview is really boring for me. I stay on the phone in case they have questions, and I ask them to explain their approach before they begin coding, but then I just start checking email/etc. After the 60 minutes total is up, I explain to the candidate that they can continue working on the coding task as long as they need, and to send me an email when they've finished. I get off the phone, however, stating that we'll give them feedback as soon as we've reviewed the code they submit, and explain the next steps.
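
For a flavour of the kind of answer we're after, here's a minimal sketch in Ruby. The log format is entirely invented for illustration - our real Tomcat logs obviously differ - as are the field positions:

# Parse hypothetical log lines of the form:
#   2011-07-01T10:00:02 getUserProfile 153
# i.e. timestamp, method name, latency in milliseconds.
latencies = Hash.new { |h, k| h[k] = [] }

ARGF.each_line do |line|
  _timestamp, method, ms = line.split
  next unless ms                  # skip malformed lines
  latencies[method] << ms.to_f
end

latencies.each do |method, samples|
  sorted  = samples.sort
  average = sorted.reduce(:+) / sorted.size
  p90     = sorted[(0.9 * (sorted.size - 1)).round]   # crude 90th percentile
  # (a fuller answer would also derive call frequency from the timestamps)
  puts format("%-25s count=%d avg=%.1fms p90=%.1fms",
              method, sorted.size, average, p90)
end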

Results

I put several candidates through this process. When we first created this test, I'd have a couple of members of my team on the call as well, but we found this too time consuming and a bit intimidating to certain candidates. Timeboxing problem 1 was a HUGE improvement, and once Stephen Nelson-Smith was on board I had someone better than me at evaluating the Ruby code. We all felt this test process was extremely revealing of candidates' skillsets, and I highly recommend it.

One of my favourite candidates conducted this interview on a laptop in the shared wifi area of a crowded and noisy London hostel. In the background were screaming people and overbearing Christmas music. He was able to tune out the distractions and nailed both problems with ease, and got major bonus points for doing so.

Round 2 - Face to face interview

Round 2 actually has a few parts:

  • Coffee/lunch/dinner informal chat up to 1 hour in length. I explain what I'm looking for; they can talk about themselves; we can find out if we have a good match.
  • Hypothetical whiteboard problem solving exercise: You receive a call saying a customer goes to http://yoursite.com and gets a blank page. What do you do next? We can improvise a bit here on what the actual problem is, but we're hoping to learn two things: How does this person approach problem solving? What level of architectural complexity have they been exposed to?
  • 2 hours of pair programming with a member of my team. This is usually a real bit of work that needs doing. It could be writing a chef cookbook, or a cucumber test, etc. We want to learn what it's like to work closely with this person. My team pair programs often. Do we want to pair with this person day in / day out?

Round 3 - my boss + any member of my team who hasn't met the candidate yet.

  • This is generally very open, though my boss has her own techniques for evaluating people.

It's very important to me that everyone on my team have a voice. I was quite keen on one candidate, but when one of my team members voiced vague concerns about the person's team-fit, we all stopped and took it on board. We rejected the candidate in the end, because once the first doubts were out in the open, other people's concerns started to be raised as well. I recognised that I was a bit too keen to hire someone to fill a pressing need, and am glad how things worked out.

A GREAT candidate/hire

One of my favourite hires not only knows C, Java, and Linux, but also wrote a sample Ruby application because he knew we were looking to hire Ruby skills within the team. His app worked out the shortest path between tube stations, though only in terms of number of stops, not time travelled. This initiative told me a lot about him, and it's been 100% the same since he joined the team. Eager to learn and try new things. Any problem/task put in front of him is 'easy'. My only trouble is he tends to consider problems solved once he's worked out in his head how he will solve them. This is a bit of a joke really. I accused him the other day of declaring checkmate on a task because he was so confident it would be completed in his next seven steps.

Beyond hiring

Now what? Well, hiring the right people is HUGE. We celebrated each hire, as opposed to the typical 'leaving drinks' when people move on. How I manage the team will be a future blog post (I hope), but I'll add one quick comment. Hiring people according to the vision I had means that I am held accountable as well. Whenever I find myself explaining that the reason for a decision I'm making is 'politics', I know I have to change.

About the author

Brian Henerey heads up Operations Engineering in the Online Technology Group at Sony Computer Entertainment Europe. His passions include Devops, Tool-chains, Web Operations, Continuous Delivery and Lean thinking. He's currently building automated infrastructure pipelines with Ruby, Chef, and AWS, enabling self-service, just-in-time development and test environments for Sony's Worldwide Studios.

Image Image

Building a Devops team

This is a guest post by Brian Henerey, from Sony Computer Entertainment Europe.

Background

I’ve had 3 roles at Sony since joining in August 2008. Nearly a year ago I took over the management of the original engineering team I joined. This was a failing team by any definition, but I was excited about the opportunity to reshape it. I knew the remaining team was deeply unhappy and likely to quit at any moment, so I had a few immediate goals:

  • Hire!
  • Keep people from quitting.
  • Hire!

Side story: I stumbled on one important objective I didn’t list however. Keep customers happy. It doesn’t matter how awesome you think your team can be if no one wants to work with you based on past experiences. I didn’t appreciate how much a demotivated employee could jeopardise customer relationships by virtue of not caring. It has taken me months to restore trust with one customer. I’ve heard a story about a manager offering employees £500 to quit on a regular basis. I think that probably has some practical problems, but its a tempting idea to cull the unmotivated.

I come from a long background of small/medium size enterprises. It has been a challenge adapting to a large corporation, but I don’t think there’s much unique to Sony about the anti-Devops patterns I’ve encountered. I know several people in small companies who says they’ve been practicing Devops before there was such a word and I completely agree. The trouble of silos, bureaucracy, organizational boundaries, politics, etc, seem pretty common in larger businesses though. I can’t speak to how to create a Devops culture across a large organisation from the top down, but I’ve been working really hard to create one from the inside.

The beginning

A year ago I’d never heard of the term Devops. If you’re in the same boat, it is easy to find a great deal to read about what Devops is:

And what it is not:

However, I suspect some people will have trouble finding the read-worthy gems amongst all the chatter. Here’s a good place to get started: getting started with devops. The gigantic list of Devops related bookmarks compiled by Patrick Debois shows why you may not want to try and read everything: devops bookmarks

If you’re in the know already and Devops resonates with you, and you want to build a team around the concept, here’s how I went about it.

Networking

The terms Devops didn’t really take shape for me until I started to talk about it with others. Fortunately, London has a really active Devops community so I’ve had ample opportunity. The tireless Gareth Rushgrove organises many events, and The Guardian is a frequent host. I’ve been to sessions discussing Continuous Integration, Deployments, Google App Engine, Load Balancers, Chef, CloudFoundry, etc. I’ve found people to be incredibly open about technology, processes, culture, difficulties and successes they’ve had.

While Devops is of course about more than technology and tools, I personally have found Devops to be an excellent banner under which to have really interesting conversations. Having a forum which brings people from diverse backgrounds together has helped me shape my own internal understanding of what Devops should be about.

I felt a bit of an imposter going to the initial London Devops meetups because I was so keen on recruiting. However, the quality of the discussions has been so good I eagerly anticipate each upcoming meetup even though I’m no longer hiring. I’ve also discovered that half the attendees are also hiring. It’s a Devopsee’s market.

Result!: I met and subsequently hired Stephen Nelson-Smith from Atalanta-Systems. (He’s @Lordcope on twitter, and the author of agilesysadmin.net

Working definition of Devops

If you’re going to hire people with Devops in mind, its good to have a working definition. I like the pillars of Devops (CAMS) put forth by John Willis: what devops means to me

  • Culture
  • Automation
  • Measurement
  • Sharing

SMAC might have been a better acronym, but I’ll go with CAMS.

A Devops job spec

I don’t think Devops is a role, though I’ve seen jobs posting for such a thing. I only mentioned that I was looking for someone ‘Devops-savvy’, and later changed it to ‘Devops-minded’ or something similar. The job posting expired and I’d have to dig it out, but R.I.Pinearr described in on Twitter as the ‘perfect devops job posting’. I’m pretty keen on revising a job spec until the requirements are only things I actually require and can measure against. Saying that, how to write a job spec is way outside the scope of this post. To summarize, I was looking for:

  • problem solving skills
  • ‘can do’ attitude
  • good team fit (really hard to quantify)
  • a broad set of skills (LAMP, Java, C++, Ruby, Python, Oracle, Scaling/Capacity, High-Availability, etc, etc)

My team works on a ton of different technology stacks, and the landscape is constantly changing. Its a techie-dream job, but the interpersonal skills are the most important.

Recruiters

I strongly believe in giving recruiters a fair bit of my time. I’ve seen many people be rude to recruiters, ignore them, etc, and then wonder why they don’t get good candidates through. I’m quite keen on engaging the recruiters, explaining the role I’m trying to fill thoroughly, and having the occasional coffee or beer with them. Feedback is of course vital to candidates, and I try to give it honestly and quickly, letting the recruiter worry about sugar coating things.

CV selection

This is tough. I regularly get CV blindness where everyone starts to look the same. And generally ill-suited. I try to remember there are human beings on the other end and force myself to have concrete reasons why I’m rejecting someone. Talking to a recruiter about this helps me be concrete.

First interview - remote technical test

This is where things get interesting! I don’t know if this is unique to London, but I’ve had a LOT of candidates from other countries apply to join this team. If someone has a good CV and the recruiter vouches for their English language skills, I developed a great screening test which can be conducted remotely. This saves a trip to London + hotel, and I can end it promptly if things aren’t going well. Here’s how it works:

  • I email the candidate/recruiter a url to an ec2 instance that I spin up on the day about 20 minutes before the interview.
  • The instance is running a web browser which contains instructions for the test. These only state that the candidate will need a terminal such as Putty if they’re on Windows.
  • At the arranged time I phone the candidate. I explain that there will be two tests. The first is a sys admin task which will be time bound to 20 minutes. The second is a programming task which they can use the remainder of the time to complete. The call will end after 1 hour.
  • I explain the rules: They are to perform all of their work on the ec2 instance. They have a test account/password, and sudo root access. They can use any resources they want to solve the problems. Google, man pages, libraries are not only fair game, but fully expected.
  • I explain what I want from them: They need to talk to me, tell me what they are thinking, and walk me through the problem solving process. I’m far more interested in that dialogue than whether they solve either problem I give them.
  • I also add that we’re using Screen, and I can see everything they type.
  • I swap the index.html with the complete instructions in place, make note of the time, and let them begin.

The problems

1) Its really quite simple: install Wordpress and configure it to work properly. The catch is that we install mysql first, break it, and then watch as candidates wonder what the heck is going on. For an experienced sysadmin this is child’s play. I tended to interview people with stronger development background and less familiar installing applications. I could tell almost immediately how well someone knew there way around a Linux system. It was interesting to see what kinds of assumptions people made about the system itself (I never mentioned the OS that was running. Several just assumed Ubuntu.) Some people read instructions, some don’t. I give people the mysqladmin password, but some people search on how to reset a lost password because they didn’t read what I gave them. I had one guy spend 10 minutes trying to ssh to http://ec2……. I gave him a pass on nerves, but he continued to suck and I ended it soon there after. He blamed language barrier (Eastern European), and said if only I had been more clear to him. If I can’t communicate with him, I think that’s a pretty big problem and it doesn’t really matter who’s fault it is.

2) We provide sanitized Production Tomcat logs for a real application we support and ask the candidate to write a log parsing script in a language of their choice. We want the output of the script to show methods calls, call counts, frequencies, average and 90% latencies. Our preference is Ruby, but they can do it however they’d like. I had one candidate choose to implement this in Bash and was writing some serious regex-fu that I had no idea how it worked. He got stuck however, and I couldn’t help but ask as he claimed to be a Ruby developer why he didn’t do it in Ruby, which was my stated preference. He started over in Ruby and did okay. Depending how much time was spent on problem 1, this part of the interview is really boring for me. I stay on the phone in case they have questions, I ask them to explain their approach before they begin coding, but then I just start checking email/etc. After 60 minutes total is up, I explain to the candidate that they can continue working on the coding task as long as they need and to send me an email when they’ve finished. I get off the phone however, stating that we’ll give them feedback as soon as we’ve reviewed the code they submit and explain the next steps.

Results

I put several candidates through this process. In the beginning of creating this test, I’d have a couple members of my team on this call as well, but we found this too time consuming and a bit intimidating to certain candidates. Timeboxing problem 1 was a HUGE improvement, and once Stephen Nelson-Smith was on board I had someone better than me at evaluating the Ruby code. We all felt this test process was extremely revealing of candidates skillsets and I highly recommend it.

One of my favourite candidates conducted this interview on a laptop in the shared wifi area of a crowded and noisy London hostel. In the background were screaming people and overbearing Christmas music. He was able to tune out the distractions and nailed both problems with ease, and got major bonus points for doing so.

Round 2 - Face to face interview

Round 2 actually has a few parts:

  • Coffee/lunch/dinner informal chat up to 1 hour in length. I explain what I’m looking for; they can talk about themselves; we can find out if we have a good match.
  • Hypothetical whiteboard problem solving exercise: You receive a call saying customer goes to http://yoursite.com and gets a blank page. What do you do next? We can improvise a bit here on what the actual problem is, but we’re hoping to learn two things: How does this person approach problem solving? What level of architectural complexity have they been exposed to?
  • 2 hours of pair programming with a member of my team. This is usually a real bit of work that needs doing. It could be writing a chef cookbook, or a cucumber test, etc. We want to learn what its like to work closely with this person. My team pair programs often. Do we want to pair with this person day in / day out?

Round 3 - my boss + any member of my team who hasn’t met the candidate yet.

  • This is generally very open, though my boss has her own techniques for evaluating people.

Its very important to me that everyone on my team have a voice. I was quite keen on one candidate, but when one of my team member’s voiced vague concerns about the person’s team-fit, we all stopped and took it on board. We rejected the candidate in the end because once the first doubts were out in the open, other people’s concerns started to be raised as well. I recognised that I was a bit too keen to hire someone to fill a pressing need and am glad how things worked out..

A GREAT candidate/hire

One of my favourite hires not only does he know C, Java, and Linux, but wrote a sample Ruby application because he knew we were looking to hire Ruby skills within the team. His app worked out the shortest path between tube stations, though only in terms of number of stops, not time travelled. This initiative told me a lot about him, and its been 100% the same since he joined the team. Eager to learn and try new things. Any problem/task put in front of him is ‘easy’. My only trouble is he tends to consider problems solved when he’s worked out in his head how he will solve it. This is a bit of a joke really. I accused him the other day of declaring checkmate on a task because he was so confident it would be completed in his next 7 seven steps.

Beyond hiring

Now what? Well, hiring the right people is HUGE. We celebrated each hire, as opposed to the typical ‘leaving drinks’ when people move on. How I manage the team will be a future blog post (I hope), but I’ll add one quick comment. Hiring people according to the vision I had means that I am held accountable as well. Whenever I find myself explaining that the reason for a decision I’m making is ‘politics’, I know I have to change.

About the author

Image

Brian Henerey heads up Operations Engineering in the Online Technology Group at Sony Computer Entertainment Europe. His passions include Devops, Tool-chains, Web Operations, Continuous Delivery and Lean thinking. He’s currently building automated infrastructure pipelines with Ruby, Chef, and AWS, enabling self-service, just-in-time development and test environments for Sony’s Worldwide Studios.

Image Image

Building a Devops team

This is a guest post by Brian Henerey, from Sony Computer Entertainment Europe.

Background

I’ve had 3 roles at Sony since joining in August 2008. Nearly a year ago I took over the management of the original engineering team I joined. This was a failing team by any definition, but I was excited about the opportunity to reshape it. I knew the remaining team was deeply unhappy and likely to quit at any moment, so I had a few immediate goals:

  • Hire!
  • Keep people from quitting.
  • Hire!

Side story: I stumbled on one important objective I didn’t list however. Keep customers happy. It doesn’t matter how awesome you think your team can be if no one wants to work with you based on past experiences. I didn’t appreciate how much a demotivated employee could jeopardise customer relationships by virtue of not caring. It has taken me months to restore trust with one customer. I’ve heard a story about a manager offering employees £500 to quit on a regular basis. I think that probably has some practical problems, but its a tempting idea to cull the unmotivated.

I come from a long background of small/medium size enterprises. It has been a challenge adapting to a large corporation, but I don’t think there’s much unique to Sony about the anti-Devops patterns I’ve encountered. I know several people in small companies who says they’ve been practicing Devops before there was such a word and I completely agree. The trouble of silos, bureaucracy, organizational boundaries, politics, etc, seem pretty common in larger businesses though. I can’t speak to how to create a Devops culture across a large organisation from the top down, but I’ve been working really hard to create one from the inside.

The beginning

A year ago I’d never heard of the term Devops. If you’re in the same boat, it is easy to find a great deal to read about what Devops is, and what it is not. However, I suspect some people will have trouble finding the read-worthy gems amongst all the chatter. A good place to get started is ‘getting started with devops’. The gigantic list of Devops-related bookmarks compiled by Patrick Debois shows why you may not want to try to read everything: devops bookmarks

If you’re in the know already and Devops resonates with you, and you want to build a team around the concept, here’s how I went about it.

Networking

The term Devops didn’t really take shape for me until I started to talk about it with others. Fortunately, London has a really active Devops community, so I’ve had ample opportunity. The tireless Gareth Rushgrove organises many events, and The Guardian is a frequent host. I’ve been to sessions discussing Continuous Integration, Deployments, Google App Engine, Load Balancers, Chef, CloudFoundry, etc. I’ve found people to be incredibly open about technology, processes, culture, and the difficulties and successes they’ve had.

While Devops is of course about more than technology and tools, I personally have found Devops to be an excellent banner under which to have really interesting conversations. Having a forum which brings people from diverse backgrounds together has helped me shape my own internal understanding of what Devops should be about.

I felt a bit of an imposter going to the initial London Devops meetups because I was so keen on recruiting. However, the quality of the discussions has been so good I eagerly anticipate each upcoming meetup even though I’m no longer hiring. I’ve also discovered that half the attendees are also hiring. It’s a Devopsee’s market.

Result! I met and subsequently hired Stephen Nelson-Smith from Atalanta Systems. (He’s @Lordcope on Twitter, and the author of agilesysadmin.net.)

Working definition of Devops

If you’re going to hire people with Devops in mind, it’s good to have a working definition. I like the pillars of Devops (CAMS) put forth by John Willis: what devops means to me

  • Culture
  • Automation
  • Measurement
  • Sharing

SMAC might have been a better acronym, but I’ll go with CAMS.

A Devops job spec

I don’t think Devops is a role, though I’ve seen job postings for such a thing. I only mentioned that I was looking for someone ‘Devops-savvy’, and later changed it to ‘Devops-minded’ or something similar. The job posting has expired and I’d have to dig it out, but R.I. Pienaar described it on Twitter as the ‘perfect devops job posting’. I’m pretty keen on revising a job spec until the requirements are only things I actually require and can measure against. That said, how to write a job spec is way outside the scope of this post. To summarise, I was looking for:

  • problem solving skills
  • ‘can do’ attitude
  • good team fit (really hard to quantify)
  • a broad set of skills (LAMP, Java, C++, Ruby, Python, Oracle, Scaling/Capacity, High-Availability, etc, etc)

My team works on a ton of different technology stacks, and the landscape is constantly changing. It’s a techie’s dream job, but the interpersonal skills are the most important.

Recruiters

I strongly believe in giving recruiters a fair bit of my time. I’ve seen many people be rude to recruiters, ignore them, etc, and then wonder why they don’t get good candidates through. I’m quite keen on engaging the recruiters, explaining the role I’m trying to fill thoroughly, and having the occasional coffee or beer with them. Feedback is of course vital to candidates, and I try to give it honestly and quickly, letting the recruiter worry about sugar coating things.

CV selection

This is tough. I regularly get CV blindness, where everyone starts to look the same, and generally ill-suited. I try to remember there are human beings on the other end and force myself to have concrete reasons why I’m rejecting someone. Talking to a recruiter about this helps me be concrete.

First interview - remote technical test

This is where things get interesting! I don’t know if this is unique to London, but I’ve had a LOT of candidates from other countries apply to join this team. For candidates with a good CV whose English language skills the recruiter vouches for, I developed a great screening test which can be conducted remotely. This saves a trip to London plus a hotel, and I can end it promptly if things aren’t going well. Here’s how it works:

  • I email the candidate/recruiter a URL to an EC2 instance that I spin up on the day, about 20 minutes before the interview.
  • The instance is running a web server which serves the instructions for the test. These only state that the candidate will need a terminal such as PuTTY if they’re on Windows.
  • At the arranged time I phone the candidate. I explain that there will be two tests. The first is a sys admin task which will be time bound to 20 minutes. The second is a programming task which they can use the remainder of the time to complete. The call will end after 1 hour.
  • I explain the rules: They are to perform all of their work on the ec2 instance. They have a test account/password, and sudo root access. They can use any resources they want to solve the problems. Google, man pages, libraries are not only fair game, but fully expected.
  • I explain what I want from them: They need to talk to me, tell me what they are thinking, and walk me through the problem solving process. I’m far more interested in that dialogue than whether they solve either problem I give them.
  • I also add that we’re using Screen, and I can see everything they type.
  • I swap in the index.html with the complete instructions, make a note of the time, and let them begin.

The problems

1) It’s really quite simple: install Wordpress and configure it to work properly. The catch is that we install mysql first, break it, and then watch as candidates wonder what the heck is going on. For an experienced sysadmin this is child’s play, but I tended to interview people with stronger development backgrounds who were less familiar with installing applications. I could tell almost immediately how well someone knew their way around a Linux system. It was interesting to see what kinds of assumptions people made about the system itself (I never mentioned the OS that was running; several just assumed Ubuntu). Some people read instructions, some don’t. I give people the mysqladmin password, but some people search for how to reset a lost password because they didn’t read what I gave them. I had one guy spend 10 minutes trying to ssh to http://ec2……. I gave him a pass on nerves, but he continued to suck and I ended it soon thereafter. He blamed the language barrier (Eastern European), and said if only I had been clearer with him. If I can’t communicate with him, that’s a pretty big problem and it doesn’t really matter whose fault it is.

2) We provide sanitised Production Tomcat logs for a real application we support and ask the candidate to write a log parsing script in a language of their choice. We want the output of the script to show method calls, call counts, frequencies, and average and 90th-percentile latencies. Our preference is Ruby, but they can do it however they’d like. I had one candidate choose to implement this in Bash, writing some serious regex-fu that I had no idea how it worked. He got stuck, however, and I couldn’t help but ask, since he claimed to be a Ruby developer, why he didn’t do it in Ruby, which was my stated preference. He started over in Ruby and did okay. Depending how much time was spent on problem 1, this part of the interview is really boring for me. I stay on the phone in case they have questions, and I ask them to explain their approach before they begin coding, but then I just start checking email. After the 60 minutes are up, I explain to the candidate that they can continue working on the coding task as long as they need, and should send me an email when they’ve finished. I get off the phone, however, stating that we’ll give them feedback as soon as we’ve reviewed the code they submit, and explain the next steps.
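For a sense of what this task involves, here’s a minimal sketch of such a parser in Ruby. The log line format is invented for illustration - real Tomcat logs will need a different regex - and it uses the simple nearest-rank method for the 90th percentile:

calls = Hash.new { |h, k| h[k] = [] }

# Assumes (hypothetically) each line contains a method name followed by
# a latency in milliseconds, e.g. "2009-12-01 10:00:00 INFO getBasket 123ms"
ARGF.each_line do |line|
  next unless line =~ /(\w+)\s+(\d+)ms/
  calls[$1] << $2.to_i
end

total = calls.values.map(&:size).reduce(0, :+)

calls.sort_by { |_, latencies| -latencies.size }.each do |method, latencies|
  sorted = latencies.sort
  count  = latencies.size
  avg    = latencies.reduce(:+) / count.to_f
  p90    = sorted[(count * 0.9).ceil - 1] # nearest-rank 90th percentile
  printf("%-20s calls=%-6d freq=%5.1f%% avg=%7.1fms p90=%5dms\n",
         method, count, 100.0 * count / total, avg, p90)
end

Run it as ruby parse_logs.rb catalina.log and it prints one summary row per method.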

Results

I put several candidates through this process. When I first created this test, I’d have a couple of members of my team on the call as well, but we found this too time-consuming and a bit intimidating to certain candidates. Timeboxing problem 1 was a HUGE improvement, and once Stephen Nelson-Smith was on board I had someone better than me at evaluating the Ruby code. We all felt this test process was extremely revealing of candidates’ skillsets, and I highly recommend it.

One of my favourite candidates conducted this interview on a laptop in the shared wifi area of a crowded and noisy London hostel. In the background were screaming people and overbearing Christmas music. He tuned out the distractions, nailed both problems with ease, and got major bonus points for doing so.

Round 2 - Face to face interview

Round 2 actually has a few parts:

  • Coffee/lunch/dinner informal chat up to 1 hour in length. I explain what I’m looking for; they can talk about themselves; we can find out if we have a good match.
  • Hypothetical whiteboard problem-solving exercise: you receive a call saying a customer goes to http://yoursite.com and gets a blank page. What do you do next? We can improvise a bit here on what the actual problem is, but we’re hoping to learn two things: how does this person approach problem solving, and what level of architectural complexity have they been exposed to?
  • 2 hours of pair programming with a member of my team. This is usually a real bit of work that needs doing. It could be writing a Chef cookbook, or a Cucumber test, etc. We want to learn what it’s like to work closely with this person. My team pair programs often. Do we want to pair with this person day in, day out?

Round 3 - my boss + any member of my team who hasn’t met the candidate yet.

  • This is generally very open, though my boss has her own techniques for evaluating people.

It’s very important to me that everyone on my team has a voice. I was quite keen on one candidate, but when one of my team members voiced vague concerns about the person’s team fit, we all stopped and took it on board. We rejected the candidate in the end, because once the first doubts were out in the open, other people’s concerns started to be raised as well. I recognised that I was a bit too keen to hire someone to fill a pressing need, and am glad how things worked out.

A GREAT candidate/hire

One of my favourite hires: not only does he know C, Java, and Linux, but he wrote a sample Ruby application because he knew we were looking to hire Ruby skills within the team. His app worked out the shortest path between tube stations, though only in terms of number of stops, not time travelled. This initiative told me a lot about him, and it’s been 100% the same since he joined the team. He’s eager to learn and try new things; any problem/task put in front of him is ‘easy’. My only trouble is he tends to consider problems solved once he’s worked out in his head how he will solve them. This is a bit of a joke really: I accused him the other day of declaring checkmate on a task because he was so confident it would be completed in his next seven steps.
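Counting stops rather than travel time makes that a textbook breadth-first search over an unweighted graph. Purely for illustration, here’s a minimal sketch of the idea in Ruby - the station data is invented, and his actual application will certainly have differed:

require 'set'

# A tiny, made-up fragment of the tube map: station => adjacent stations
TUBE = {
  'Holborn'          => ['Chancery Lane', 'Covent Garden', 'Russell Square'],
  'Chancery Lane'    => ['Holborn', "St. Paul's"],
  'Covent Garden'    => ['Holborn', 'Leicester Square'],
  'Russell Square'   => ['Holborn', "King's Cross"],
  "St. Paul's"       => ['Chancery Lane', 'Bank'],
  'Leicester Square' => ['Covent Garden'],
  "King's Cross"     => ['Russell Square'],
  'Bank'             => ["St. Paul's"],
}

# Breadth-first search: the first time we reach a station, we have
# reached it in the fewest possible stops.
def fewest_stops(from, to)
  seen  = Set[from]
  queue = [[from]]
  until queue.empty?
    path = queue.shift
    return path if path.last == to
    TUBE.fetch(path.last, []).each do |nxt|
      queue << (path + [nxt]) unless seen.include?(nxt)
      seen << nxt
    end
  end
  nil # no route
end

p fewest_stops('Covent Garden', 'Bank')
# => ["Covent Garden", "Holborn", "Chancery Lane", "St. Paul's", "Bank"]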

Beyond hiring

Now what? Well, hiring the right people is HUGE. We celebrated each hire, as opposed to the typical ‘leaving drinks’ when people move on. How I manage the team will be a future blog post (I hope), but I’ll add one quick comment. Hiring people according to the vision I had means that I am held accountable as well. Whenever I find myself explaining that the reason for a decision I’m making is ‘politics’, I know I have to change.

About the author

Brian Henerey heads up Operations Engineering in the Online Technology Group at Sony Computer Entertainment Europe. His passions include Devops, Tool-chains, Web Operations, Continuous Delivery and Lean thinking. He’s currently building automated infrastructure pipelines with Ruby, Chef, and AWS, enabling self-service, just-in-time development and test environments for Sony’s Worldwide Studios.


Kanban for Sysadmin

This article was originally published in December 2009, in Jordan Sissel's SysAdvent.

Unless you've been living in a remote cave for the last year, you've probably noticed that the world is changing. With the maturing of automation technologies like Puppet, the popular uptake of Cloud Computing, and the rise of Software as a Service, the walls between developers and sysadmins are beginning to be broken down. Increasingly we're beginning to hear phrases like 'Infrastructure is code', and terms like 'Devops'. This is all exciting. It also has an interesting knock-on effect. Most development environments these days are at least strongly influenced by, if not run entirely according to 'Agile' principles. Scrum in particular has experienced tremendous success, and adoption by non-development teams has been seen in many cases. On the whole the headline objectives of the Agile movement are to be embraced, but the thorny question of how to apply them to operations work has yet to be answered satisfactorily.

I've been managing systems teams in an Agile environment for a number of years, and after thought and experimentation, I can recommend using an approach borrowed from Lean systems management, called Kanban.

Operations teams need to deliver business value

As a technical manager, my top priority is to ensure that my teams deliver business value. This is especially important for Web 2.0 companies - the infrastructure is the platform, is the product, is the revenue. Especially in tough economic times it's vital to make sure that as sysadmins we are adding value to the business.

In practice, this means improving throughput - we need to be fixing problems more quickly, delivering improvements in security, performance and reliability, and removing obstacles to enable us to ship product more quickly. It also means building trust with the business - improving the predictability and reliability of delivery times. And, of course, it means improving quality - the quality of the service we provide, the quality of the staff we train, and the quality of life that we all enjoy - remember - happy people make money.

The development side of the business has understood this for a long time. Aided by Agile principles (and implemented using such approaches as Extreme Programming or Scrum) developers organise their work into iterations, at the end of which they will deliver a minimum marketable feature, which will add value to the business.

The approach may be summarised as moving from the historic model of software development as a large team taking a long time to build a large system, towards small teams, spending a small amount of time, building the smallest thing that will add value to the business, but integrating frequently to see the big picture.

Systems teams starting to work alongside such development teams are often tempted to try the same approach.

The trouble is, for a systems team, committing to a two week plan, and setting aside time for planning and retrospective meetings, prioritisation and estimation sessions just doesn't fit. Sysadmin work is frequently interrupt-driven, demands on time are uneven, frequently specialised and require concentrated focus. Radical shifts in prioritisation are normal. It's not even possible to commit to much shorter sprints of a day, as sysadmin work also includes project and investigation activities that couldn't be delivered in such a short space of time.

Dan Ackerman recently carried out a survey in which he asked sysadmins their opinions and experience of using agile approaches in systems work[1]. The general feeling was that it helped encourage organisation, focus and coordination, but that it didn't seem to handle the reactive nature of systems work, and the prescription of regular meetings interrupted the flow of work. My own experience of sysadmins trying to work in iterations is that they frequently fail their iterations, because the world changed (sometimes several times) and the iteration no longer captured the most important things. A strict, iteration-based approach just doesn't work well for operations - we're solving different problems. When we contrast a highly interdependent systems team with a development team who work together for a focussed time, answering to themselves, it's clear that the same tools won't necessarily be appropriate.

What is Kanban, and how might it help?

Let's keep this really really simple. You might read other explanations making it much more complicated than necessary. A Kanban system is simply a system with two specific characteristics. Firstly, it is a pull-based system. Work is only ever pulled into the system, on the basis of some kind of signal. It is never pushed; it is accepted, when the time is right, and when there is capacity to do the work. Secondly, work in progress (WIP) is limited. At any given time there is a limit to the amount of work flowing through the system - once that limit is reached, no more work is pulled into the system. Once some of that work is complete, space becomes available and more work is pulled into the system.
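Those two rules are simple enough to capture in a few lines of code. Here's a minimal sketch in Ruby - the class and card names are invented purely for illustration:

# A board with a single 'in progress' section and a WIP limit.
class KanbanBoard
  def initialize(wip_limit)
    @wip_limit   = wip_limit
    @in_progress = []
    @done        = []
  end

  # Pull-based: work enters only when there is capacity.
  def pull(card)
    return false if @in_progress.size >= @wip_limit
    @in_progress << card
    true
  end

  def finish(card)
    @done << @in_progress.delete(card)
  end
end

board = KanbanBoard.new(2)
board.pull('fix backup cron job')   # => true
board.pull('upgrade puppetmaster')  # => true
board.pull('rebuild mail relay')    # => false - at the WIP limit
board.finish('fix backup cron job') # capacity frees up...
board.pull('rebuild mail relay')    # => true

Nothing is ever pushed onto the team: a new card simply waits until pull succeeds.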

Kanban as a system is all about managing flow - getting a constant and predictable stream of work through, whilst improving efficiency and quality. This maps perfectly onto systems work - rather than viewing our work as a series of projects, with annoying interruptions, we view our work as a constant stream of work of varying kinds.

As sysadmins we are not generally delivering product, in the sense that a development team are. We're supporting those who do, addressing technical debt in the systems, and looking for opportunities to improve resilience, reliability and performance.

Supporting tools

Kanban is usually associated with some tools to make it easy to implement the basic philosophy. Again, keeping it simple, all we need is a stack of index cards and a board.

The word Kanban itself means 'signal card' - it is a token which represents a piece of work which needs to be done. This maps conveniently onto the agile 'story card'. The board is a planning tool and an information radiator. Typically it is organised into the various stages of the journey that a piece of work goes through. This could be as simple as to-do, in-progress and done, or could feature more intermediate steps.

The WIP limit controls the amount of work (or cards) that can be in any particular part of the board. The board makes visible exactly who is working on what, and how much capacity the team has. It provides information to the team, to managers, and to anyone else interested in the progress and priorities of the team.

Kanban teams abandon the concept of iterations altogether. As Andrew Clay Shafer once said to me: "We will just work on the highest priority 'stuff', and kick-ass!"

How does Kanban help?

Kanban brings value to the business in three ways - it improves trust, it improves quality and it improves efficiency.

Trust is improved because very rapidly the team starts being able to deliver quickly on the highest priority work. There's no iteration overhead, it is absolutely transparent what the team is working on, and, because the responsibility for prioritising the work to be done lies outside the technical team, the business soon begins to feel that the team really is working for them.

Quality is improved because the WIP limit makes problems visible very quickly. Let's consider two examples - suppose we have a team of four sysadmins:

The team decides to set a WIP limit of one. This means that the team as a whole will only ever work on one piece of work at a time. While that work is being done, everything else has to wait. The effect of this will be that all four sysadmins will need to work on the same issue simultaneously. This will result in very high quality work, and the tasks themselves should get done fairly quickly, but it will also be wasteful. Work will start queueing up ahead of the 'in progress' section of the board, and the flow of work will be too slow. Also, it won't always be possible for all four people to work on the same thing, so for some of the time the other sysadmins will be doing nothing. This will be very obvious to anyone looking at the board. Fairly soon it will become apparent that the WIP limit of one is too low.

Suppose we now decide to increase the WIP limit to ten. The sysadmins go their own ways, each starting work on one card. The progress on each card will be slower, because there's only one person working on it, and the quality may not be as good, as individuals are more likely to make mistakes than pairs. The individual sysadmins also don't concentrate as well on their own, but work is still flowing through the system. However, fairly soon something will come up which makes progress difficult. At this stage a sysadmin will pick another card and work on that. Eventually two or three cards will be 'stuck' on the board, with no progress, while work flows around them owing to the large WIP limit. Eventually we might hit a big problem, system wide, that halts progress on all work, and perhaps even impacts other teams. It turns out that this problem was the reason why work stopped on those tasks earlier on. The problem gets fixed, but the impact on the team's productivity is significant, and the business has been impacted too. Had the WIP limit been lower, the team would have been forced to react sooner.

The board also makes it very clear to the team, and to anyone following the team, what kind of work patterns are building up. As an example, if the team's working cadence seems to be characterised by a large number of interrupts, especially for repeatable work, or to put out fires, that's a sign that the team is paying interest on technical debt. The team can then make a strong case for tackling that debt, and the WIP limit protects the team as they do so.

Efficiency is improved simply because this method of working has been shown to be the best way to get a lot of work through a system. Kanban has its origins in Toyota's lean processes, and has been explored and used in dozens of different kinds of work environment. Again, the effects of the WIP limit, and the visibility of their impact on the board makes it very easy to optimise the system, to reduce the cycle time - that is to reduce the time it takes to complete a piece of work once it enters the system.
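Cycle time is straightforward to measure if you record when each card enters and leaves the in-progress section of the board. A small sketch, with invented card names and timestamps:

require 'time'

# (card, started_at, finished_at) - in practice these would come from
# your board or ticketing system rather than being hard-coded.
records = [
  ['fix backup cron job', '2009-12-01 09:00', '2009-12-01 16:30'],
  ['upgrade mail relay',  '2009-12-01 10:00', '2009-12-03 11:00'],
]

cycle_times = records.map do |card, started, finished|
  hours = (Time.parse(finished) - Time.parse(started)) / 3600.0
  puts format('%-22s %6.1f hours', card, hours)
  hours
end

puts format('Average cycle time: %.1f hours',
            cycle_times.reduce(:+) / cycle_times.size)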

Another benefit of Kanban boards is that they encourage self-management. At any time any team member can look at the board and see at once what is being worked on, what should be worked on next and, with a little experience, where the problems are. If there's one thing sysadmins hate, it's being micro-managed. As long as there is commitment to respect the board, a sysops team will self-organise very well around it. Happy teams produce better quality work, at a faster pace.

How do I get started?

If you think this sounds interesting, here are some suggestions for getting started.

  • Have a chat to the business - your manager and any internal stakeholders. Explain to them that you want to introduce some work practices that will improve quality and efficiency, but which will mean that you will be limiting the amount of work you do - i.e. you will have to start saying no. Try the puppy dog close: "Let's try this for a month - if you don't feel it's working out, we'll go back to the way we work now".

  • Get the team together, buy them pizza and beer, and try playing some Kanban games. There are a number of ways of doing this, but basically you need to come up with a scenario in which the team has to produce things, but the work is going to be limited and only accepted when there is capacity. Speak to me if you want some more detailed ideas - there are a few decent resources out there.

  • Get the team together for a white-board session. Try to get a sense of the kinds of phases your work goes through. How much emergency support work is there? How much general user support? How much project work? Draw up a first cut of a Kanban board, and imagine some scenarios. The key thing is to be creative. You can make work flow left to right, or top to bottom. You can use coloured cards or plain cards - it doesn't matter. The point of the board is to show what work is being done, by whom, and to make explicit what the WIP limits are.

  • Set up your Kanban board somewhere highly visible and easy to get to. You could use a whiteboard and magnets, a cork board and pins, or just stick cards to a wall with Blu-Tack. You can draw lines with a ruler, or you can use insulating tape to give bold, straight dividers between sections. Make it big, and clear.

  • Agree your WIP limit amongst yourselves - it doesn't matter what it is - just pick a sensible number, and be prepared to tweak it based on experience.

  • Gather your current work backlog together and put each piece of work on a card. If you can, sit with the various stakeholders for whom the work is being done, so you can get a good idea of what the acceptance criteria are, and their relative importance. You'll end up with a huge stack of cards - I keep them in a card box, next to the board.

  • Get your manager, and any stakeholders, together and have a prioritisation session. Explain that there's a work in progress limit, but that work will get done quickly. Your team will work on whatever is agreed to be the highest priority. Then stick the highest priority cards at the left of (or top of) the board. I like to have a 'Next Please' section on the board, with a WIP limit. Cards can be added or removed by anyone from this section, and the team will pull from it when capacity becomes available.

  • Write up a team charter - decide on the rules. You might agree not to work on other people's cards without asking first. You might agree times of the day you'll work. I suggest two very important rules - once a card goes onto the in progress section of the board, it never comes off again, until it's done. And nobody works on anything that isn't on the board. Write the charter up, and get the team to sign it.

  • Have a daily standup meeting at the start of the day. At this meeting, unlike a traditional scrum or XP standup, we don't need to ask who is working on what, or what they're going to work on next - that's already on the board. Instead, talk about how much more is needed to complete the work, and discuss any problems or impediments that have come up. This is a good time for the team to write up cards for work they feel needs to be done to make their systems more reliable, or to make their lives easier. I recommend trying to get agreement from the business to always ensure one such card is in the 'Next Please' section.

  • Set up a ticketing system. I've used RT and Eventum. The idea is to reduce the amount of interrupts, and to make it easy to track whatever work is being carried out. We have a rule of thumb that everything needs a ticket. Work that can be carried out within about ten minutes can just be done, at the discretion of the sysadmin. Anything that's going to be longer needs to go on the board. We have a dedicated 'Support' section on our board, with a WIP limit. If there are more support requests than slots on the board, it's up to the requestors to agree amongst themselves which has the greatest business value (or cost).

  • Have a regular retrospective. I find fortnightly is enough. Set aside an hour or so, buy the team lunch, and talk about how the previous fortnight has been. Try to identify areas for improvement. I recommend using 'SWOT' (strengths, weaknesses, opportunities, threats) as a template for discussion. Also try to get into the habit of asking 'Five Whys' - keep asking why until you really get to the root cause. Also try to ensure you fix things 'Three ways'. These habits are part of a practice called 'Kaizen' - continuous improvement. They feed into your Kanban process, and make everyone's life easier, and improve the quality of the systems you're supporting.

The use of Kanban in development and operations teams is an exciting new practice, and one which people are finding fits very well with a devops approach to systems and development work. If you want to find out more, I recommend the following resources:

  • http://limitedwipsociety.org - the home of Kanban for software development; A central place where ideas, resources and experiences are shared.
  • http://finance.groups.yahoo.com/group/kanbandev - the mailing list for people deploying Kanban in a software environment - full of very bright and experienced people
  • http://www.agileweboperations.com - excellent blog covering all aspects of agile operations from a devops perspective

[1]http://www.agileweboperations.com/what-do-sysadmins-really-think-about-agile/