Archive → March, 2011

Vagrant & Rubylibs

I was testing some MySQL puppet modules on my Vagrant box earlier this week and one of them required augeas.
I kept running into "Could not find a default provider for augeas", even though all the appropriate augeas, augeas-lib and ruby-augeas packages were installed. I inspected the different ruby directories and the files were sitting in /usr/lib/ruby/site_ruby/1.8, exactly where I expected them.

Since all the files seemed to be in the right place, my next option was to strace a small ruby script that required augeas. Guess what that showed ...

  stat64("/opt/ruby/lib/ruby/site_ruby/1.8/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("/opt/ruby/lib/ruby/site_ruby/1.8/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("/opt/ruby/lib/ruby/site_ruby/1.8/i686-linux/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("/opt/ruby/lib/ruby/site_ruby/1.8/i686-linux/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("/opt/ruby/lib/ruby/site_ruby/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("/opt/ruby/lib/ruby/site_ruby/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("/opt/ruby/lib/ruby/vendor_ruby/1.8/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("/opt/ruby/lib/ruby/vendor_ruby/1.8/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("/opt/ruby/lib/ruby/vendor_ruby/1.8/i686-linux/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("/opt/ruby/lib/ruby/vendor_ruby/1.8/i686-linux/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("/opt/ruby/lib/ruby/vendor_ruby/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("/opt/ruby/lib/ruby/vendor_ruby/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("/opt/ruby/lib/ruby/1.8/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("/opt/ruby/lib/ruby/1.8/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("/opt/ruby/lib/ruby/1.8/i686-linux/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("/opt/ruby/lib/ruby/1.8/i686-linux/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("./augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  stat64("./augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)

Indeed ... the Vagrant box ships its default ruby in /opt/ruby, and obviously there were no ruby-augeas files in there.
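A quicker way to spot this kind of mismatch than strace is to ask the running ruby where it looks for libraries. The following is only a minimal sketch and was not part of the original debugging session: the helper name find_library is hypothetical, and it only checks the .rb and .so extensions that show up in the strace output.

```ruby
require 'rbconfig'

# Return every file in the given search paths that `require name` could match.
# Only the .rb and .so extensions are checked (hypothetical helper).
def find_library(name, paths = $LOAD_PATH)
  candidates = paths.flat_map do |dir|
    %w[.rb .so].map { |ext| File.join(dir.to_s, name + ext) }
  end
  candidates.select { |candidate| File.file?(candidate) }
end

if __FILE__ == $PROGRAM_NAME
  # Show which ruby installation is actually running this script.
  puts "Running ruby from: #{RbConfig::CONFIG['bindir']}"
  hits = find_library('augeas')
  if hits.empty?
    puts 'augeas was not found in any of these directories:'
    puts $LOAD_PATH
  else
    puts "augeas found at: #{hits.join(', ')}"
  end
end
```

Run with the same ruby that puppet uses, it should show immediately whether /usr/lib/ruby or /opt/ruby is being searched.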

Vagrant Testing, Testing, One Two

Now that we have Vagrant up and running with our favorite Config Management, let's see how we can integrate testing into our workflow.

Given our awesome project from my 'Using Vagrant as a Team' post we have the following components:

[DIR] awesome-vagrant (2)
    - [DIR] awesome-frontend
    - [DIR] awesome-datastore
    - [DIR] awesome-data
    - [DIR] awesome-chefrepo (1a)
    - [DIR] awesome-puppetrepo (1b)

What do we test?

As awesome-{frontend,datastore,data} are considered traditional software components, they would include the usual unit and integration tests of their own. You can find ample information on the web about testing your favorite kind of software component.

Cucumber and friends

Testing your configuration management is not that common yet, let's explore our options there:

Most of the current tools are inspired by 'cucumber', a 'behavior driven development' tool. Lindsay Holmwood's great presentation on 'cucumber-nagios' at devopsdays 2009 inspired a lot of the authors to use it.

A good book on Cucumber is The RSpec Book, and there is a great slideshare presentation on 'Writing software not code with cucumber', along with some caveats in 'You're cuking it wrong'.

Alternatively there is another framework called Babushka that sets out with its own testing DSL. I find it refreshing to see another approach being built upon.

Puppet testing options

For puppet you have 'cucumber-puppet', a testing framework for your manifests written by Nikolay Sturm.

Chef testing options

As chef did not implement the noop-mode, I guess it took some time to have an equivalent.

  • My first thought was to have puppet noop runs against a chef install, but that seemed limited for the business behavior and would only test if chef did its job.

  • Recently hedgehog announced he is writing chef steps for cucumber. The good thing is he's packaging these steps, plus those from cucumber-nagios and others, into a new gem called 'Cuken' (pronounced Cookin). The origin of the cuken project is Aruba, a set of cucumber steps to test a CLI application.

  • Also do check out Stephen Nelson-Smith's videocast on doing TDD with Chef and Cucumber with LXC containers on EC2 (http://skillsmatter.com/podcast/home/cucumber-chef/js-1541).

Integration testing

For our project we took another route: instead of testing our chef recipes as a standalone piece, we test the whole of our deployed stack: the provisioned/configured system plus all applications and data deployed. You have to see this as complementary to your recipe/manifest tests:

  1. Testing all components together allows you to test the interaction/integration between them.
  2. Testing only the recipes themselves would not catch integration problems (such as sessions not being generated), but the advantage is that you get a better idea of where things are failing than with type 1 tests.

This is very similar to the way unit tests and BDD tests complement each other: test inside out, and outside in.

Installing cucumber

cucumber is a rubygem: this means that besides the 'vagrant' gem, we now need cucumber and cuken installed too. Note that we will include only the cucumber-nagios steps and not the cuken part, as they still conflict in their ssh steps.

To avoid having to communicate the exact versions of these gems (and of any gem we add later) to every team member, we set out to create a 'Gemfile' that can be used by bundler. Our Gemfile looks like this:

source 'http://rubygems.org'
gem 'vagrant', '0.7.2'
gem 'cuken'
gem 'cucumber'
gem 'cucumber-nagios'

I tried to make cuken (which has the chef steps) work from the latest git repo:

gem 'cuken', :git => "git://github.com/hedgehog/cuken.git"
gem 'ssh-forever', :git => "git://github.com/mattwynne/ssh-forever.git"

But it complains about ssh-forever not being there, because that version was yanked. So no chef steps yet....

Update: 31/03/2011: It should work, and was probably a temporary fluke in my gemset

Now let's continue the installation of our gems using bundler.

We use a global gemset with rvm to install the bundler gem for all subsequent projects, and then run bundler in our awesome-vagrant gemset:

$ rvm gemset use @global
$ gem install bundler
$ rvm gemset use awesome-vagrant
$ bundle install

So now instead of doing 'gem install', you do:

$ bundle install

And it will install all the versions you specified in the Gemfile into the awesome-vagrant gemset. We add the Gemfile to the git repo of awesome-vagrant so people can add things if they need to.

You should now be able to run the cucumber command:

$ cucumber

Setting up our feature structure

In contrast to using cucumber with other frameworks such as rails, we have to do some work to get it working. We need to create a features directory similar to the one below.

    - Vagrantfile
    - Gemfile
    - awesome-{frontend,datastore,data,chefrepo} git repos
    - features
        - steps
            (steps go here)
        - support
        - (features go here)

In env.rb (inside the support directory) you can put all the necessary requires for the libraries you want to include:

require 'bundler'
begin
  Bundler.setup(:default, :development)
rescue Bundler::BundlerError => e
  $stderr.puts e.message
  $stderr.puts "Run `bundle install` to install missing gems"
  exit e.status_code
end

$LOAD_PATH.unshift(File.dirname(__FILE__) + '/../../lib')

# Disabling cuken until it gets less conflicting with other parts
# require 'cuken/ssh'
# require 'cuken/cmd'
# require 'cuken/file'
# require 'cuken/chef'

# We don't include all nagios steps only the http , but there are of-course more
# require 'cucumber/nagios/steps'
# Disable the following line if you want to use the extended ssh_steps
require 'cucumber/nagios/steps/ssh_steps'
require 'cucumber/nagios/steps/http_steps'
require 'cucumber/nagios/steps/http_header_steps'

require 'rspec/expectations'

# We use mechanize as this doesn't require us to be a rack application
require 'mechanize'
require 'webrat'

World do

Using SSH to run commands

Our first feature using cucumber ssh steps

Let's write our first feature that checks our apache, based on the example described in the cucumber-nagios blogpost:

Feature: Executing commands
  In order to test a running system
  As an administrator
  I want to verify the apache behavior

Scenario: Checking if apache is running
    When I ssh to "localhost" with the following credentials: 
     | username | password  |
     | vagrant  | vagrant | 
    And I run "ps -ef |grep http|grep -v grep" 
    Then I should see "http" in the output

Now run (assuming you have apache of course)

$ cucumber 

The problem with the standard cucumber-nagios steps is that they assume ssh is listening on port 22, whereas vagrant has mapped the ssh port of the box to another port on the host. See the ssh_steps code for details.

Our enhanced version of the ssh steps

We decided to extend the ssh steps to add a few more wrinkles to them.

  • Download our extended ssh steps file and put it into the steps directory we created earlier as 'ssh_extended_steps.rb'. It extends the ssh_steps so you can specify the ssh_port, and it captures stderr, stdout and the exit code too.
  • Do the same for 'vagrant_steps.rb': this will make your ssh steps vagrant aware.

Note: To avoid conflict with the cucumber-nagios be sure to disable the "cucumber/nagios/steps/ssh_steps" in your 'env.rb'
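To picture what the extended steps capture, here is a plain ruby sketch. It runs a command locally with Open3 rather than over SSH, so it only illustrates the three values (stdout, stderr and exit code) and is not the code of the extended steps file; the helper name run_command is hypothetical.

```ruby
require 'open3'
require 'rbconfig'

# Run a command and keep stdout, stderr and the exit code separately.
# Hypothetical helper: the real steps run the command over SSH instead.
def run_command(*command)
  stdout, stderr, status = Open3.capture3(*command)
  { :stdout => stdout, :stderr => stderr, :exitcode => status.exitstatus }
end

if __FILE__ == $PROGRAM_NAME
  # Use the current ruby interpreter as a command that is sure to exist.
  result = run_command(RbConfig.ruby, '-e', 'puts "apache2"; warn "oops"; exit 3')
  puts "stdout:   #{result[:stdout].inspect}"
  puts "stderr:   #{result[:stderr].inspect}"
  puts "exitcode: #{result[:exitcode]}"
end
```

Keeping the three values apart is what makes separate assertions on the output, the error stream and the exit code possible in a scenario.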

Feature: Executing commands
  In order to test a running system
  As an administrator
  I want to verify the apache behavior

    Scenario: Checking if apache is running through vagrant    
    Given I have a vagrant project in "."    
    When I ssh to vagrantbox "default" with the following credentials: 
    | username | password|
    | vagrant  | vagrant | 
    And I run "ps -ef |grep apache2|grep -v grep" 
    Then I should see "apache2" in the output
    And it should have exitcode 0
    And I should see "apache2" on stdout
    And there should be no output on stderr

The step Given I have a vagrant project loads the vagrant environment:

Given /^I have a vagrant project in "([^\"]*)"$/ do |path|
  @vagrant_env=Vagrant::Environment.new(:cwd => path)
end

And the step When I ssh to vagrantbox calculates the port it needs to ssh to:

unless @vagrant_env.multivm?

On a side note, you might notice @apache2: these are tags in cucumber that you can use to run only certain features or scenarios. The following will only run the ones tagged @apache2:

$ cucumber --tags @apache2

And this is how the step When I do a vagrant provision is implemented:

And /^I do a vagrant provision$/ do
  Vagrant::CLI.start(["provision"], :env => @vagrant_env)
end

Running component unit tests from within the machine

You can use the same mechanism to run your components' tests inside the machine itself. You can have your application tests mounted inside the VM and run the tests from there. We use this as a complement to our 'vagrant project' tests. The advantage of the vagrant tests is that they do an actual network connect without going through loopback, and that they allow you to orchestrate which VM you need to log in to in a multivm setup.

Feature: Executing commands
  In order to test a running system
  As an administrator
  I want to verify the apache behavior

    Scenario: Checking if componentX unittests ok  
    Given I have a vagrant project in "."    
    When I ssh to vagrantbox "default" with the following credentials: 
    | username | password|
    | vagrant  | vagrant | 
    And I run "cd /opt/awesome-frontend; RAILS_ENV=test rake" 
    And it should have exitcode 0

Testing HTTP access to a vagrant box

Besides running commands on the box, we wanted to be able to check HTTP things. The two main web testing gems in Ruby/Rails land are webrat and the newcomer on the block, Capybara. Both implement different 'browser' types to check your content: they have adapters for real browsers (firefox, chrome, safari) through selenium or the like. We only needed simple HTTP testing, no DOM checking. The usual suspect is 'rack/test', but as we don't have a rack application, that failed miserably. We found that webrat has another option through mechanize. The gem comes installed when you install cucumber-nagios, and the webrat web steps are implemented in the http_steps of cucumber-nagios.

Update 31/03/2011: if you are using capybara, there are two adapters that look like alternatives that would let you leave webrat:

  • the akephalos adapter, which aims to be a headless unit testing framework: https://github.com/bernerdschaefer/akephalos
  • the mechanize adapter: https://github.com/jeroenvandijk/capybara-mechanize

A feature would look like this:

Scenario: Surf to apache
Given I go to "http://localhost:9000" 
Then I should see "It works"

Similar to our ssh problem, you see that we have to point the URL at the port mapped by vagrant. This would also fail for virtual hosts, as it would not send the correct 'Host' header to the server.

Our enhanced vagrant version adds the Given I go to vagrant "url" syntax:

Scenario: Surf to apache via vagrant
Given I have a vagrant project in "."
Given I go to vagrant "http://www.sample.com" 
Then I should see "It works"

The matching step definition hands the URL to virtual_visit:

Given /^I go to vagrant "([^\"]*)"$/ do |url|
  virtual_visit(url)
end

The following snippet implements that virtual_visit:

  • it assumes @vagrant_env is loaded
  • it corrects the Host: header accordingly to make the request virtual host aware
  • it maps the url port to the forwarded port of the guest machine
  • the function is added to the webrat module so it is accessible in your steps

require 'uri'

module Webrat #:nodoc:
    class Session #:nodoc:
        def virtual_visit(url, data=nil, options = {})
          # Options = Headers in regular visit
            uri = URI.parse(url)

          # We default to the same port
            port = uri.port

          # Now we translate url port to vagrant port
          # These mappings of ports are global and not per machine
            if @vagrant_env.nil?
              raise "No vagrant environment got loaded"
            else
              @vagrant_env.config.vm.forwarded_ports.each do |name,mapping|
                # Assumption: the host side of each mapping is stored under :hostport
                port = mapping[:hostport] if mapping[:guestport]==uri.port
              end
            end

          # Override the hostname to the Headers 
            headers=options.merge({ 'Host' => uri.host+":"+port.to_s})

          # For the extended get method we need to wrap it
          # Traditional get method works 
          # => with an URL as first arg
          # => and second  = parameters (methods I guess)
          # But given some other arguments the get command behaves differently
          # See http://mechanize.rubyforge.org/mechanize/Mechanize.html for the source
          # https://github.com/brynary/webrat/blob/master/lib/webrat/adapters/mechanize.rb
          # https://github.com/brynary/webrat/blob/master/lib/webrat/core/session.rb

          # def get(options, parameters = [], referer = nil)
            @response = get({ 
            :headers => headers,
            :url => "#{uri.scheme}://localhost:#{port}#{uri.path}?#{uri.query}", 
            :verb => :get}, nil,options['Referer'])
        end
    end
end

Now we can use the standard URL and behind the scenes the URL is translated to the correct http request.
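The translation itself can also be tried outside Webrat. The sketch below is a simplified, standalone version of the idea: the forwarded-ports hash is hand-written sample data, and both the :hostport key and the helper name translate_url are assumptions rather than something taken from the Vagrant or Webrat sources.

```ruby
require 'uri'

# Translate a "virtual" URL into the URL to request on localhost plus the
# Host header to send along. forwarded_ports mimics the name => mapping hash
# used in the snippet; the :hostport key is an assumption.
def translate_url(url, forwarded_ports)
  uri = URI.parse(url)
  port = uri.port # default to the same port
  forwarded_ports.each_value do |mapping|
    port = mapping[:hostport] if mapping[:guestport] == uri.port
  end
  path = uri.path.empty? ? '/' : uri.path
  query = uri.query ? "?#{uri.query}" : ''
  { :url => "#{uri.scheme}://localhost:#{port}#{path}#{query}",
    :headers => { 'Host' => uri.host } }
end

if __FILE__ == $PROGRAM_NAME
  # Sample mapping: guest port 80 is reachable on host port 9000.
  ports = { 'http' => { :guestport => 80, :hostport => 9000 } }
  p translate_url('http://www.sample.com/index.html?lang=en', ports)
end
```

Unlike the Webrat snippet, this sketch leaves the port out of the Host header and only appends a query string when there is one; both are choices made for the illustration.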

Final note:

This is pretty much work in progress. I hope to contribute the vagrant and ssh steps to the cuken project to make them uniformly available. Also, while writing this blogpost it occurred to me that we need a vagrant-cucumber plugin that will generate the feature structure and integrate cucumber as a subcommand.

Also I'm aware that these are bad examples of BDD, as they don't express Business talk unless your customer is a Sysadmin :)

I've cut off this blogpost here. I did promise you the integration with Jenkins as a CI, so that's for the next blogpost.

Hope to hear from you if you found this useful.

Provisioning Workflow – Using Vsphere and Puppet

On a recent project we explored how to further integrate puppet and Vsphere to get EC2 like provisioning, all command-line based.

We leveraged the Vijava (Java Vsphere) interface. For the interested reader, I also wrote another blogpost on programming options for vmware Vsphere, and why libvirt for ESX was not (yet) an option.

The resulting Proof of Concept code can be found in the jvspherecontrol project on github.

The premise for starting the workflow is that the servername is added to the DNS first.

  • The name: <apptype>-<environment>-<instance>
    • web-prod-1.<domain>
  • The IP :
    • <IP-prefix>-<vlan-id>-<local ip>
    • : VLAN 30
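A naming convention like this is easy to check mechanically before a server enters the workflow. The following is a rough sketch only: the helper name parse_server_name and the example domain are hypothetical, and it assumes lowercase alphanumeric fields with a numeric instance, which the convention above does not spell out.

```ruby
# Fields encoded in a server name of the form
# <apptype>-<environment>-<instance>.<domain>
ServerName = Struct.new(:apptype, :environment, :instance, :domain)

# Split a fully qualified server name into its parts; returns nil when the
# name does not follow the convention. (Hypothetical helper.)
def parse_server_name(fqdn)
  match = /\A([a-z0-9]+)-([a-z0-9]+)-(\d+)\.(.+)\z/.match(fqdn)
  return nil unless match
  ServerName.new(match[1], match[2], match[3].to_i, match[4])
end

if __FILE__ == $PROGRAM_NAME
  # example.com stands in for the real domain.
  p parse_server_name('web-prod-1.example.com')
  p parse_server_name('some-random-host')
end
```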

In our situation, a typical server would have no default route, but would communicate with the outside exclusively through services mapped through a loadbalancer. This means that all VLAN and loadbalancing mappings would have to be created beforehand (that could be automated as well). We would have DNS entries standardized per VLAN for these kinds of services: proxy-30, ntp-30, dns-30.

We didn't want to have a dhcp/boot option running in each VLAN, so we decided that a newly created machine would boot in a separate 'Boot-VLAN' to do the initial kickstart. We would then disable that (boot) network interface (disconnected state in Vsphere) after the provisioning was done. The rest of the workflow is pretty standard.

Recently I've heard of an alternative way of tackling this problem: it involves creating JeOS ISO images on the fly for each server with the required network settings. The newly created ISO would be mounted on the Virtual Machine, and it would boot from there. This avoids the need to have a separate boot network interface that you need to disable afterwards.

Because of the VLAN-ed approach we could not have the puppetmaster contact the puppet clients directly. To make this work we leveraged Mcollective to have the clients listen to an AMQP server.

I'd love to hear about your provisioning approach! Are you doing something similar? Totally different? Any tricks to improve this? Thanks for sharing!

Using Vagrant as a Team

This blogpost goes into detail on how we leverage Vagrant in our day to day work. We use it with a team of 7 people to integrate a pretty complex application. To get an idea of the complexity:

  • We have a nodejs server talking to a redis database
  • a grails application that reads from the redis database and writes to a mysql db
  • a rails frontend that reads from the grails rest services and writes to a mysql db
  • a perl application importing data into the mysql db from an external source
  • the nodejs logs via flume to a hadoop storage
  • we extract data via sqoop from the hadoop storage

And all this is done on one Vagrant machine. We can't even imagine having to synchronize this setup on all the different development machines without Vagrant.

So thank you "Mitchell Hashimoto" and "John Bender" for this awesome tool!

We hope this blogpost (and the next ones in this series) will inspire you to do great things with it.

Preparing yourself for takeoff

Standard requirements

Vagrant as described on the website is a tool for building and distributing virtualized development environments.

  • In order to use it, you need to have some things in place :
  • We like to add the following to the mix (not strictly required)
    • we recommend the use of RVM
    • and have some version control (we use git) installed

Installing rvm (optional)

RVM is a great way of managing various aspects of ruby on a system. We really like it because:

  • it does everything in userland (no sudo for gems)
  • it allows the use of separate gemsets for each project/customer individually
  • it allows you to use different versions of ruby on the same machine

Installing it is plain easy:

$ bash < <( curl http://rvm.beginrescueend.com/releases/rvm-install-head )

To have your shell pick it up you can

$ source "$HOME/.rvm/scripts/rvm"

or to make it permanent add the following to your .bash_profile

# This loads RVM into a shell session.
[[ -s "$HOME/.rvm/scripts/rvm" ]] && source "$HOME/.rvm/scripts/rvm"

Setting up rvm

Up until now we only have the RVM scripts and no ruby yet. To install, for instance, ruby 1.9.2 on your system you can now run:

$ rvm install 1.9.2

Setting up a vagrant project called 'awesome' with rvm

Create a directory structure and move into it

$ mkdir awesome-vagrant
$ cd awesome-vagrant

Now create a file called .rvmrc

echo "rvm_gemset_create_on_use_flag=1" > .rvmrc
echo "rvm gemset use awesome-vagrant" >> .rvmrc
echo "rvm use 1.9.2" >> .rvmrc

Go back one directory

$ cd ..

Trigger the read of the .rvmrc (this works through bash hooks). This will ask you to trust your new .rvmrc file

$ cd awesome-vagrant

RVM has encountered a not yet trusted .rvmrc file in the
current working directory which may contain nasty code.

You should see the correct ruby and gem version now

$ ruby --version
$ gem --version

So now every time you enter the 'awesome-vagrant' directory it will have the correct gemset 'awesome-vagrant' loaded, and have the ruby version you like. Pretty cool, no?

Installing git (optional)

Most OSes now have a package available for git. Just use your favorite yum, apt, dpkg or whatever to install it.

On Mac OSX you can use macports; we use homebrew because you don't need root rights (it installs stuff in /usr/local/bin)

Alternatively, rvm provides a script based install of git

bash < <( curl http://rvm.beginrescueend.com/install/git )

Firing up the engines

Vagrant 101

Now that all the prerequisites are in place we can move on to the most basic example of using vagrant.

The example on the vagrant website goes like this

$ cd awesome-vagrant
$ gem install vagrant
$ vagrant box add base http://files.vagrantup.com/lucid32.box
$ vagrant init
$ vagrant up
$ vagrant ssh

Et voilà, that's all it takes to get you up and running as a developer with a lucid box! Pretty neat, eh? Under the covers the following happens:

  • vagrant box add base http://files.vagrantup.com/lucid32.box
    • it will download lucid32.box file
    • extract the lucid32.box file into your $HOME/.vagrant/boxes directory
    • and give it the name 'base'
  • vagrant init :
    • creates a file called 'Vagrantfile' in your current directory
    • when you look at the file, it will contain the directive config.vm.box = "base"; this is what makes the link to the box we called 'base'
    • you can further edit the Vagrantfile before you start it
  • vagrant up:
    • up until now, no virtual machine was created
    • therefore vagrant will import the disks from the box 'base' into Virtualbox
    • map via NAT the port 22 from your VM to a free local port
    • it will create a .vagrant file : a file that contains a mapping between your description 'base' and the UUID of the virtual machine
    • If you want to follow the magic, just start Virtualbox and you will see the machine being created
  • vagrant ssh:
    • this will look up the local port that the SSH port of the box is mapped to and will execute the SSH process to log into the machine
    • it uses the private key of the user vagrant to log in to the box, where the user vagrant has the matching public key set up

What about windows?

Some of our team members are not using a MacOSX or Linux variant but are running Windows.

There are some excellent instructions on getting Vagrant running on windows as a host:

  • Vagrant and Windows
  • Vagrant and Windows 64-Bit
  • Jruby, Winole32, Vagrant and Windows

We used the following:

  • Install Java 64 Bit version
  • Set $JAVA_HOME environment variable to the 64 Bit version
  • Put $JAVA_HOME/bin in your path
  • Install the Jruby 64 Bit (Ole version)
  • Put $Jruby/bin in your path
  • install the vagrant gem
  • use Putty instead of vagrant ssh subcommand
  • import the vagrant private key into your putty

Starting the vagrant command is a lot slower than under linux/macosx. I don't know why, but it slows the interaction down.

  • We found that destroying a windows box sometimes requires you to manually clean up the Virtualbox Machine directory of that virtual machine.

Running windows as a VM managed by Vagrant is currently still a dream. But the Winrm project is making good progress toward becoming an ssh alternative for windows machines. The opscode guys are already integrating winrm in chef/knife. Maybe I'll start writing a winrm vagrant plugin for that soon.

A word on Vagrant baseboxes

Up until recently, finding Vagrant baseboxes was a matter of searching the internet and finding the URLs on different individuals' websites. Gareth Rushgrove has done a great job by setting up vagrantbox.es, a central directory where you can submit your own baseboxes.

Those baseboxes are great, but you have to trust the one who packaged the box. In the future we might see vendors providing baseboxes for their setup similar to providing official AMI's on Amazon, but we're not there yet.

You can create a virtualbox virtual machine yourself (manual install, pxe install, or starting from an existing basebox), and then export it as a vagrant box

$ vagrant package --base my_base_box

Introducing veewee : an easy way to bootstrap new baseboxes

An alternative is to use veewee to bootstrap a machine automatically from scratch. This is a vagrant plugin I created that eases the creation of baseboxes. It simulates a manual install by leveraging VRDP to type a linux boot string and has the kickstart/preseed file read from an HTTP server.

The following is a rundown on how to create an ubuntu basebox with veewee

Install the gem:

$ gem install veewee

List the veewee basebox definitions available:

$ vagrant basebox templates 
The following templates are available:
vagrant basebox define '<boxname>' 'ubuntu-10.10-server-i386'
vagrant basebox define '<boxname>' 'ubuntu-10.10-server-i386-netboot'

Define a new box; this creates a definition directory

$ vagrant basebox define 'myubuntu' 'ubuntu-10.10-server-i386' 

Have a look at the files in the definition directory and change them if you want

$ ls definitions/myubuntu  
definition.rb postinstall.sh preseed.cfg

Build the box. Note this will download the necessary iso file if needed

$ vagrant basebox build 'myubuntu' 

Export the created vm as a basebox. This will finally create a myubuntubox.box

$ vagrant basebox export 'myubuntu'

It's still experimental, but we have automated installation working for various versions of Archlinux, Centos, Debian, Freebsd and Ubuntu. I think the benefit is that you don't need a PXE environment to set up machines, and that it allows you to test your preseed and kickstart files and version control the behavior of your basebox.

Remember it's code

Now is a good time to version control your awesome-vagrant project

$ cd 
$ git init
$ git add Vagrantfile
$ git commit -m "This was just my first commit"

Taking a test flight

Getting your code on board

Now that you have your basebox running and are able to log in to it, I know you are eager to start development. So let's grab that code you already wrote from git.

$ cd awesome-vagrant
$ git clone git@somerepo:/var/git/awesome-datastore
$ git clone git@somerepo:/var/git/awesome-frontend
$ git clone git@somerepo:/var/git/awesome-data

This results in the following structure

[DIR] awesome-vagrant
    - [DIR] awesome-datastore (component1)
    - [DIR] awesome-frontend (component2)
    - [DIR] awesome-data (component3)
    - Vagrantfile

Each directory shown here is a git repository that is checked out separately. Now we can mount this as directories inside our virtualmachine. This is what vagrant calls shared folders.

Our Vagrantfile looks like this

config.vm.share_folder "awesome-datastore", "/home/vagrant/awesome-datastore", "./awesome-datastore"
config.vm.share_folder "awesome-frontend", "/home/vagrant/awesome-frontend", "./awesome-frontend"
config.vm.share_folder "awesome-data", "/home/vagrant/awesome-data", "./awesome-data"

This will set up the directories inside your vm so you can edit them using your favorite IDE on your laptop and have the files instantly available inside your VM without the need for sync.

After editing the file you need to 'reboot' the machine for these settings to take effect

$ vagrant reload

We've hit quite a few problems with writing to shared folders. Standard Vagrant uses the Virtualbox Guest Additions to share a folder of your local/host machine with the virtual machine. There have been numerous complaints about their stability, and therefore you might want to check out the use of NFS to share the directories. Just add the NFS flag at the end. Please note that this requires an nfs-client to be installed in the basebox first.

config.vm.share_folder "awesome-frontend", "/home/vagrant/awesome-frontend", "./awesome-frontend",{:nfs => true}

The communication between host and vm is done over a host-only network, so your nfs shares will not get exposed to the outside world. Therefore you need to enable hostonly networking by adding the following to your vagrant file

config.vm.network ""

Don't forget to reload after changing that

$ vagrant reload

Adding config management to the mix

It might be tempting to log in to your new vagrant box and install a bunch of packages manually to get things started. You should all remember Willem van den Ende saying 'Server login considered harmful'.

The real power of vagrant is that it promotes the use of configuration management for that. Infrastructure as code, FTW!

Vagrant currently supports Chef-Solo, Chef, Puppet, Puppet-Server and bash scripting as 'provisioners'. Provisioners are different from traditional installation scripts, as they follow the idempotence principle: they can be run over and over again and get the same result.

The vagrant command to run this is:

$ vagrant provision

and provisioning is also run when you do a

$ vagrant up

If you don't want it to run, you can specify

$ vagrant up --no-provision
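The idempotence principle that provisioners follow can be illustrated with a few lines of plain ruby. This is only a sketch of the idea, not how Chef or Puppet implement it; the helper name ensure_line and the sample hosts entry are made up for the example.

```ruby
# Make sure a line is present in a file. Returns true when the file was
# changed and false when nothing had to be done. (Hypothetical helper.)
def ensure_line(path, line)
  content = File.exist?(path) ? File.read(path) : ''
  return false if content.lines.map(&:chomp).include?(line)
  File.open(path, 'a') do |file|
    # Start on a fresh line if the existing content lacks a trailing newline.
    file.puts unless content.empty? || content.end_with?("\n")
    file.puts(line)
  end
  true
end

if __FILE__ == $PROGRAM_NAME
  require 'tmpdir'
  Dir.mktmpdir do |dir|
    path = File.join(dir, 'hosts')
    # Only the first run changes the file; later runs are no-ops.
    3.times { puts ensure_line(path, '10.0.0.5 db.internal') }
    puts File.read(path)
  end
end
```

A traditional install script would append the line on every run; describing the desired end state instead is what makes repeated 'vagrant provision' runs safe.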

Chef-Solo sample setup

The setup and explanation of Chef is beyond the scope of this blogpost. There is a great description on the Opscode website on how to setup a chef repository.

    [DIR]cookbooks #those that come from opscode
    [DIR]site-cookbooks #or your own

A sample Vagrantfile snippet looks like this:

config.vm.provision :chef_solo do |chef|
    chef.cookbooks_path = ["awesome-chefrepo/cookbooks",
                           "awesome-chefrepo/site-cookbooks"]
    chef.log_level = "debug"
    # Assumption: the attributes below are merged into the provisioner's json
    chef.json.merge!({
                    :mysql => {
                        :server_root_password => "supersecret",
                        :server_repl_password => "supersecret",
                        :server_debian_password => "supersecret"},
                    :java => {
                        :install_flavor => "sun"}
    })
end

Running 'vagrant provision' will:

  • share the cookbooks_path's (and rolepaths,...) in the virtualmachine
  • generate a solo.rb configfile and transfer it to /tmp
  • generate a dna.json file: a merge of a vagrant json information and the json you provided
  • login to the virtualmachine as vagrant and do a
    sudo chef-solo -r solo.rb -j dna.json
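The merge step behind dna.json can be pictured as follows. This is a sketch under assumptions: whether Vagrant merges the two sets of attributes deeply or shallowly is not stated here, so the recursive merge, the helper names and the sample 'vagrant' attributes are all hypothetical.

```ruby
require 'json'

# Recursively merge two hashes; values from `override` win on conflicts.
def deep_merge(base, override)
  base.merge(override) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# Build the JSON document from provisioner-supplied and user-supplied
# attributes. (Hypothetical helper.)
def build_dna(provisioner_attributes, user_attributes)
  JSON.pretty_generate(deep_merge(provisioner_attributes, user_attributes))
end

if __FILE__ == $PROGRAM_NAME
  # Made-up stand-in for the information vagrant adds on its own.
  from_vagrant = { 'vagrant' => { 'directory' => '/vagrant' } }
  from_user    = { 'mysql' => { 'server_root_password' => 'supersecret' },
                   'java'  => { 'install_flavor' => 'sun' } }
  puts build_dna(from_vagrant, from_user)
end
```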

More detailed notes can be found on the Vagrant Provisioner website section

Puppet sample setup

James Turnbull wrote the puppet provisioner

We setup our puppet-repo like this:

    [DIR] manifests
    [DIR] modules

With the following corresponding puppet section in the Vagrantfile

config.vm.provision :puppet do |puppet|
    puppet.pp_path = "/tmp/vagrant-puppet"
    puppet.manifests_path = "./awesome-puppetrepo/manifests"
    puppet.module_path = "./awesome-puppetrepo/modules"
    puppet.manifest_file = "./awesome-puppetrepo/mybox.pp"
end

Where mybox.pp contains the manifest (for instance apache2) to be run on that box

package { "apache2": ensure => 'installed' }

Running 'vagrant provision' will:

  • share the manifests_paths + module_paths in the virtualmachine
  • transfer your manifest_file to /tmp
  • login to the virtualmachine as vagrant and do a
    sudo puppet --modulepath awesome-puppetrepo/modules mybox.pp

More detailed notes can be found in the Provisioners section of the Vagrant website.

Opening up the box - Network

Now that you have both your code and your environment set up inside the VM, the next step is to gain access to some of the network services. Vagrant makes this damn easy by mapping ports inside the VM to ports on your local system.

  # Forward a port from the guest to the host, which allows for outside
  # computers to access the VM, whereas host only networking does not.
  # config.vm.forward_port "http", 80, 8080
  config.vm.forward_port "awesome-datastore", 8080, 8080
  config.vm.forward_port "awesome-frontend", 8000, 80

Again, to make these mappings take effect you need to restart the Vagrant box:

$ vagrant reload

Now you can surf to your http://localhost:8080 and access your frontend inside the box

Overcoming bad network performance

We noticed that some of our network services would perform badly when accessed from the outside yet be fast from the inside. At first we suspected the VirtualBox NAT networking of slowing things down, but it turned out that DNS resolution was causing the delays. We had been told before that 'Everything is a Freaking DNS problem' and yes:

  • depending on the network you were on, DNS was badly set up for resolving internal IPs. Check your resolver.
  • we had libavahi installed (apparently it came with Java), so we had to disable that to speed up the resolving

Tuning your engines

Customizing Vagrantfile

The great thing about the Vagrantfile is that it is actual Ruby code.

Settings for only some hosts

The following snippet allows us to still share the Vagrantfile but lets people use NFS if they need it:

# Switching to nfs for only those who want it
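nfs_hosts=%w(mylaptop1 ruben-meanmachine)

require 'socket'
# Short hostname of the machine running vagrant
my_hostname=Socket.gethostname.split('.').first

# No special share options unless this host opted in to NFS
share_flags={}
if nfs_hosts.include?(my_hostname)
      # Assign this VM to a host only network IP, allowing you to access it
      # via the IP.
      config.vm.network ""
      share_flags={:nfs => true}
end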

Enabling different settings based on environment

Besides people developing code or configuration management code, we also have people who use a Vagrant machine to give demos at various places. They pull the latest version from git and are able to have the VM built with the latest features enabled.

For the demos they don't need to have the shared directories of all the code components available. We introduced the notion of

  • vagrant_env : development, test, production, demo
  • awesome_mode : a flag indicating what mode our applications should run in

Both can be set as environment variables and are picked up by the Vagrantfile

if (vagrant_env=="development" && ENV["AWESOME_MODE"]!="demo")

Setting these variables is as easy as prepending them to the vagrant command

$ vagrant_env=development vagrant up
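A minimal sketch of how such switches might be read inside the Vagrantfile. The default values and the helper names `setting` and `share_code?` are assumptions for illustration, not taken from our actual file.

```ruby
# Read a switch from the environment, falling back to a default when unset.
def setting(env, name, default)
  value = env[name]
  (value.nil? || value.empty?) ? default : value
end

# Share the code directories only for development boxes that are not in demo mode.
def share_code?(env)
  vagrant_env  = setting(env, "vagrant_env", "development")
  awesome_mode = setting(env, "AWESOME_MODE", "normal")
  vagrant_env == "development" && awesome_mode != "demo"
end

puts share_code?(ENV)
```

Passing the environment in as an argument keeps the decision easy to try out with a plain hash instead of the real `ENV`.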

And now as a team, please:

Some observations

We've been using vagrant as a team for about 2 months now and here are some observations we made:

  • It clearly helps everybody to have a consistent environment to develop against; the latest version is just one git pull away.
  • The central approach drives people to a) do frequent commits and b) do stable commits.

  • The task of writing recipes/manifests is not picked up by all team members, and seems to stay the main job of the system-oriented people on the team.

  • Reading manifests helps people understand what is needed and makes it easy to point out what needs to be changed. But learning the skills to write recipes/manifests is a blocking factor, just like having a backend developer write frontend code.
  • When manifest/recipes are modified during a sprint, provisioning an existing virtual machine might fail as we don't take migrations from one VM state to the other into account. In that case, a box destroy and full provision is required.
  • The test the admins do before committing their manifests, is that they destroy their own 'development' box and re-provision a new box to see if this works.

  • The longer the provision takes, the less frequently people do it. It's important to keep that process as fast as possible. It's all about feedback and we want it fast.

  • Installation problems would get noticed far sooner in the process.
  • People would only do a full rebuild in the morning when getting their coffee.

Having both development and production mode running on the Vagrant box

In our environment we have both the development and the production version running.

For instance, for our rails component we have:

  • a share of the awesome-frontend inside the box (and when this starts it runs on port 3000)
  • we have the production mode running on port 80 (pulled from git as the latest tagged production)

This allows us to easily have both versions running. The production version is installed by a manifest/recipe and the share is started manually.

In summary

Vagrant rocks, but by now you should know!

Don't worry, the journey continues...

In the next post, I'll show you how we set up Vagrant with testing and use a continuous integration environment to have it build a new box, run the tests and make everybody happy. So stay tuned!

For additional inspiration:

Monitoring Framework: Event Correlation

Since my last post I’ve spoken to a lot of people all excited to see something fresh in the monitoring space. I’ve learned a lot – primarily what I learned is that no one tool will please everyone. This is why monitoring systems are so hated – they try to impose their world view, they’re hard to hack on and hard to get data out of. This served only to reinforce my belief that rather than build a new monitoring system I should build a framework that can build monitoring systems.

DevOps shops who can cut code, should be able to build the monitoring they want, not the monitoring their vendor thought they want.

Thus my focus has not been on how can I declare relationships between services, or how can I declare an escalation matrix. My focus has been on events and how events relate to each other.

Identifying an Event
Events can come from many places, in the recent video demo I did you saw events from Nagios and events from MCollective. I also have event bridges for my Apache Blackbox, SNMP Traps and it would be trivial to support events from GitHub commit hooks, Amazon SNS and really any conceivable source.

Events need to be identified then so that you can send information related to the same event from many sources. Your trap system might raise a trap about a port on a switch but your stats poller might emit regular packet counts – you need to know these 2 are for the same port.

You identify events by subject and by name; together they make up the event identity. The subject might be the FQDN of a host and the name might be load or cpu usage.

This way if you have many ways to input information related to some event you just need to identify them correctly.

Finally, as each event gets stored it is given a unique ID that you can use to pull out information about just that specific instance of an event.
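To make the identity idea concrete, here is a small sketch. The `Event` struct, its field names and the in-memory `archive` hash are all assumptions for illustration; the framework's real data model is not shown in this post.

```ruby
require 'securerandom'

# An event is identified by subject + name; every stored instance
# additionally gets its own unique id.
Event = Struct.new(:subject, :name, :origin, :id) do
  def identity
    [subject, name]
  end
end

# Assign a unique id and keep the event in a simple in-memory archive.
def store(archive, event)
  event.id = SecureRandom.uuid
  archive[event.id] = event
  event
end

archive = {}
trap_event  = store(archive, Event.new("switch1.example.com", "port 12", "snmp-trap"))
stats_event = store(archive, Event.new("switch1.example.com", "port 12", "stats-poller"))

puts trap_event.identity == stats_event.identity  # same event, different sources
puts trap_event.id != stats_event.id              # distinct stored instances
```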

Types Of Event
I have identified a couple of types of event in the first iteration:

  • Metric – An event like the time it took to complete a Puppet run or the number of GET requests served by a vhost
  • Status – An event associated with an up/down style state transition, can optionally embed a metrics event
  • Archive – An event that you just wish to archive along with others for later correlation, like a callback from GitHub saying code was committed and by whom

The event you see on the right is a metric event – it doesn’t represent one specific status and it’s a time series event which in this case got fed into Graphite.

Status events get tracked automatically – a representation is built for each unique event based on its subject and name. This status representation can progress through states like OK, Warning, Critical etc. Events sent from many different sources get condensed and summarized into a single status representing how that status looks based on the most recently received data – regardless of the source of the data.

Each state transition and each non 0 severity event will raise an Alert and get routed to a – pluggable – notification framework or frameworks.
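The status tracking and alerting rule above might look roughly like this. It is a sketch under assumptions: the numeric severities, the `StatusTracker` class and the choice not to treat the very first sighting of an event as a transition are mine, not the framework's.

```ruby
# Severity 0 is OK; anything higher is a problem state (assumed mapping).
SEVERITIES = { 0 => "OK", 1 => "Warning", 2 => "Critical" }

class StatusTracker
  attr_reader :alerts

  def initialize
    @states = {}   # [subject, name] => last seen severity
    @alerts = []
  end

  # Record the latest severity for an identity, regardless of source.
  # An alert is raised on every state transition and on every
  # non-zero severity event.
  def update(subject, name, severity)
    key = [subject, name]
    previous = @states[key]
    @states[key] = severity
    transition = !previous.nil? && previous != severity
    if transition || severity != 0
      @alerts << { subject: subject, name: name, from: previous, to: severity }
    end
  end

  def state(subject, name)
    SEVERITIES[@states[[subject, name]]]
  end
end

tracker = StatusTracker.new
tracker.update("web1", "load", 0)   # first OK: no alert
tracker.update("web1", "load", 2)   # OK -> Critical: alert
tracker.update("web1", "load", 2)   # still Critical: alert (non-zero severity)
tracker.update("web1", "load", 0)   # Critical -> OK: alert
puts tracker.alerts.size
```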

Event Associations and Metadata

Events can have a lot of additional data past what the framework needs, this is one of the advantages of NoSQL based storage. A good example of this would be a GitHub commit hook. You might want to store this and retain the rich data present in this event.

My framework lets you store all this additional data in the event archive and later on you can pick it up based on event ID and get hold of all this rich data to build reactive alerting or correction based on call backs.

Thanks to conversations with @unixdaemon I’ve now added the ability to tag events with some additional data. If you are emitting many events from many subsystems out of a certain server you might want to embed into the events the version of software currently deployed on your machine. This way you can easily identify and correlate events before and after an upgrade.

Event Routing
So this is all well and fine, I can haz data, but where am I delivering on the promise to be promiscuous with your data routing it to your own code?

  • Metric data can be delivered to many metrics emitters. The Graphite one is about 50 lines of code, you can run many in parallel
  • Status data is stored and state transitions result in Alert events. You can run many alert receivers that implement your own desired escalation logic

For each of these you can write routing rules that tell it what data to route to your code. You might only want data in your special metrics consumer where subject =~ /blackbox/.
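A routing rule of that kind can be sketched as a predicate paired with a consumer. The `Rule` struct, the `route` helper and the event hash layout are hypothetical names chosen for this example.

```ruby
# A routing rule pairs a predicate with a consumer; both are plain callables.
Rule = Struct.new(:matcher, :consumer)

# Hand the event to every consumer whose rule matches; returns the match count.
def route(event, rules)
  rules.select { |rule| rule.matcher.call(event) }
       .each   { |rule| rule.consumer.call(event) }
       .size
end

blackbox_metrics = []
rules = [
  Rule.new(->(e) { e[:type] == :metric && e[:subject] =~ /blackbox/ },
           ->(e) { blackbox_metrics << e })
]

route({ type: :metric, subject: "blackbox.www.example.com", name: "requests" }, rules)
route({ type: :metric, subject: "db1.example.com", name: "load" }, rules)
puts blackbox_metrics.size
```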

I intend to sprinkle the whole thing with a rich set of callbacks where you can register code that declares an interest in metrics, alerts, status transitions etc. in addition to the big consumers.

You’d use this code to correlate the number of web requests in a metric with the ones received 7 days ago. You can then decide to raise a new status event that will alert Ops about trend changes proactively. Or maybe you want to implement your own auto-scaler where you’d provision new servers on demand.
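Such a callback could be as small as the sketch below. The `trend_check` name, the 50% threshold and the shape of the returned status event are assumptions for illustration only.

```ruby
WEEK = 7 * 24 * 60 * 60  # seconds

# history maps a Unix timestamp to the request count seen at that time.
# Returns a new status event when traffic moved by more than `threshold`
# (a fraction) against the same moment a week earlier, otherwise nil.
def trend_check(history, now, threshold = 0.5)
  current  = history[now]
  previous = history[now - WEEK]
  return nil if current.nil? || previous.nil? || previous.zero?

  change = (current - previous).abs.to_f / previous
  return nil if change <= threshold

  { type: :status, name: "request trend", severity: 1, change: change }
end

now = 1_300_000_000
history = { now - WEEK => 100, now => 180 }
puts trend_check(history, now).inspect
```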

How does it scale? Horizontally. My tests have shown that even on modest (virtual) hardware I am able to process and route in excess of 10 000 events a minute. If that isn’t enough you can scale out horizontally by spreading the metric, status and callback processing over multiple physical systems. Each of the metric, status and callback handlers can also scale horizontally over clusters of servers.

Bringing It All Together
So to show that this isn’t all just talk, here are 2 graphs.

This graph shows web requests for a vhost and the times when Puppet ran.

This graph shows Load Average for the server hosting the site and times when Puppet ran.

What you’re seeing here is a correlation of events from:

  • Metric events from Apache Blackbox
  • Status and Metric events for Load Averages from Nagios
  • Metric events from Puppet pre and post commands, these are actually metrics of how long each Puppet run was but I am showing it as a vertical line

This is a seamless blend of time series data, status data and randomly occurring events like when Puppet runs, all correlated and presented in a simple manner.

Exclusive DevOps.com Interview with @DEVOPS_BORAT

great pic of @devops_borat by @mattokeefe

@DEVOPS_BORAT has exploded onto the DevOps scene as of late, via Twitter. He won Best Cloud Philosopher and Best Cloud Tweet at Cloudy Awards 2011. DevOps.com is pleased to share with you an exclusive interview:

@mattokeefe: Congrats on winning multiple Cloudies! Were you able to attend the award ceremony?

@DEVOPS_BORAT: In Kazakhstan we have old saying “If you can lean, you can clean”. Is why I not attend conference but prefer deploy infrastructure on all possible cloud provider.

@mattokeefe: Totally understandable. So where do you work and what is your role?

@DEVOPS_BORAT: I work in small startup in Almaty Kazakhstan. Is part of many startup launch by incubator company with name which is translate as Bird. In Kazakh language is spinoff know as Dropping, so our startup is Bird Dropping #53 finance by venture oil capitalist. We are specialize in social networking in the cloud with emphasis on Human To Android relationship.

In day to day job I have title of Senior Manager of Operation. I manage of myself and of Azamat who has title of Junior Manager of Operation and he only manage himself. We are sufficient manpower for deal with all devops issue in the cloud because we have everything automated. Also we use NoSQL which make it very easy scale, I can not able say at infinite but for practical purpose is infinite.

@mattokeefe: Did you start your career in development or operations? When did you first hear about DevOps?

@DEVOPS_BORAT: Word devops is start with dev then ops. I start career in development of C++ (as small detail, in Kazakh language is pronounce ++C which is more correct). I am happen of agree with Joel Overflow that programmer need learn C/C++ first so they understand pointer. If you not experience null pointer segfault is like you not experience sexytime!

I hear of DevOps in past 2 or 3 years, is coincide with downfall of Agile and rise of cloud and also of Twitter. First reason DevOps was create is because is easier type #devops than #developer #sysadmin but correct name is in actual OpsDev.

For practice DevOps I recommend first follow cloud expert and devops expert on Twitter. Next step is automate bulls shit out of everything.

@mattokeefe: I know what you mean about null pointer segfaults. I’ve seen Java log files full of NullPointerExceptions. When I showed the developers, they said “Oh, those are harmless. You can ignore them.” But they never went away, and I worried that Ops wouldn’t detect a real problem later. Is this something that DevOps can fix?

@DEVOPS_BORAT: In startup if we hear such comment from developer we immediately put them on pager for 1 month. Next time is 2 month and so on. Problem can not be able fix by DevOps. Only way to fix is not use Java in first place. All DevOps rock star are use Scala or Clojure.

@mattokeefe: Wow, you really are hardcore. So tell me, what DevOps tools do you use, and what do you find missing? Are you following the devops-toolchain project?

@DEVOPS_BORAT: Between my personally and Azamat we are try very hard for use all available DevOps tool. Nothing is perfect though so we end up roll our own tool. We tentative call tool Swiss Army Electric Saw, is good for monitor, alert, visual metric, queue, deploy, continuous integration and continuous delivery. Tool is base on node.js so it eliminate disk I/O. We also try hard eliminate network traffic by only allow 56k bandwidth for legacy customer.

I read page of devops-toolchain, I can not able comprehend with limited English. Is philosophic dissertation yes? Is very good if somebody is able get PhD out of it.

@mattokeefe: You guys are very ambitious with tooling. It sounds like you could use more help. If you could hire just one more person for your team, would you choose a developer interested in learning Operations, or an Ops guy looking to learn how to code?

@DEVOPS_BORAT: We are always search for mythical centaur creature 1/2 dev and 1/2 ops. We have business idea of launch RoR Web site for dating of dev and ops. We are hope for ROI in approximate 20 year.

@mattokeefe: Awesome. Are you looking for investors? The DevOps market seems to be heating up, despite Damon Edwards talking about shark jumping and Some Forrester Guy asking for NoOps. What do you think of these remarks?

@DEVOPS_BORAT: We have lot of interest from oil and natural gas baron in Russia. Not need VC dollar from U S and A. As matter of fact region of Almaty Kazakhstan is know as Silicon Camel Hump of Central Asia.

I read blog post of Forrester guy. Content is Noop. In my opinion DevOps is just sign of what is for come. Is going be follow by DevQaOps, then DevUxQaOps, DevUxQaSecOps and in final is pinnacle of Internet Jedi Samurai Jason Calacanis.

@mattokeefe: Well I am blown away by the wisdom that you have imparted so far! I can’t wait to share this with our readers, so I will publish part one of this interview ASAP. Let’s see what sort of questions our readers raise in the comments section, yes? Then I hope that we can have a follow up interview.
Now, just to wrap up part one, how do you celebrate successful DevOps achievements… erm… “happy sexytime” in your country? Do you enjoy pizza and beer, or some other custom?

@DEVOPS_BORAT: In startup we have quiet celebration usual. Is involve 2-3 keg of vodka. Is loud first 30 minute then very quiet. Azamat is last to stand. In past our ancestor use of shoot ibex after celebration. In startup we continue tradition by terminate random node in cloud after party, is same thrill of feeling.

@mattokeefe: Oh no, WordPress is down! I hope they didn’t sustain collateral damage from your Chaos Monkey!

“We’re experiencing some problems on WordPress.com and we are in read-only mode at the moment. We’re working hard on restoring full service as soon as possible, but you won’t be able to create or make changes to your site currently.”

Thanks for the interview!

"Meet the DevOps Experts" panel at Cloud Connect 2011 (video)

I had the pleasure of being the track chair for the DevOps Track at Cloud Connect 2011. It was a short track (3 sessions) but thankfully a great lineup of all-stars accepted my invitation to participate!

Here is the video from the panel featuring all of our DevOps Track speakers. The audio is a bit soft in parts, but I think you'll find it to be great content. I played the role of moderator and spent most of my time in the audience getting questions from the attendees. 


Andrew Shafer - Cloud Scaling
Teyo Tyree - Puppet Labs
Alex Honor - DTO Solutions / Rundeck Project
James Urquhart - Cisco
Juan Paul Ramirez - Shopzilla
Lloyd Taylor - ngmoco:)


Note: Some of the attendees had video cameras out and may have recorded the other sessions in the track. If I uncover those videos, I'll post them ASAP.


Thinking about monitoring frameworks

I’ve been Tweeting a bit about some prototyping of a monitoring tool I’ve been doing and had a big response from people all agreeing something has to be done.

Monitoring is something I’ve been thinking about for ages but to truly realize my needs I needed mature discovery based network addressing and ways to initiate commands on large numbers of hosts in a parallel manner. I have this now in MCollective and I feel I can start exploring some ideas of how I might build a monitoring platform.

I won’t go into all my wishes, but I’ll list a few big ones as far as monitoring is concerned:

  • Current tools represent a sliding scale, you cannot look at your monitoring tool and ever know current state. Reported status might be a window of 10 minutes and in some cases much longer.
  • Monitoring tools are where data goes to die. Trying to get data out of Nagios and into tools like Graphite, OpenTSDB or really anywhere else is a royal pain. The problem gets much harder if you have many Nagios probes. NDO is an abomination as is storing this kind of data in MySQL. Commercial tools are orders of magnitude worse.
  • Monitoring logic is not reusable. Today with approaches like continuous deployment you need your monitoring logic to be reusable by many different parties. Deployers should be able to run the same logic on demand as your scheduled monitoring does.
  • Configuration is a nightmare of static text, or worse click driven databases. People mitigate this with CM tools but there is still a long turn around time from node creation to monitored. This is not workable in modern cloud based and dynamic systems.
  • Shops with skilled staff are constantly battling decades old tools if they want to extend them to create metrics driven infrastructure. It’s all just too ’90s.
  • It does not scale. My simple prototype can easily do 300+ checks a second, including processing replies, archiving, alert logic and feeding external tools like Graphite. On a GBP20/month virtual machine. This is inconceivable with most of the tools we have to deal with.

I am prototyping some ideas at the moment to build a framework to build monitoring systems with.

There’s a single input queue on a middleware system, I expect an event in this queue – mine is a queue distributed over 3 countries and many instances of ActiveMQ.

The event can come from many places maybe from a commit hook at GitHub, fed in from Nagios performance data or by MCollective or Pingdom, the source of data is not important at all. It’s just a JSON document that has some structure – you can send in any data in addition to a few required fields, it’ll happily store the lot.
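As an illustration of such a document, here is a sketch that builds one event and checks it for required fields. The field names in `REQUIRED` are an assumption; the actual list of required fields is not given in this post.

```ruby
require 'json'

# Assumed required fields; the real prototype's list is not given here.
REQUIRED = %w(subject name origin eventtime)

# True when the text parses as a JSON object carrying all required fields.
def valid_event?(json)
  doc = JSON.parse(json)
  doc.is_a?(Hash) && REQUIRED.all? { |field| doc.key?(field) }
rescue JSON::ParserError
  false
end

event = {
  "subject"   => "web1.example.com",
  "name"      => "load",
  "origin"    => "nagios",
  "eventtime" => 1300000000,
  "metrics"   => { "load1" => 0.42 },
  "anything"  => "extra data is simply stored along with the rest"
}.to_json

puts valid_event?(event)
```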

From there it gets saved into a capped collection on MongoDB in its entirety and gets given an eventid. It gets broken into its status parts and its metric parts and sent to any number of recipient queues. In the case of Metrics for example I have something that feeds Graphite, you can have many of these all active concurrently. Just write a small consumer for a queue in any language and do with the events whatever you want.

In the case of statuses it builds a MongoDB collection that represents the status of an event in relation to past statuses etc. This will notice any state transition and create alert events; alert events again can go to many destinations – right now I am sending them to Angelia, but there could be many destinations with different filtering and logic for how that happens. If you want to build something to alert based on trends of past metric data, no problem. Just write a small plugin, in any language, and plug it into the message flow.
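The storage and fan-out steps described above can be sketched in memory. This is not the prototype's code: `CappedArchive` only imitates the fixed-size, oldest-first-out behaviour of a capped collection, and plain arrays stand in for the recipient queues.

```ruby
# In-memory stand-in for a capped collection: fixed size, oldest entries dropped.
class CappedArchive
  def initialize(limit)
    @limit, @docs, @next_id = limit, [], 0
  end

  # Store the full event and return the eventid it was given.
  def save(event)
    @next_id += 1
    @docs << event.merge("eventid" => @next_id)
    @docs.shift if @docs.size > @limit
    @next_id
  end

  def find(eventid)
    @docs.find { |doc| doc["eventid"] == eventid }
  end
end

# Break a stored event into its parts and hand each to its recipient queues.
def dispatch(event, eventid, queues)
  %w(status metrics).each do |part|
    next unless event.key?(part)
    queues[part].each { |q| q << { "eventid" => eventid, part => event[part] } }
  end
end

archive = CappedArchive.new(2)
queues  = { "status" => [[]], "metrics" => [[], []] }  # two metric consumers
event   = { "subject" => "web1", "name" => "load", "metrics" => { "load1" => 0.4 } }
id = archive.save(event)
dispatch(event, id, queues)
puts queues["metrics"].map(&:size).inspect
```

Every consumer receives the eventid, so the full original document stays one lookup away for as long as it remains in the archive.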

At any point through this process the eventid is available and should you wish to get hold of the original full event it’s a simple lookup away – there you can find all the raw event data that you sent – stored for quick retrieval in a schemaless manner.

In effect this is a generic pluggable event handling system. I currently feed it from MCollective using a modified NRPE agent and I am pushing my Nagios performance data in real time. I have many Nagios servers distributed globally and they all just drop events into their nearest queue entry point.

Given that it’s all queued and persisted to disk I can create a really vast number of alerts using MCollective – it’s trivial for me to create 1000 check results a second. The events have the timestamp attached of when the check was done and even if the consumers are not keeping up the time series databases will get the events in the right order and with the right timestamps. So far on a small VM that runs Puppetmaster, MongoDB, ActiveMQ, Redmine and a lot of other stuff I am very comfortably sending 300 events a second through this process without even tuning or trying to scale it.

When I look at a graph of the load average of 50 servers I see the graph change at the same second for all nodes – because I have an exact single point in time view of my server estate, and which 50 servers I am monitoring in this manner is determined using discovery on MCollective. Discovery is obviously no good for monitoring in general – you don’t know the state of stuff you didn’t discover – but MCollective can build a database of truth using registration – correlate discovery against registration and you can easily identify missing things.
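Correlating discovery against registration boils down to a set difference. A tiny sketch, with made-up host names and a hypothetical `missing_hosts` helper:

```ruby
require 'set'

# Hosts that registered at some point but did not answer this discovery run.
def missing_hosts(registered, discovered)
  (Set.new(registered) - Set.new(discovered)).to_a.sort
end

registered = %w(web1 web2 db1)
discovered = %w(web1 db1)
puts missing_hosts(registered, discovered).inspect
```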

A free side effect of using an async queue is that horizontal scaling comes more or less for free, all I need to do is start more processes consuming the same queue – maybe even on a different physical server – and more capacity becomes available.
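The scaling property can be demonstrated with threads standing in for consumer processes and Ruby's `Queue` standing in for the middleware. This is an assumption-laden toy, not how ActiveMQ consumers are actually wired up.

```ruby
require 'thread'

# Drain a shared queue with `workers` consumers; each event is handled once.
def consume(queue, workers)
  handled = Queue.new
  threads = Array.new(workers) do |n|
    Thread.new do
      loop do
        event = queue.pop
        break if event == :stop
        handled << [n, event]   # remember which worker took the event
      end
    end
  end
  workers.times { queue << :stop }  # one stop marker per consumer
  threads.each(&:join)
  Array.new(handled.size) { handled.pop }
end

queue = Queue.new
100.times { |i| queue << i }
results = consume(queue, 4)
puts results.size
```

Adding capacity means raising the worker count; the producers do not need to change.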

So this is a prototype, it’s not open source – I am undecided what I will do with it, but I am likely to post some more about its design and principles here. Right now I am only working on the event handling and routing aspects as the point in time monitoring is already solved for me as is my configuration of Nagios, but those aspects will be mixed into this system in time.

There’s a video of the prototype receiving monitor events over mcollective and feeding Loggly for alerts here.

Watching the Guards

A couple of weeks ago I noticed a weird drop in web usage stats on the site you are browsing now. Kinda weird as the drop was right around Fosdem when usually there is a spike in traffic.

So before you start: no, I don't practice what I preach on my own blog. It's a blog dammit, so I do the occasional upgrade on the actual platform, with backups available, do some sanity tests and move on. Yes, I break the theme pretty often, but y'all are reading this through RSS anyhow.

My backups showed me that drush had made a copy of the Piwik module somewhere in early February, exactly when this drop started showing. I verified the module, I verified my Piwik (Oh, Piwik, you say? Yes, Piwik: if you want a free alternative to Google Analytics, Piwik rocks). I even checked other sites using the same Piwik setup and they were all still functional, happily humming and being analyzed... everything fine... but traffic stayed low.

This taught me I actually had to upgrade my Piwik too ...

So that brings me to the point I'm actually wanting to make.
As @patrickdebois asks in his chapter on Monitoring, "Quis custodiet ipsos custodes?": who's monitoring the monitoring tools, and who's monitoring the analytics tools?

So not only should you monitor the availability of your monitoring tools, you should also monitor whether their API has changed in some way or another.
Just like when you are monitoring a web app: you shouldn't just see if you can connect to the appropriate HTTP port, you should also check that you get sensible results back from it, no gibberish.
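A check along those lines could look like this sketch. The JSON response shape and the "visits" field are made up for the example and are not Piwik's actual API.

```ruby
require 'json'

# A check passes only when the response arrived with a success code AND
# carries the content we expect; a successful connect alone is not enough.
def healthy?(status_code, body)
  return false unless status_code == 200
  data = JSON.parse(body)
  data.is_a?(Hash) && data["visits"].is_a?(Integer) && data["visits"] >= 0
rescue JSON::ParserError
  false
end

puts healthy?(200, '{"visits": 12}')
puts healthy?(200, '<html>Upgrade required</html>')
```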

But then again ... there's no revenue in my blog or its statistics :)

Ad-Hoc Configuration, Coordination and the value of change

For those who don't know, I'm currently in Boston for DevOps Days. It's been amazing so far and I've met some wonderful people. One thing that was REALLY awesome was the open space program that Patrick set up. You won't believe it works until you've tried it. It's really powerful.

In one of our open spaces, the topic of ZooKeeper came up. At this point I made a few comments, and at the additional prodding of everyone went into a discussion about ZooKeeper and Noah. I have a tendency to monopolize discussions around topics I'm REALLY passionate about so many thanks for everyone who insisted I go on ;)

Slaughter the deviants!
The most interesting part of the discussion about ZooKeeper (or at least the part I found most revealing) was that people tended to have trouble really seeing the value in it. One of the things I've really wanted to do with Noah is provide (via the wiki) some really good use cases about where it makes sense.

I was really excited to get a chance to talk with  Alex Honor (one of the co-founders of DTO along with Damon Edwards) about his ideas after his really interesting blog post around ad-hoc configuration. If you haven't read it, I suggest you do so.

Something that often gets brought up and, oddly, overlooked at the same time is where ad-hoc change fits into a properly managed environment (using a tool like puppet or chef).

At this point, many of you have gone crazy over the thought of polluting your beautifully organized environment with something so dirty as ad-hoc changes. I mean, here we've spent all this effort on describing our infrastructure as code and you want to come in and make a random, "undocumented" change? Perish the thought!

However, as with any process or philosophy, strict adherence without understanding WHEN to deviate will only lead to frustration. Yes, there is a time to deviate and knowing when is the next level of maturity in configuration management.

So when do I deviate
Sadly, knowing when it's okay to deviate is as much a learning experience as it was getting everything properly configured in the first place. To make it even worse, that knowledge is most often specific to the environment in which you operate. The whole point of the phrase ad-hoc is that it's..well...ad-hoc. It's 1 part improvisation/.5 parts stumbling in the dark and the rest is backfilled with a corpus of experience. I don't say this to sound elitist.

So, really, when do I deviate? When/where/why and how do I deviate from this beautifully described environment? Let's go over some use cases and point out that you're probably ALREADY doing it to some degree.

Production troubleshooting
The most obvious example of acceptable deviation is troubleshooting. We pushed code, our metrics are all screwed up and we need to know what the hell just happened. Let's crank up our logging.

At this point, by changing your log level, you've deviated from what your system of record (your CM tool) says you should be. Our manifests, our cookbooks, our templates all have us using a loglevel of ERROR but we just bumped up one server to DEBUG so we could troubleshoot. That system is now a snowflake. Unless you change that log level back to ERROR, you now have one system that, until you do a puppet run or chef-client run, is different from all the other servers of that class/role.

Would you codify that in the manifest? No. This is an exception. A (should be) short-lived exception to the rules you've defined.
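Spotting such snowflakes is a comparison between what the system of record declares and what each server reports. A minimal sketch with made-up host names; how the actual settings get collected is left out and would depend on your tooling.

```ruby
# Compare what the CM tool declares with what each server actually runs.
def snowflakes(declared, actual)
  actual.reject { |_host, settings| settings == declared }.keys.sort
end

declared = { "loglevel" => "ERROR" }
actual = {
  "app1" => { "loglevel" => "ERROR" },
  "app2" => { "loglevel" => "DEBUG" },  # bumped by hand for troubleshooting
  "app3" => { "loglevel" => "ERROR" }
}
puts snowflakes(declared, actual).inspect
```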

Dynamic environments
Another area where you might deviate is in highly elastic environments. Let's say you've reached the holy grail of elasticity. You're growing and shrinking capacity based on some external trigger. You can't codify this. I might run 20 instances of my app server now but drop back down to 5 instances when the "event" has passed. In a highly elastic environment, are you running your convergence tool after every spin up? Not likely. In an "event" you don't want to have to take down your load balancer (and thus affect service to the existing instances) just to add capacity. A bit of a contrived example but you get the idea.

So what's the answer?
I am by far not the smartest cookie in the tool shed but I'm opinionated so that has to count for something. These "exception" events are where I see additional tools like Zookeeper (or my pet project Noah) stepping in to handle things.

Distributed coordination, dynamically reconfigurable code, elasticity and environment-aware applications.
These are all terms I've used to describe this concept to people. Damon Edwards provided me with the last one and I really like it.

Enough jibber-jabber, hook a brother up!
So before I give you the ability to shoot yourself in the foot, you should be aware of a few things:

  • It's not a system of record

Your DDCS (dynamic distributed coordination service as I'll call it because I can't ever use enough buzzwords) is NOT your system of record. It can be but it shouldn't be. Existing tools provide that service very well and they do it in an idempotent manner.

  • Know your configuration

This is VERY important. As I said before, much of this is environment specific. The category of information you're changing in this way is more "transient" or "point-in-time". Any given atom of configuration information has a specific value associated with it. Different levels of volatility. Your JDBC connection string is probably NOT going to change that often. However, the number of application servers might be at different amounts of capacity based on some dynamic external factor.

  • Your environment is dynamic and so should be your response

This is where I probably get some pushback. Just as one of the goals of "devops" was to deal with what Jesse Robbins described today as misalignment of incentives, there's an internal struggle where some values are simply fluctuating in near real time. This is what we're trying to address.

  • It is not plug and play

One thing that Chef and Puppet do very well is that you can, with next to no change to your systems, predefine how something should look or behave and have those tools "make it so".

With these realtime/dynamic configuration atoms, that isn't enough: your application needs to be aware of them and react to them intelligently.
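To make the "know your configuration" point a bit more concrete, here is a minimal sketch of tagging each configuration atom with its volatility and deciding who should own it. The names (`ConfigAtom`, `Volatility`, `owner`) and the example values are made up for illustration; nothing here comes from Noah or any CM tool.

```python
from dataclasses import dataclass
from enum import Enum


class Volatility(Enum):
    STATIC = "static"        # changes rarely; belongs in the CM tool
    TRANSIENT = "transient"  # changes with load or events


@dataclass(frozen=True)
class ConfigAtom:
    name: str
    value: object
    volatility: Volatility


def owner(atom: ConfigAtom) -> str:
    """Return which system should own this atom (hypothetical labels)."""
    if atom.volatility is Volatility.STATIC:
        return "cm-tool"
    return "coordination-service"


# Example values are invented for the sketch.
atoms = [
    ConfigAtom("jdbc_url", "jdbc:mysql://db.internal/foo", Volatility.STATIC),
    ConfigAtom("app_server_count", 20, Volatility.TRANSIENT),
]

for atom in atoms:
    print(f"{atom.name}: {owner(atom)}")
```

The point is only the split: the system of record keeps the static atoms, and the coordination service handles the ones that move.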

Okay seriously. Get to the point
So let's walk through a scenario where we might implement this ad-hoc philosophy in a way that gives us the power we're seeking.

The base configuration

  • The application server (fooapp) uses memcached, two internal services called "lookup" and "evaluate", and a data store of some kind.
  • "lookup" and "evaluate" are internally developed applications that provide private REST endpoints for providing a dictionary service (lookup) and a business rule parser of some kind (evaluate).
  • Every component's base configuration (including the data source that "lookup" and "evaluate" use) is managed, configured and controlled by puppet/chef.

In a standard world, we store the ip/port mappings for "lookup" and "evaluate" in our CM tool and tag those. When we do a puppet/chef client run, the values for those servers are populated based on the ip/port information of our EXISTING "lookup"/"evaluate" servers.

This works. It's being done right now.

So where's the misalignment?
What do you do when you want to spin up another "lookup"/"evaluate" server? Well you would probably use a bootstrap of some kind and apply, via the CM tool, the changes to those values. However this now means that for this to take effect across your "fooapp" servers you need to do a manual run of your CM client. Based on the feedback I've seen across various lists, this is where the point of contention exists.

What about any untested CM changes (a new recipe, for instance)? I don't want to apply those, but if I run my CM tool, I've now not only pulled in those unintended changes but also forced a bounce of all of my fooapp servers. So as a side effect of scaling capacity to meet demand, I've now reduced my capacity at another point just to make my application aware of the new settings.

Enter Noah
This is where making your application aware of its environment and allowing it to dynamically reconfigure itself pays off.

Looking at our base example now, let's do a bit of architectural work around this new model.

  • My application no longer hardcodes a base list of servers providing "lookup" and "evaluate" services.
  • My application understands the value of a given configuration atom
  • Instead of the hardcoded list, we convert those configuration atoms into something akin to a singleton pattern that points to a bootstrap endpoint.
  • FooApp provides some sort of "endpoint" where it can be notified of changes to the number, ip addresses or urls available for a given one of our services. This can also be proxied via another endpoint.
  • The "bootstrap" location is managed by our CM tool based on some more concrete configuration - the location of the bootstrap server.
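As a rough idea of what the in-application "endpoint" might look like, here is a minimal sketch using only Python's standard library. The JSON payload shape and the names `handle_notification`, `NotifyHandler` and `make_server` are assumptions for illustration; this is not Noah's actual callback format.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Notifications seen so far. A real app would update its server list instead.
received = []


def handle_notification(body: bytes) -> int:
    """Parse a notification body and return the HTTP status to send back.

    Assumes the body is JSON; that shape is hypothetical.
    """
    try:
        event = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400
    received.append(event)
    return 204


class NotifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status = handle_notification(self.rfile.read(length))
        self.send_response(status)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def make_server(port: int = 0) -> HTTPServer:
    """Bind to localhost; port 0 asks the OS for a free port."""
    return HTTPServer(("127.0.0.1", port), NotifyHandler)
```

Running `make_server(8000).serve_forever()` in a background thread would give the application something a coordination service could POST to. Keeping the parsing in a plain function makes it easy to test without opening a socket.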

Inside our application, we're now:

  • Pulling a list of "lookup"/"evaluate" servers from the bootstrap url (e.g. http://noahserver/s/evaluate)
  • Registering a "watch" on the above "path" and providing an in-application endpoint to be notified when they change.
  • Validating at startup that the results of the bootstrap call provide valid information (e.g. doing a quick connection test to each of the servers provided by the bootstrap lookup or a subset thereof)
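The pull and validate steps above could be sketched roughly like this. The response shape (a JSON list of host/port objects) is an assumption I'm making for the example, not Noah's documented format, and `fetch_servers`, `is_reachable` and `bootstrap` are hypothetical names. Registering the watch is left out because the request it needs depends on the service you use.

```python
import json
import socket
from typing import Callable, List, Tuple
from urllib.request import urlopen

Server = Tuple[str, int]


def fetch_servers(url: str) -> List[Server]:
    """Fetch servers from the bootstrap URL.

    Assumes a JSON body like [{"host": "10.0.0.5", "port": 8080}, ...].
    That shape is hypothetical.
    """
    with urlopen(url, timeout=5) as resp:
        payload = json.load(resp)
    return [(item["host"], int(item["port"])) for item in payload]


def is_reachable(server: Server, timeout: float = 1.0) -> bool:
    """Quick connection test: can we open a TCP connection at all?"""
    try:
        with socket.create_connection(server, timeout=timeout):
            return True
    except OSError:
        return False


def bootstrap(
    url: str,
    fetch: Callable[[str], List[Server]] = fetch_servers,
    check: Callable[[Server], bool] = is_reachable,
) -> List[Server]:
    """Return only the servers that pass the startup check."""
    servers = [s for s in fetch(url) if check(s)]
    if not servers:
        raise RuntimeError(f"no usable servers from {url}")
    return servers
```

Passing `fetch` and `check` in as parameters is a design choice for the sketch: it lets you swap in a different transport or a stricter health check (say, an application-level ping) without touching the startup logic.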

If we dynamically add a new transient "lookup" server, Noah fires a notification to the provided endpoint with the details of the change. The application will receive a message saying "I have a new 'lookup' server available". It will run through some sanity checks to make sure that the new "lookup" server really does exist and works. It then appends the new server to the list of existing (permanent) servers and starts taking advantage of the increase in capacity.
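Here is one way the "sanity check, then append" step might look inside the application. It's a sketch under my own assumptions: `ServerPool` and `on_added` are invented names, and the check is whatever callable you hand it.

```python
import threading
from typing import Callable, List, Tuple

Server = Tuple[str, int]


class ServerPool:
    """Thread-safe list of servers for one service, e.g. "lookup"."""

    def __init__(self, permanent: List[Server], check: Callable[[Server], bool]):
        self._servers = list(permanent)
        self._check = check
        self._lock = threading.Lock()

    def on_added(self, server: Server) -> bool:
        """Handle an "I have a new server" notification.

        Returns True if the server passed the sanity check and is in the pool.
        """
        # Run the check outside the lock because it may be slow.
        if not self._check(server):
            return False
        with self._lock:
            if server not in self._servers:  # ignore duplicate notifications
                self._servers.append(server)
        return True

    def servers(self) -> List[Server]:
        """Return a copy so callers can't mutate the pool by accident."""
        with self._lock:
            return list(self._servers)
```

A notification for a server that fails the check is simply dropped, so a bad or premature announcement never takes capacity away from what the application already has.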

That's it. How you implement the "refresh" and "validation" mechanisms is entirely language specific. This also doesn't, despite my statements previously, have to apply to transient resources. The new "lookup" server could be a permanent addition to my infra. Of course this would have been captured as part of the bootstrapping process if that were the case.

And that's it in a nutshell. All of this is available in Noah and ZooKeeper right now. Noah is currently restricted to http POST endpoints but that will be expanded. ZooKeeper treats watches as ephemeral. Once the event has fired, you must re-register that same watch. With Noah, watches are permanent.
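The difference between one-shot and permanent watches is easier to see in a toy model. The `WatchRegistry` below is an in-memory stand-in I made up for illustration. It is not the ZooKeeper or Noah client API; it only mimics the two behaviours described above.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Callback = Callable[[str], None]


class WatchRegistry:
    """Toy in-memory registry; not a real ZooKeeper or Noah client."""

    def __init__(self, one_shot: bool):
        self._one_shot = one_shot
        self._watches: Dict[str, List[Callback]] = defaultdict(list)

    def watch(self, path: str, callback: Callback) -> None:
        self._watches[path].append(callback)

    def fire(self, path: str) -> None:
        callbacks = list(self._watches[path])
        if self._one_shot:
            self._watches[path].clear()  # the event consumes the watch
        for cb in callbacks:
            cb(path)


one_shot_events = []
one_shot = WatchRegistry(one_shot=True)
one_shot.watch("/s/lookup", one_shot_events.append)
one_shot.fire("/s/lookup")
one_shot.fire("/s/lookup")  # missed: the watch was never re-registered

permanent_events = []
permanent = WatchRegistry(one_shot=False)
permanent.watch("/s/lookup", permanent_events.append)
permanent.fire("/s/lookup")
permanent.fire("/s/lookup")

print(len(one_shot_events), len(permanent_events))
```

With the one-shot style, the usual pattern is to register the watch again from inside the callback, and to re-read the current state at the same time in case something changed in between.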

I hope the above has made sense. This was just a basic introduction to some of the concepts and design goals. There are plenty of OTHER use cases for ZooKeeper alone. So the key takeaways are:

  • Know the value of your configuration data
  • Know when and where to use that data
  • Don't supplant your existing CM tool but instead enhance it.

Hadoop Book (which has some AMAZING detail around ZooKeeper, the technology and use cases)