Pulp and RHN
As I’m having to rebuild our Pulp server at work I thought I’d take a moment to document how to sync content from Red Hat. This is not as straightforward as it sounds and it’s barely documented anyway. The most important thing to know is that you can only sync from the CDN that Red Hat set up at cdn.redhat.com, and to get access to the CDN you must register the server with subscription-manager rather than the old-fashioned rhn_register.
To do that you use:
subscription-manager register
subscription-manager refresh
subscription-manager subscribe --auto
Once the machine is registered you need to find your SSL certs. These can be found in:
/etc/rhsm/ca/redhat-uep.pem
/etc/pki/entitlement/*.pem
Once you have these you can look for the repositories you wish to sync in /etc/yum.repos.d/redhat.repo. With this information in hand we can turn to Pulp:
pulp-admin repo create --id=ops-live-rhel-5-x86_64-os \
  --feed=https://cdn.redhat.com/content/dist/rhel/server/5/5Server/x86_64/os \
  --feed_ca=/etc/rhsm/ca/redhat-uep.pem \
  --feed_key=/etc/pki/entitlement/longnumber-key.pem \
  --feed_cert=/etc/pki/entitlement/longnumber.pem
pulp-admin repo sync --id=ops-live-rhel-5-x86_64-os
That’s all it takes to get your content from RHN. This works best on Pulp running on RHEL6.
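If you have more than one entitlement on the box, picking the matching cert and key by eye gets tedious. Here is a rough Ruby sketch that pairs them up, assuming the "number.pem" and "number-key.pem" naming shown above. The helper name entitlement_pairs is something I made up for illustration; it is not part of Pulp or subscription-manager.

```ruby
# Pair entitlement certificates with their keys, assuming the
# "<number>.pem" / "<number>-key.pem" naming convention.
# entitlement_pairs is a hypothetical helper, not a Pulp or RHSM API.
def entitlement_pairs(paths)
  keys = paths.select { |path| path.end_with?('-key.pem') }
  keys.map do |key|
    cert = key.sub(/-key\.pem\z/, '.pem')
    # Only report a pair when the matching certificate is present too.
    { cert: cert, key: key } if paths.include?(cert)
  end.compact
end

# Example use on a registered machine:
#   entitlement_pairs(Dir.glob('/etc/pki/entitlement/*.pem')).each do |pair|
#     puts "--feed_cert=#{pair[:cert]} --feed_key=#{pair[:key]}"
#   end
```

The output lines can be pasted straight into the pulp-admin repo create command.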
Hiera and creating resources
Note: The network_config{} module we’re using can be found on GitHub.
We’ve had a problem at work for a while and I haven’t been able to think of an elegant and self contained way of solving it. This morning however _rc from #puppet gave me a shove in the right direction and pointed me towards the Puppet function create_resources(). You pass a hash to create_resources() and it makes you a bunch of resources, effectively. I figured this would be an excellent point to introduce Hiera.
The problem we’re trying to solve here is defining multiple NICs per node without a massively complicated manifest. We don’t want hundreds of classes, one per machine, and we can’t easily have one master class because a node may have anywhere from 1 to 100 NICs; we just don’t know up front.
Currently we have a class per role because we’re mostly using this for load balancing purposes. Here’s an example:
class network::cmsmail {
  include network::lvs

  network_config { 'lo:0':
    bootproto => 'static',
    ipaddr    => 'x.x.x.x',
    netmask   => '255.255.255.255',
    network   => 'x.x.x.0',
    broadcast => 'x.x.x.255',
    onboot    => 'yes',
    ensure    => present,
    require   => Class['network::lvs'],
  }

  network_config { 'lo:1':
    bootproto => 'static',
    ipaddr    => 'x.x.x.x',
    netmask   => '255.255.255.255',
    network   => 'x.x.x.0',
    broadcast => 'x.x.x.255',
    onboot    => 'yes',
    ensure    => present,
    require   => Class['network::lvs'],
  }
}
Each class has a number of these. The biggest we currently have contains 9 network_config{} resources. I don’t want to get into the configuration of Hiera as there are numerous great guides and plenty of documentation out there. I’m going to leap into explaining our hierarchy and then we’ll work out how to build the hash for the define we’re going to need.
:hierarchy:
  - hosts/%{fqdn}
  - environment/%{environment}/location/%{location}
  - environment/%{environment}/domain/%{domain}
  - environment/%{environment}/rhel/%{rhelversion}
  - environment/%{environment}/common
  - location/%{location}
  - domain/%{domain}
  - rhel/%{rhelversion}
  - common
This is what we have today. Host-specific data comes first, then data specific to an environment (by location, domain, and version of RHEL, then that environment’s common file), then the same levels without an environment. If all that fails we fall down to common, and the lookup breaks if nothing is found there. Not too complicated, but it gives us a lot of flexibility around giving different results per environment.
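To make the lookup order concrete, here is a toy model of a first-match hierarchy lookup in Ruby. This is not Hiera’s real code, just a sketch of the idea: the data hash stands in for the YAML files on disk, and the function names (interpolate, lookup) are my own.

```ruby
# Toy model of a first-match hierarchy lookup. NOT Hiera's implementation;
# the data hash below stands in for the YAML files a real backend would read.
HIERARCHY = [
  'hosts/%{fqdn}',
  'environment/%{environment}/location/%{location}',
  'environment/%{environment}/domain/%{domain}',
  'environment/%{environment}/rhel/%{rhelversion}',
  'environment/%{environment}/common',
  'location/%{location}',
  'domain/%{domain}',
  'rhel/%{rhelversion}',
  'common'
].freeze

# Replace each %{fact} token with the node's fact value.
def interpolate(template, facts)
  template.gsub(/%\{(\w+)\}/) { facts.fetch(Regexp.last_match(1), '') }
end

# Walk the hierarchy top to bottom; the first source defining the key wins.
def lookup(key, facts, data)
  HIERARCHY.each do |template|
    source = data[interpolate(template, facts)]
    return source[key] if source && source.key?(key)
  end
  raise KeyError, "no value found for #{key}"
end
```

With a value for a key in both common and environment/prod/common, a node in the prod environment gets the environment-specific one, and every other node falls through to common.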
To begin with I’ve tried to make a test hash. I ran into the problem that our interface names include a colon, which YAML uses to separate keys from values, so I’m guessing and hoping I can wrap the names in quotes:
---
network:
  "lo0:0":
    ipaddr: '192.168.100.1'
    netmask: '255.255.255.255'
    network: '192.168.100.0'
    broadcast: '192.168.100.255'
    require: 'Class[network::lvs]'
  "lo0:1":
    ipaddr: '192.168.100.2'
    netmask: '255.255.255.255'
    network: '192.168.100.0'
    broadcast: '192.168.100.255'
    require: 'Class[network::lvs]'
To test this I load up ‘irb’ and do the following:
require 'yaml'
require 'pp'
# load_file opens, parses, and closes the file in one call
a = YAML.load_file('./host.yaml')
pp a
This gives me the output:
irb(main):005:0> pp a
{"network"=>
  {"lo0:1"=>
    {"require"=>"Class[network::lvs]",
     "broadcast"=>"192.168.100.255",
     "network"=>"192.168.100.0",
     "netmask"=>"255.255.255.255",
     "ipaddr"=>"192.168.100.2"},
   "lo0:0"=>
    {"require"=>"Class[network::lvs]",
     "broadcast"=>"192.168.100.255",
     "network"=>"192.168.100.0",
     "netmask"=>"255.255.255.255",
     "ipaddr"=>"192.168.100.1"}}}
=> nil
To put this to the test however we need to make a define in Puppet that we can call from create_resources() that will build these up:
define network::config_nics ($ipaddr, $netmask, $network, $broadcast, $onboot='yes', $require='') {
  network_config { "$name":
    ensure    => present,
    ipaddr    => "$ipaddr",
    netmask   => "$netmask",
    network   => "$network",
    broadcast => "$broadcast",
    onboot    => $onboot,   # use the parameter rather than a hard-coded 'yes'
    bootproto => 'static',
    require   => $require,
  }
}
This lets us pass in the parameters from the hash and can be called multiple times safely because the title of the network_config is “$name” which should be the nic name thanks to the hash (if we got things right!) For testing purposes we’ll make a new class that we know isn’t assigned anywhere:
class network::test {
  $nics = hiera('network')
  create_resources('network::config_nics', $nics)
}
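If it helps to picture what create_resources() does with that hash, here is a toy Ruby model: one resource per top-level key, with the key becoming the resource title. This is only a sketch of the behaviour, not Puppet’s implementation, and expand_resources is a name I made up. The optional defaults argument mirrors the defaults hash that create_resources() accepts as a third argument.

```ruby
require 'yaml'

# A cut-down version of the hash from host.yaml, with quoted "lo0:N" keys.
NICS = YAML.load(<<~DOC)
  "lo0:0":
    ipaddr: '192.168.100.1'
    netmask: '255.255.255.255'
  "lo0:1":
    ipaddr: '192.168.100.2'
    netmask: '255.255.255.255'
DOC

# Toy model of create_resources(): each key becomes a resource title and its
# value becomes that resource's parameters. Hypothetical helper, not Puppet code.
def expand_resources(type, hash, defaults = {})
  hash.map do |title, params|
    { type: type, title: title, params: defaults.merge(params) }
  end
end

resources = expand_resources('network::config_nics', NICS, 'onboot' => 'yes')
resources.each { |r| puts "#{r[:type]} { '#{r[:title]}': #{r[:params]} }" }
```

Because each title comes from a distinct hash key, the define is declared once per NIC and the titles never collide.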
After having built a test machine and attaching network::test I unfortunately got:
err: Failed to apply catalog: Could not find dependency Class[Network::Lvs] for Network::Config_nics[lo0:0]
This should be fixable by modifying our definition to include:
if $require != '' {
  include $require
}
However we quickly run into problems:
err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class Class[network::lvs] for test at /etc/puppet/environments/common/network/manifests/config_nics.pp:5 on node test
warning: Not using cache on failed catalog
At this point I realize that I’m trying to do “include Class[network::lvs]”. We’ll refactor the require part of the hash to be class_require and just take a class name. In addition to this we’ll need to modify the definition. Before we had a require => statement that always applied but we don’t want to end up with “require => Class[]”. This leads to:
---
network:
  "lo0:0":
    ipaddr: '192.168.100.1'
    netmask: '255.255.255.255'
    network: '192.168.100.0'
    broadcast: '192.168.100.255'
    class_require: 'network::lvs'
  "lo0:1":
    ipaddr: '192.168.100.2'
    netmask: '255.255.255.255'
    network: '192.168.100.0'
    broadcast: '192.168.100.255'
    class_require: 'network::lvs'
define network::config_nics ($ipaddr, $netmask, $network, $broadcast, $onboot='yes', $class_require='') {
  if $class_require != '' {
    include $class_require
    # Apply the required class before any network_config resource.
    Class["$class_require"] -> Network_config <| |>
  }

  network_config { "$name":
    ensure    => present,
    ipaddr    => "$ipaddr",
    netmask   => "$netmask",
    network   => "$network",
    broadcast => "$broadcast",
    onboot    => $onboot,   # use the parameter rather than a hard-coded 'yes'
    bootproto => 'static',
  }
}
Another puppet run and:
notice: /Firewall[005 - ntp]/source: source changed '127.0.0.1' to '127.0.0.1/32'
notice: Firewall[005 - ntp](provider=iptables): Properties changed - updating rule
notice: /Stage[main]/Network::Test/Network::Config_nics[lo0:1]/Network_config[lo0:1]/ensure: created
notice: /Stage[main]/Network::Test/Network::Config_nics[lo0:0]/Network_config[lo0:0]/ensure: created
notice: Finished catalog run in 34.18 seconds
This pretty much wraps it up for now. We can have an arbitrary number of NICs configured with a single define in Puppet by simply passing hashes via Hiera. We could just as easily target environments or other fact criteria rather than specific hosts, to automatically apply multiple NIC configurations to entire pools of servers.
Refactoring Jenkins
I’ve recently been working on building out Puppet manifests for Jenkins and remote builds, and while this is a pretty basic example of a module I thought it would be an excellent place to start for this blog. The aim of this blog is to introduce real world usage of Puppet for the average sysadmin, rather than barraging you with incredibly slick and sophisticated examples when you’re still trying to figure out how to push a file with Puppet.
As a result let me introduce two modules: Jenkins and Buildbot. As you can imagine, the Jenkins module is for the Jenkins server and then Buildbot is for all the remote build slaves. These modules have grown organically and were not all pre-planned. The buildbot came after and resulted in a whole bunch of compromises so that the buildbot servers have access to the same things that Jenkins does. We have a lot of duplication between the two modules and some really messy code. In this blog I’m going to hopefully refactor some of this and show you some real world Puppet code.
To begin with an overview of the manifests:
jenkins/manifests/init.pp – Every module needs an init.pp and this one is no different.
jenkins/manifests/server.pp – The bulk of the manifest as it currently stands.
jenkins/manifests/ssh_key.pp – Code for ensuring SSH private keys are in place.
buildbot/manifests/init.pp – All the code for the buildbot so far.
I’ve checked this code into https://github.com/apenney/jenkins-refactor. While refactoring this code I’ll actually be working in the repo at work but I’ll checkpoint along the way and push that to the public repo.
Jenkins
Let’s dive in and look at the jenkins module first. We’ll skip init.pp as it only contains a skeleton of “class jenkins { }”. If we skip to the server.pp file we finally get something:
class jenkins::server {
  include jenkins::ssh_key

  firewall { '010 jenkins inbound':
    proto  => 'tcp',
    dport  => '8080',
    action => 'accept',
  }

  yumrepo { 'jenkins':
    name     => 'jenkins',
    descr    => 'jenkins',
    baseurl  => 'https://pulp.sys.perimeterusa.com/pulp/repos/jenkins/',
    enabled  => '1',
    gpgcheck => '1',
  }

  package { [ 'jenkins', 'createrepo', 'java-1.6.0-sun', 'java-1.6.0-sun-devel', 'git', ]:
    ensure  => present,
    require => [ Yumrepo['jenkins'], Redhat::Rpmkey['jenkins'] ],
  }

  redhat::rpmkey { 'jenkins': }

  service { 'jenkins':
    enable     => true,
    ensure     => 'running',
    hasrestart => true,
    require    => [ Package['jenkins'], Package['java-1.6.0-sun'], ],
  }

  # Keystore for the LDAP certificate, required for LDAPS authentication.
  file { '/usr/lib/jvm/jre-1.6.0-sun.x86_64/lib/security/cacerts':
    ensure => present,
    source => 'puppet:///modules/jenkins/cacerts',
  }

  # DISABLED: Only needed to change jenkins_home currently.
  #file { '/etc/sysconfig/jenkins':
  #  ensure  => present,
  #  source  => 'puppet:///modules/jenkins/sysconfig/jenkins',
  #  require => Package['jenkins'],
  #}

  security::sudo_adduser { 'jenkins-client':
    sudotag    => 'NOPASSWD',
    requiretty => 'negate',
  }
}
As you can see, this module is just one massive block of code, mixing dependencies for building together with the code to actually install jenkins. We’re going to start by breaking up Jenkins into separate chunks and then see how much code we can reuse in the build bot.
We’ll begin by just cutting and pasting massive chunks around so that we have:
jenkins/manifests/server.pp
jenkins/manifests/server/install.pp
jenkins/manifests/server/config.pp
jenkins/manifests/server/ssh_key.pp
If we look at install.pp first we see:
class jenkins::server::install {
  yumrepo { 'jenkins':
    name     => 'jenkins',
    descr    => 'jenkins',
    baseurl  => 'https://pulp.sys.perimeterusa.com/pulp/repos/jenkins/',
    enabled  => '1',
    gpgcheck => '1',
  }

  package { [ 'jenkins', 'createrepo', 'java-1.6.0-sun', 'java-1.6.0-sun-devel', 'git', ]:
    ensure  => present,
    require => [ Yumrepo['jenkins'], Redhat::Rpmkey['jenkins'] ],
  }

  redhat::rpmkey { 'jenkins': }
}
If we look at the package section it’s pretty clear some of these items are only needed for builds (createrepo, git, java-1.6.0-sun-devel) and we can move these out of the install module. In addition we are going to break out the repo code. This way when we have a jenkins “dependencies” class we can simply require the repo class to get all our dependencies handled without explicitly listing them out.
After breaking it out we have:
class jenkins::server::install {
  include jenkins::server::repo
  include jenkins::server::dependencies

  package { [ 'jenkins', 'java-1.6.0-sun' ]:
    ensure  => present,
    require => Class['jenkins::server::repo'],
  }
}

class jenkins::server::repo {
  yumrepo { 'jenkins':
    name     => 'jenkins',
    descr    => 'jenkins',
    baseurl  => 'https://pulp.sys.perimeterusa.com/pulp/repos/jenkins/',
    enabled  => '1',
    gpgcheck => '1',
  }

  redhat::rpmkey { 'jenkins': }
}

class jenkins::server::dependencies {
  package { [ 'createrepo', 'git', 'java-1.6.0-sun-devel' ]:
    ensure => present,
  }
}
We now have three smaller classes just to handle the install. This probably looks messier at first than just having a single class but we’ve gained two benefits. We can include and require the dependencies class from buildbot to remove some duplication (without accidentally installing all of jenkins). We can also get the jenkins repo without installing Jenkins. This kind of code reuse becomes all the more important as your modules scale up. When you require resources directly it makes it harder to refactor things in the future. Requiring classes, where possible, hides the implementation details from callers of the class.
We won’t break up config.pp as it’s all specific to jenkins. What we can do is add some requires to classes again to make this more robust:
class jenkins::server::config {
  firewall { '010 jenkins inbound':
    proto  => 'tcp',
    dport  => '8080',
    action => 'accept',
  }

  service { 'jenkins':
    enable     => true,
    ensure     => 'running',
    hasrestart => true,
    require    => Class['jenkins::server::install'],
  }

  # Keystore for the LDAP certificate, required for LDAPS authentication.
  file { '/usr/lib/jvm/jre-1.6.0-sun.x86_64/lib/security/cacerts':
    ensure  => present,
    source  => 'puppet:///modules/jenkins/cacerts',
    require => Class['jenkins::server::install'],
  }

  security::sudo_adduser { 'jenkins-client':
    sudotag    => 'NOPASSWD',
    requiretty => 'negate',
  }
}
We’ll make one final change in jenkins for now, adding a require to ssh_key.pp so that we can be sure Jenkins is installed (and therefore a user exists) before trying to add an ssh key:
class jenkins::ssh_key {
  ssh::private_key { 'id_rsa':
    user        => 'jenkins',
    location    => '/var/lib/jenkins/.ssh',
    keylocation => 'jenkins',
    require     => Class['jenkins::server::install'],
  }
}
Buildbot
Now we’ll dive into the buildbot module. This is supposed to include all the required functionality for a remote machine to check out and build things that are in Jenkins. When setting this up I realized that much of the code must be the same as jenkins considering you need git, java dependencies, and so forth. Here is the code as it is now:
class buildbot {
  # We need the ssh key from jenkins to be able to check out repos.
  include jenkins::ssh_key

  # Base dependencies that all buildbots will need.
  package { ['rpm-build', 'gcc', 'make', 'git', 'java-1.6.0-sun', 'java-1.6.0-sun-devel', 'rubygems', 'ruby-devel', ]:
    ensure => present,
  }

  package { 'fpm':
    ensure   => latest,
    provider => 'gem',
    require  => [ Package['ruby-devel'], Package['rubygems'] ],
  }

  file { '/srv/jenkins':
    ensure => directory,
    owner  => 'jenkins',
    group  => 'jenkins',
  }

  file { '/var/lib/jenkins/.ssh/config':
    ensure  => present,
    source  => 'puppet:///modules/buildbot/ssh_config',
    require => User['jenkins'],
  }

  group { 'jenkins':
    ensure => present,
    gid    => '497',
  }

  user { 'jenkins':
    ensure     => present,
    uid        => '498',
    gid        => '497',
    comment    => 'Jenkins Continuous Build server',
    home       => '/var/lib/jenkins/',
    shell      => '/bin/bash',
    password   => '*',
    managehome => true,
    require    => Group['jenkins'],
  }

  ssh_authorized_key { 'buildbot':
    ensure  => present,
    key     => '',
    type    => 'ssh-rsa',
    user    => 'jenkins',
    require => User['jenkins'],
  }
}
We’ll start by removing everything that we know is in the jenkins module and include that instead. Looking over the package statement at the top, I notice it requires the java-1.6.0-sun package. This is currently included in the jenkins::server::install class as it’s needed just to install jenkins. If we move it to the dependencies class then we can just include that in buildbot.
This gives us the following package statement in buildbot:
package { ['rpm-build', 'gcc', 'make', 'rubygems', 'ruby-devel', ]:
  ensure => present,
}
And then jenkins::server::dependencies gets:
package { [ 'createrepo', 'git', 'java-1.6.0-sun-devel', 'java-1.6.0-sun' ]:
  ensure => present,
}
And jenkins::server::install gets:
package { 'jenkins' ]:
  ensure  => present,
  require => [ Class['jenkins::server::repo'], Class['jenkins::server::dependencies'] ],
}
Next I’m going to break up the main buildbot into a bunch of separate classes for each kind of thing we can build. I’m going to be including them all on my build servers for now but in the future we may have boxes dedicated to different kinds of builds.
This leads us to:
class buildbot::base {
  package { ['gcc', 'make' ]:
    ensure => present,
  }

  file { '/srv/jenkins':
    ensure  => directory,
    owner   => 'jenkins',
    group   => 'jenkins',
    require => User['jenkins'],
  }

  file { '/var/lib/jenkins/.ssh/config':
    ensure  => present,
    source  => 'puppet:///modules/buildbot/ssh_config',
    require => User['jenkins'],
  }

  group { 'jenkins':
    ensure => present,
    gid    => '497',
  }

  user { 'jenkins':
    ensure     => present,
    uid        => '498',
    gid        => '497',
    comment    => 'Jenkins Continuous Build server',
    home       => '/var/lib/jenkins/',
    shell      => '/bin/bash',
    password   => '*',
    managehome => true,
    require    => Group['jenkins'],
  }
}

class buildbot::rpm {
  package { 'rpm-build':
    ensure => present,
  }

  package { 'fpm':
    ensure   => latest,
    provider => 'gem',
    require  => Class['buildbot::ruby'],
  }
}

class buildbot::ruby {
  package { [ 'rubygems', 'ruby-devel' ]:
    ensure => present,
  }
}
Having done this we notice we’re defining a jenkins user and group in the buildbot class (because Jenkins isn’t being installed there). Does this make sense? It feels like it should live in the jenkins module to me. We’re going to create jenkins::server::account and move it to there and then just include it in the buildbot. This way we automatically gain control over the user and group on the jenkins server too rather than relying on the jenkins installer. We’ll even require them before we let jenkins install to ensure our uid/gid is in sync.
Here’s our jenkins::server::install package{}:
package { 'jenkins' ]:
  ensure  => present,
  require => Class['jenkins::server::account', 'jenkins::server::repo', 'jenkins::server::dependencies'],
}
And here’s buildbot::base:
class buildbot::base {
  package { ['gcc', 'make' ]:
    ensure => present,
  }

  file { '/srv/jenkins':
    ensure  => directory,
    owner   => 'jenkins',
    group   => 'jenkins',
    require => Class['jenkins::server::account'],
  }

  file { '/var/lib/jenkins/.ssh/config':
    ensure  => present,
    source  => 'puppet:///modules/buildbot/ssh_config',
    require => Class['jenkins::server::account'],
  }
}
Once again we’re able to reduce our duplication and bring the code to a single place. This way if we need to change our uid we’re not stuck doing it in multiple locations. This is a good point to check in all the work we’ve done (you’ll find this on the above repo listed as the first round of refactoring). In reality I’ve been checking in to my repo in much smaller chunks just in case. This is a good point to do a round of testing on our real servers.
Before I go any further I want to point out that this step normally takes fixes. Modules in other blogs seem to erupt from the ground with perfection but mine don’t. Here are the things I had to fix:
* Forgot to change class jenkins::ssh_key to jenkins::server::ssh_key
* Forgot to remove the ] in package { 'jenkins' ]: in server::install
It’s worth running the following on your code before committing it:
find . -name '*.pp' | xargs -n 1 -t puppet parser validate
The output from my jenkins server:
info: Applying configuration version '1337281489'
err: /Stage[main]/Jenkins::Server::Account/User[jenkins]/home: change from /var/lib/jenkins to /var/lib/jenkins/ failed: Could not set home on user[jenkins]: Execution of '/usr/sbin/usermod -d /var/lib/jenkins/ jenkins' returned 8: usermod: user jenkins is currently logged in
notice: /Stage[main]/Jenkins::Server::Account/User[jenkins]/shell: shell changed '/bin/false' to '/bin/bash'
notice: /Stage[main]/Jenkins::Server::Account/User[jenkins]/password: changed password
notice: /Stage[main]/Jenkins::Server::Install/Package[jenkins]: Dependency User[jenkins] has failures: true
warning: /Stage[main]/Jenkins::Server::Install/Package[jenkins]: Skipping because of failed dependencies
notice: /Stage[main]/Jenkins::Server::Ssh_key/Ssh::Private_key[id_rsa]/Exec[mkdir -p /var/lib/jenkins/.ssh]: Dependency User[jenkins] has failures: true
warning: /Stage[main]/Jenkins::Server::Ssh_key/Ssh::Private_key[id_rsa]/Exec[mkdir -p /var/lib/jenkins/.ssh]: Skipping because of failed dependencies
notice: /File[/var/lib/jenkins/.ssh/id_rsa]: Dependency User[jenkins] has failures: true
warning: /File[/var/lib/jenkins/.ssh/id_rsa]: Skipping because of failed dependencies
notice: /File[/usr/lib/jvm/jre-1.6.0-sun.x86_64/lib/security/cacerts]: Dependency User[jenkins] has failures: true
warning: /File[/usr/lib/jvm/jre-1.6.0-sun.x86_64/lib/security/cacerts]: Skipping because of failed dependencies
notice: /Stage[main]/Jenkins::Server::Config/Service[jenkins]: Dependency User[jenkins] has failures: true
warning: /Stage[main]/Jenkins::Server::Config/Service[jenkins]: Skipping because of failed dependencies
notice: Finished catalog run in 24.66 seconds
Quite a lot of errors! Only the first one is a real failure; everything after it was skipped because it depends on User[jenkins]. We can eliminate it by changing our jenkins::server::account to just use /var/lib/jenkins without the trailing /. This should fix up the rest.
info: Applying configuration version '1337278790'
notice: Finished catalog run in 24.35 seconds
Now to the buildbot:
err: Failed to apply catalog: Could not find dependency Class[Jenkins::Server::Install] for Ssh::Private_key[id_rsa] at /etc/puppet/environments/common/jenkins/manifests/server/ssh_key.pp:8
All we need to do here is fix this to point to account instead of install:
ssh::private_key { 'id_rsa':
  user        => 'jenkins',
  location    => '/var/lib/jenkins/.ssh',
  keylocation => 'jenkins',
  require     => Class['jenkins::server::account'],
}
That gives us a clean run:
info: Applying configuration version '1337282301'
notice: Finished catalog run in 24.27 seconds
And with that we’re all done for now. There are other improvements that could be made but our primary goal here was to cut out as much duplication as possible and break out some huge modules into smaller ones.
My thoughts about Networking at scale
Designing networks always reminds me about traffic control, with a few adjustments:
- Packets can only travel with constant speed from point A to point B.
- Packets cannot stand still on the freeway; they have to buffer in city parking.
- Red lights are forbidden.
- Roundabouts divert traffic with dynamic road signs (we call them routers).
- If you need a roundabout, make sure it can handle the traffic.
Now given these (and I probably missed a few), would you:
- Place a roundabout on the middle of 405 between Santa Monica and Highway 5?
- Have all traffic go to downtown LA before you go anywhere else?
- Have everybody go to work at 9 and come back at 5 (wait… we do that:))?
- Widen highway 405 over the Sepulveda pass (wait…we are doing that:))?
- Use a top of rack switch for routing when what you need is a Juniper MX80 or an MLX4e?
So here is my point:
- Rsync via cron at one fixed time is a bad idea (that is everybody starting at 9 and coming home at 5); spread out the load over time.
- Distributed spine: the fastest path between A and B, and between A and D, should not go over C.
- Widen network bandwidth at high traffic points (e.g. go 40 gig between 405 and Highway 5).
- Don’t do Layer 3 when all you need is simple Layer 2 (e.g. no roundabouts on 405).
- If you are going to use buffers (city parking), make sure they are not overflowing (e.g. use large buffers).
- Design your network for peak, although you took away the 9-5. Most people want to have lunch at work!
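As a small illustration of the first point, here is a Ruby sketch that gives each host its own minute within the hour for a cron job, so a fleet of rsync jobs does not all fire at the same moment. The helper names (splay_minutes, cron_line) and the CRC32-of-hostname approach are my own assumptions, just one simple way to get a stable spread.

```ruby
require 'zlib'

# Deterministic start offset (in minutes) for a host, so that jobs spread out
# across the window instead of all firing at minute 0.
# splay_minutes and cron_line are hypothetical helper names.
def splay_minutes(hostname, window_minutes = 60)
  raise ArgumentError, 'window must be 1..60 minutes' unless (1..60).cover?(window_minutes)

  Zlib.crc32(hostname) % window_minutes
end

# Build an hourly crontab line that starts at the host's own minute.
def cron_line(hostname, command, window_minutes = 60)
  format('%d * * * * %s', splay_minutes(hostname, window_minutes), command)
end
```

The same host always lands on the same minute, which keeps the schedule predictable while still flattening the peak.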
There is a lot of math missing here. But I think you get the general idea.
You can find lots of inspiring white papers about traffic flow (although a lot of it does not apply).
Over 190,000 metrics in Graphite.
So, as many of you know, I am more a developer than an ops guy. Doing development for this space for years has led me to a few conclusions. Working with people like Grant Kushida, who in my opinion is notoriously thorough when it comes to metrics, is also incredibly inspiring. I should also mention Juan Paul Ramirez from Shopzilla, and John Willis and Damon Edwards, who during this time inspired me and my team to push forward.
- Metrics are implemented as part of your code, i.e. pulling metrics out of a system is not a profession.
- The cost of getting metrics should be negligible (optimally “fire and forget”).
My first experience with real-time business metrics was at Yahoo. Anthony Molinaro is the inventor of LWES (today open source) http://sourceforge.net/projects/lwes/. At Yahoo we filled the graphs with lots of metrics straight from the code so we could get real-time business behavior. The design of LWES is simple key value and the model of course “fire and forget” over UDP. You can read more about LWES at lwes.org as well. LWES is also extremely powerful to trace a query through multiple back-end systems in real-time. Again: “Simple is Powerful”.
At Rubicon we went with Graphite. From an ops perspective it had a lot of the stuff we needed from the “get go”. We could not start by going to the developers and saying “implement this”. Instead we wrote bash scripts etc. to get the data out of the system and on to the graphite server. We also grabbed all our nagios metrics. Very quickly we had a system where we did not have to develop new rrds to move forward. Growth in the number of metrics did not come without cost. Quickly the graphite server needed more resources. If there is one negative thing about graphite it would be the data store. There are probably better ways of designing this. But so far we just throw more hardware at it (and yes, I do think that sometimes that is the right answer).
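To show how little code “fire and forget” needs, here is a Ruby sketch that formats a datapoint in Graphite’s plaintext format (one line of path, value, and timestamp) and sends it over UDP. The metric path, host name, and port are placeholders, and sending over UDP assumes the carbon daemon has its UDP listener enabled; this is a sketch, not what we run in production.

```ruby
require 'socket'

# Graphite's plaintext protocol is one line per datapoint:
#   "<metric path> <value> <unix timestamp>\n"
def graphite_line(path, value, time = Time.now)
  "#{path} #{value} #{time.to_i}\n"
end

# Fire and forget: send a single UDP datagram and do not wait for any reply.
# host and port are placeholders; carbon must be configured to listen on UDP.
def send_metric(path, value, host: 'graphite.example.com', port: 2003)
  socket = UDPSocket.new
  socket.send(graphite_line(path, value), 0, host, port)
ensure
  socket&.close
end

# Example use (hypothetical metric name):
#   send_metric('revenue.node1.dollars_per_minute', 42)
```

Since nothing waits for an acknowledgement, the cost to the application is close to zero, at the price of occasionally losing a datapoint.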
I think my point here is that you can get visibility into business metrics straight from your application log files, and with very simple means and uncomplicated systems give lots of value to your business. Also, as I know for a fact, when you see a node doing fewer dollars per minute, you know you need to do something. When you have these metrics it will be clear to the business and to the developers that they are instrumental for your controlled development strategy.
At Rubicon we now have more than 190,000 metrics (Sean in my team lost count after that). But it is now part of, and not a result of, development. What I mean is people are thinking about how they can use it as part of simple feedback loops. I am not sure how efficient that would be, but it shows that it is part of the culture in development. This is more important than anything else.
“Let’s stop tailing log files on production servers, it is so 1994”
Interop in Las Vegas 2012
I do enjoy Interop in Las Vegas, and not just because it is in Vegas ;)
Things that stuck out this year:
- Brocade was nowhere to be found (odd)…
- Dell did a good job both on servers and networking
- Most hosting providers are now doing private cloud (not bad..)
- I still think the 10 gig copper side is a bit thin on switches
- HP seems more confused than usual? Or is it just me.
- Arista was bigger than ever
There was a lot to learn. And as always you find something that you did not know about before. I think my personal favorite is http://www.symmetricom.com/ NTP time for the data center. I like the idea of having an antenna on top of the roof. But seriously, the way that we do ntpd currently allows about 1.5 ms of drift on the servers. In real time trading that can be important.
Also I like Chenbro chassis, but I think that is just me. I really can’t resist these kinds of devices. I remember a time when nothing like this could be found and the PC was not meant for the data center. I ended up building my own 2U units for SAAB Space in Sweden. Lots of fun and mechanical design. Now you can just buy a chassis and get 80 TB net on a device (with some assembly of course). Just maybe it is time for “IKEA for the data center”?
DevOps
Hi,
Jan Gelin, VP of Technical Operations at Rubiconproject. http://www.rubiconproject.com!
This blog is intended to give guidance and reveal some of the things going on in my world.
So what is my world?
The TechOps team at the Rubiconproject!
So where did I come from?
I feel like an old timer in the industry. DEC, AltaVista, Yahoo, Fox, it’s been a fun ride so far. But this one is my biggest adventure. Why, you might ask? Technology and Open Source have come a long way from where we used to burn VGA connectors writing our own drivers for AlphaServers. But that was fun too! I feel like we are in an era where Open Source makes sense and there is a business model with a lot of things based on support and service level. Rubicon as a company is not averse to supporting the companies we work with and run together with.
I have done a lot of development in my past. But here at Rubicon is where development and operations come together, in terms of methodology and approach.
Rubicon currently has 1600 servers in 5 data centers, all managed with puppet, func, cfengine, and graphite. The first time I heard about graphite was from Shopzilla. I liked the idea of having timestamp, key, value as the input. That would make it easy for developers to create business metrics. Today we have more than 190,000 metrics in our system. It requires a lot of memory in the boxes, believe me. Thanks to Orbitz for making this piece of technology available to us.
So let us start with the term “DevOps“. To me “DevOps” is a methodology and not a team or something that you can touch. We started to be devops 2 years ago now. And it is like Buddhism, you never graduate. Our road has been rocky and filled with challenges. Although I have to say we have come a long way.
To start things off here is a little video we made about what I call Network Operations Center (NOC). It is not your usual NOC. Here is where people get started in Infrastructure Development. https://vimeo.com/39365287
BTW all those TV’s
Powered by 2 Power Macs with the maximum number of DVI connectors you can have. Makes a great display wall for 1/4 of the cost :). One little word of advice: order the DVI converters with USB power, or it will not work: http://store.apple.com/us/product/MB571Z/A#overview
Devops Areas – Codifying devops practices
While working on the Devops Cookbook with my fellow authors Gene Kim, John Willis, and Mike Orzen, we are gathering a lot of "devops" practices. For some time we struggled with structuring them in the book. I figured we were missing a mental model to relate the practices/stories to.
This blogpost is a first stab at providing a structure to codify devops practices. The wording, descriptions are pretty much work in progress, but I found them important enough to share to get your feedback.
Devops in the right perspective
As you probably know by now, there are many definitions of devops. One thing that occasionally pops up is that people want to change the name to extend it to other groups within the IT area: star-ops, dev-qa-ops, sec-ops, ... From the beginning I think people involved in the first devops thinking had the idea to expand the thought process beyond just dev and ops (but a name like bus-qa-sec-net-ops would not be that catchy :).
I've started referring to:
- devops : collaboration,optimization across the whole organisation. Even beyond IT (HR, Finance...) and company borders (Suppliers)
- devops 'lite' : when people zoom in on 'just' dev and ops collaboration.
As rightly pointed out by Damon Edwards, devops is not about a technology, devops is about a business problem. The Theory of Constraints tells us to optimize the whole and not the individual 'silos'. For me that whole is the business-to-customer problem, or in lean speak, the whole value chain. Bottlenecks and improvements could happen anywhere and have a local impact on the dev and ops part of the company.
So even if your problem exists in dev or ops, or somewhere between, the optimization might need to be done in another part of the company. As a result, describing prescriptive steps to solve the 'devops' problem (if there is such a problem) is impossible. The problems you're facing within your company could be vastly different, and the solutions to your problem might have different effects/needs.
If we can't be prescriptive, we can at least gather the practices people have been using to overcome similar situations. I've always encouraged people to share their stories so other people could learn from them (one of the core reasons devopsdays exists). This helps in capturing practices; I'll leave open whether they are good or best practices.
Currently a lot of the stories/practices are zooming in on areas like deployment, dev and ops collaboration, metrics etc. (Devops Lite). This is a natural evolution of having dev and ops in the term's name and given the background of the people currently discussing the approaches. I hope that in the future this discussion expands to other company silos too: for instance, synergizing HR and Devops (Spike Morelli) or relating our metrics to financial reporting.
Another thing to be aware of is that a system/company is continuously in flux: whenever something changes in the system it can have an impact. So you can't take for granted that problems and bottlenecks will not re-emerge after some time. It needs continuous attention. That will be easier if you get closer to a steady state, but still, devops, like security, is a journey, not an end state.

Beyond just dev and ops
Let's zoom in on some of the practices that are commonly discussed: the direct field between 'dev' and 'ops'.
In most cases, 'dev' actually means 'project' and 'ops' represents 'production'. Within projects we have methodologies like Scrum, Kanban, ... and within operations ITIL, Visible Ops, ... Both parts have been extending their methodology over the years: from the dev perspective this has led to 'Continuous Delivery', and from the Ops side ITIL was extended with Application Lifecycle Management (ALM). They both worked hard on optimizing their individual part of the company and less on integration with other parts. Those methodologies had a hard time solving a bottleneck that was outside their 'authority'. I think this is where devops kicks in: it seeks active collaboration between different silos so we can start seeing the complete system and optimize where needed, not just in individual silos.

Devops Areas
In my mental model of devops there are four 'key' areas:
- Area 1 : Extend delivery to production (Think Jez Humble) : this is where dev and ops collaborate to improve anything on delivering the project to production
- Area 2 : Extend Operation to project (Think John Allspaw) : all information from production is radiated back to the project
- Area 3 : Embed Project(Dev) into Operations : when the project takes co-ownership of everything that happens in production
- Area 4 : Embed Production(Ops) into Project : when operations are involved from the beginning of the project
In each of these areas there will be a bi-directional interaction between dev and ops, resulting in knowledge exchange and feedback.
Depending on where your most pressing 'current' bottleneck manifests itself, you may want to address things in different areas. There is no need to first address things in Area 1 and only then in Area 2. Think of them as pressure points that you can stress, but that require balanced pressure.
Area 1 and Area 2 tend to be heavier on the tools side, but are not strictly tools focused. Area 3 and Area 4 will be more related to people and cultural changes as their 'reach' is further down the chain.
When visualized in a table this gives you:

As you can see:
- the DEV and OPS part keep having their own internal processes specific to their job
- the two processes are becoming aligned and the areas extend both DEV and OPS to production and projects
- it's almost like a double loop with area1 and area2 as the first loop and area3 and area4 as the second loop
Note 1: these areas definitely need 'catchier' names to make them easier to remember.
Note 2: Ben Rockwood's post on "The Three Aspects of Devops" already lists 3 aspects, but I think the areas make it more specific.

Area Layers
In each of these areas, we can interact at the traditional 'layers': tools, process, people.
So whenever I hear a story, I try to relate its practice to one of the areas described above and to the layer it's addressing. Practices can have an impact at different layers, so I see them as 'tags' to quickly label stories. Another benefit is that whenever you look at an area, you can ask yourself what practices you can apply to improve each of these layers. To have a maximum impact, it's clear that the approach needs to address all three layers.
The ultimate devops tool would support the whole of people and process in all of these areas, not just in Area 1 (deployment) or Area 2 (monitoring/metrics). Therefore a devops toolchain with different tools interacting in each of the areas makes more sense. Also, the tool by itself doesn't make it a devops tool: configuration management systems like Chef and Puppet are great, but when applied in Ops only they don't help our problem much. Of course Ops gets infrastructure agility, but it isn't until it is applied to delivery (for instance to create test and development environments) that it becomes 'devops'. This shows that the mindset of the person applying the tool makes it a devops tool, not the tool by itself.
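To make the 'tags' idea a bit more concrete, here is a minimal Python sketch of how stories could be labelled with an area and one or more layers. The class names, the helper `stories_for` and the two example stories are my own illustration and are not part of any existing tool.

```python
from dataclasses import dataclass
from enum import Enum


class Area(Enum):
    EXTEND_DELIVERY = 1    # Area 1: extend delivery to production
    EXTEND_OPERATIONS = 2  # Area 2: extend operations to project
    EMBED_DEV = 3          # Area 3: embed project (dev) into operations
    EMBED_OPS = 4          # Area 4: embed production (ops) into project


class Layer(Enum):
    TOOLS = "tools"
    PROCESS = "process"
    PEOPLE = "people"


@dataclass(frozen=True)
class Story:
    title: str
    area: Area
    layers: frozenset  # a practice can touch more than one layer


def stories_for(stories, area, layer):
    """Return the stories tagged with the given area and layer."""
    return [s for s in stories if s.area is area and layer in s.layers]


# Hypothetical example stories, only here to show the tagging.
stories = [
    Story("Automated deploys to a test environment",
          Area.EXTEND_DELIVERY, frozenset({Layer.TOOLS})),
    Story("Developers share the on-call rotation",
          Area.EMBED_DEV, frozenset({Layer.PEOPLE, Layer.PROCESS})),
]

for story in stories_for(stories, Area.EMBED_DEV, Layer.PEOPLE):
    print(story.title)
```

Looking up an area and a layer then becomes a simple filter, which is about as much structure as a label needs.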

Area Maturity Levels
Now that we have the areas and layers identified, we want to track progress as we start solving our problems and improving things.
Adrian Cockcroft suggested using CMMI levels for devops:
CMMI levels allow you to quantify the 'maturity' of your process. That addresses only one layer (although an equally important one). In a nutshell CMMI describes the different levels as:
- Initial : Unpredictable and poorly controlled process and reactive nature
- Managed : Focused on project and still reactive nature
- Defined : Focused on organization and proactive
- Quantitatively Managed : Measured and controlled approach
- Optimizing : Focus on Improvement
All these levels could be applied to dev, ops or devops combined. It gives you an idea of what level your process is at while you are optimizing in an area.
An alternative way of expressing maturity levels is used by the Continuous Integration Maturity Model. It puts a set of practices in levels of maturity: (industry consensus)
- Intro : using source control ...
- Novice : builds triggered by commit ...
- Intermediate : Automated deployment to testing ..
- Advanced : Automated Functional testing ...
- Insane : Continuous Deployment to Production ...
Instead of focusing on the process only, it could be applied to a set of tools, process or people practices. What people consider the most advanced would get the highest maturity level.
Practices, Patterns and principles
A practice could be anything from an anecdotal item to a systemic approach. Similar practices can be grouped into patterns to elevate them to another level. Similar to the Software Design Patterns we can start grouping devops practices in devops patterns.
Practices and patterns will rely on principles, and it's these underlying principles that will guide you on when and how to apply the pattern or practice. These principles can be 'borrowed' from other fields like Lean, Systems Theory, Human Psychology etc. The principles are what the agile manifesto is about, for example.
Slowly we will turn the practices -> patterns -> principles.
Note: I'm wondering whether new principles will emerge from devops itself, or whether it will be about applying existing principles from a new perspective.
A few practical examples:
Below are a few example 'practices' codified in a standard template. The practices/patterns/principles are not yet very well described. The point is more that this can serve as a template to codify practices.

Area Indicators
The idea is to list metrics/indicators that can be tracked. The numbers as such might not be too relevant, but the rate of change would be. This is similar to tracking the velocity of storypoints or the mean time to recovery.
Note: I'm scared of presenting these as metrics to track, therefore I call them indicators to soften that.
Examples would be:
- Tools Layer : Deploys/Day
- Process Layer : Number of Change Requests/Day
- People Layer : People Involved per deploy
This is not yet fleshed out enough. I'm guessing it will be based on the research I did for my Velocity 2011 Presentation (Devops Metrics).
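As a rough illustration of tracking the rate of change instead of the raw number, here is a small Python sketch. The function name `rate_of_change` and the sample deploys/day figures are made up for the example; any indicator measured at regular intervals would work the same way.

```python
def rate_of_change(samples):
    """Relative change between consecutive samples of an indicator.

    Returns None for a step where the previous value was 0, because a
    relative change is undefined there.
    """
    changes = []
    for previous, current in zip(samples, samples[1:]):
        if previous == 0:
            changes.append(None)
        else:
            changes.append((current - previous) / previous)
    return changes


# Hypothetical weekly averages for the Tools layer indicator Deploys/Day.
deploys_per_day = [2, 4, 5, 5]
print(rate_of_change(deploys_per_day))
```

A team going from 2 to 4 deploys a day and a team going from 20 to 40 show the same trend here, which is the point: the direction matters more than the absolute number.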
Devops Scorecard
To present progress during your 'devops' journey you can put all these things in a nice matrix, to get an overview on where you are at optimizing at the different layers and areas.
Obviously this only makes sense if you don't lie to yourself, your boss, your customers.
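A scorecard like this could be as simple as a matrix of areas against layers, with a maturity level in each cell. The sketch below is only an assumption of what that might look like: it uses the five CMMI levels listed earlier as the scale, treats any cell that has not been scored as level 1 (Initial), and the scores themselves are invented.

```python
AREAS = ["Area 1", "Area 2", "Area 3", "Area 4"]
LAYERS = ["Tools", "Process", "People"]
LEVELS = {
    1: "Initial",
    2: "Managed",
    3: "Defined",
    4: "Quantitatively Managed",
    5: "Optimizing",
}


def render_scorecard(scores):
    """Render the area x layer matrix as plain text.

    Cells without a score default to 1 (Initial); that default is an
    assumption of this sketch.
    """
    lines = [f"{'':8}" + "".join(f"{layer:>9}" for layer in LAYERS)]
    for area in AREAS:
        cells = "".join(
            f"{scores.get((area, layer), 1):>9}" for layer in LAYERS
        )
        lines.append(f"{area:8}" + cells)
    return "\n".join(lines)


def weakest_cells(scores):
    """Return the (area, layer) cells with the lowest maturity level."""
    full = {(a, l): scores.get((a, l), 1) for a in AREAS for l in LAYERS}
    lowest = min(full.values())
    return sorted(cell for cell, level in full.items() if level == lowest)


# Invented scores, only to show the shape of the matrix.
scores = {("Area 1", "Tools"): 3, ("Area 2", "Tools"): 2}
print(render_scorecard(scores))
print("Lowest level:", LEVELS[min(scores.get(c, 1) for c in weakest_cells(scores))])
```

The weakest cells are a hint at where to look next, not a verdict; the current bottleneck may well sit somewhere else.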

Project Teams, Product Teams and NOOPS
Jez Humble often talks about project teams evolving into product teams: larger silos will split off not by skill, but by the product functionality they are delivering. Splitting teams like that has the potential danger of creating new silos. It's obvious these product teams need to collaborate again. You should treat other product teams as external dependencies, just like other silos. The areas of interaction will be very similar.
You can also see the term NOOPS as working with product teams outside your company, as when you rely on SaaS for certain functions. It's important to integrate in each of the areas not only on the tools layer, but also on the people and process layers. That is something that is often forgotten. Automation and abstraction allow you to go faster, but when things fail or even when changes occur, synchronisation needs to happen.
CAMS and areas
The CAMS acronym (Culture, Automation, Measurement, Sharing) could be loosely mapped onto the areas structure:
- Automation seems to map to Area1: the delivery process
- Measurement seems to map to Area2: the feedback process
- Culture to Area3 : embedded devs in Production
- Sharing to Area4: embedded ops in Projects
Of course automation, measurement, culture and sharing can happen in any of the areas, but some of the areas seem to have a stronger focus on each of these parts.

Conclusion
Devops areas, layers and maturity levels give us a framework to capture new practices and stories, and it can be used to identify areas of improvement related to the devops field. I'd love feedback on this. If anyone wants to help, I'd like to bring up a website where people can enter their stories in this structure and make them easily available for anyone to learn from. I don't have too many CPU cycles left currently, but I'm happy to get this going :)
P.S. @littleidea: I do want to avoid the FSOP Cycle
Conference time – Summer of 2012
It's the time of year that all conferences are gearing up. Here's a list of conferences I'm speaking at or wish I was attending.
ChefConf 12 - May 15-17 : the place to be if you're doing anything with Chef these days
GOTOCon Copenhagen - May 21-23 (me speaking) : fun conference and very well organized, although a bit too static for my taste.
Devopsdays Tokyo - May 26 : Tokyo was always on my list, but I can't go, bummer. Botchagalupe is winning :)
Atlassian Summit - May 30 - June 1 (me speaking) : really proud to be opening the devops track at my current employer. First time my employer has an explicit interest in devops. Go-go atlassian!
Kanban for Devops, Belgium - June 18-19 : I initially announced that I would be there, and I was very keen on doing so. Work got in the way, so I can't make it. But if you can, you should! I'm sure @dominica will get your WIP (that is Work in Progress :)
Velocity - June 25-27 : the uber conference on anything on web and performance
Devopsdays MountainView - June 28-29 : this year at Google, looking forward to so much fun!
Webperfdays - June 28 : interesting unconference happening on performance. Happening at the same time as Devopsdays at Google.
Puppetconf - September 27-28 : a cool place to be if you're into Puppet, or config mgmt in general. Hope I can make it this year
Velocity Europe - October 2-4 : since the success last year, Velocity Europe strikes again: Web Performance isn't a US only concern!
Devopsdays Italy - October 6-7 : Rome, sweet rome - sun and devops - the perfect mix
AppSec USA 2012 - October 23-24 : not 100% sure on this one, but rumor has it there will be a devops track at a security conference - sounds like fun to me.
Busy times .... but .... Fun times!

