Archive → November, 2010

Package Auditing with Edison

I’ve just committed a new API URL to Edison which enables the storage of Package Name, Version and Repository linked against the ConfigurationItem FQDN. This enables you to create a plugin for your package manager that posts to a URL and inserts into a database the packages it has just installed/updated. As an example, the … Read more

Devops – The War Is Over – if You Want It



These are the slides from my presentations at Scrum NL ("Scrum Operations") and at Xpdays 2010 ("Devops why should developers care").

The presentation has three parts:

  • An introductory story
  • A comparison on the technical level of infrastructure as code with coding software practices
  • Pointing out the similarities between challenges that Agile Methods and ITIL have.

From Dev/Ops to Devops, What a difference one character makes


Here are the slides of my co-presentation with Mr. DNS @KrisBuytaert both at t-dose 2010 and devoxx 2010.

It was really interesting to bundle our experiences and discuss the subject. We'll be back next year, I'm sure.

Video Q&A: Adam Rosien on how Continuous Deployment relies on great testing

Last night, Adam Rosien of Wealthfront (formerly called KaChing) gave a presentation on Continuous Deployment at the Large Scale Production Engineering Meetup on Yahoo's campus.

Afterwards, I grabbed Adam for a quick chat about a topic that has been troubling me: 

Newcomers to the Continuous Deployment idea often overlook the importance of fully automated testing and instead focus their attention on the number of deployments these companies are making per day.

In the video below, Adam stresses how important a role testing plays in Continuous Deployment. Testing really is the linchpin on which a successful implementation of the Continuous Deployment methodology relies. 

If you are an active reader of this blog or listen to the DevOps Cafe podcast, you probably know by now that we are big fans of the process, tooling, and culture being cultivated by the folks at Wealthfront. Previous content featuring them can be found here, here, and here.

They are not clients of any of the contributors to this blog. We just think that they are a great example of a scrappy company rethinking IT Operations to maximize business value and agility. Their engineering blog is a must read. 

 

Video Q&A: Aaron Peterson and Kevin Gray on self-healing infrastructure

At LISA 2010, I caught up with Aaron Peterson (Opscode) and Kevin Gray (Dyn) after they gave a very interesting presentation/demo called "DevOps Gameday".

From the title, I think a number of attendees were expecting to see the standard Dev to Ops promotion/deployment of code that is so common to the DevOps discussion. Instead the presenters (Opscode, Zenoss, Dyn Inc.) focused on what happens when you have a failure after the code has been deployed. This demo was about self-healing infrastructure... breaking a multi-node system and having it heal itself.

Of course, this kind of canned demo isn't all that new in the vendor world. However, what is very interesting about their efforts is they want to capture the best practices required to do it and share the code with the world through their combined project (hosted on GitHub). 

If they fulfill the mission of their open project, it's exactly the kind of "here is how you can do what the big players do" sharing that is good for our industry. 

 

Getting diffs for Puppet catalogs

Puppet compiles its manifests into a catalog. The catalog is derived from your code and is something that can be executed on your node.

This model is very different from other configuration management systems, which tend to execute top down and just run through the instructions in a very traditional manner.

Having a compiled artifact has many advantages, most of which aren’t really exposed to users today. I have a lot of ideas on how I would like to use the catalog and the graph it contains. The first idea is to be able to compare catalogs and identify changes between versions of your code.

For this discussion I’ll start with the code below:

class one {
    file{"/tmp/test": content => "foo" }
}
 
class two {
    include one
 
    file{"/tmp/test1": content => "foo";
 
         "/tmp/test2": content => "foo";
    }
}
 
include two

When I run it I get 3 files:

-rw-r--r-- 1 root root 3 Nov 14 11:32 /tmp/test
-rw-r--r-- 1 root root 3 Nov 14 11:32 /tmp/test1
-rw-r--r-- 1 root root 3 Nov 14 11:31 /tmp/test2

Being able to diff the catalog has a lot of potential. Often when you look at a diff of code it’s hard to know what the end result would be, especially if you use inheritance heavily or if your code relies on external data such as extlookup. Since the puppet master now supports compiling catalogs and spitting them out to STDOUT, you can also compile machine catalogs on a staging master and compare them against the production catalogs without any risk.

The other use case could be during major version upgrades, where you wish to validate that the next release of Puppet will behave the same way as the old one. We’ve had problems in the past where 0.24.x would evaluate templates differently from later versions and unexpected changes got rolled out to our machines.

Let’s make a change to our code above. Here’s the diff of our change:

--- test.pp     2010-11-14 11:35:57.000000000 +0000
+++ test2.pp    2010-11-14 11:36:06.000000000 +0000
@@ -5,6 +5,8 @@
 class two {
     include one
 
+    File{ mode => 400 }
+
     file{"/tmp/test1": content => "foo";
 
          "/tmp/test2": content => "foo";

This is the kind of thing you’ll see in mail if you have your SCM set up to mail diffs, or while sitting in a change control meeting. The change looks simple enough: you just want to change the mode of /tmp/test1 and /tmp/test2 to 400 rather than the default.

When you run this code, though, you’ll see that /tmp/test also changes! This is because resource defaults apply to included classes too. This is exactly the kind of situation that is very hard to pick up from a diff, which makes it hard to guess the full impact of the change.

My diff tool would have shown you this (format slightly edited):

Resource counts:
        Old: 516
        New: 516
 
Catalogs contain the same resources by resource title
 
 
Individual Resource differences:
Old Resource:
        file{"/tmp/test": content => acbd18db4cc2f85cedef654fccc4a4d8 }
New Resource:
        file{"/tmp/test": mode => 400, content => acbd18db4cc2f85cedef654fccc4a4d8 }
 
Old Resource:
        file{"/tmp/test1": content => acbd18db4cc2f85cedef654fccc4a4d8 }
New Resource:
        file{"/tmp/test1": mode => 400, content => acbd18db4cc2f85cedef654fccc4a4d8 }
 
Old Resource:
        file{"/tmp/test2": content => acbd18db4cc2f85cedef654fccc4a4d8 }
New Resource:
        file{"/tmp/test2": mode => 400, content => acbd18db4cc2f85cedef654fccc4a4d8 }

Here you can clearly see all 3 files will be changed and not just two. With this information you’d be much better off in your change control meeting than before.

The diff tool works in a bit of a roundabout manner, and I hope to improve the usage in the near future. First you need to dump the catalogs into a format unique to this tool set, and then you can diff this intermediate format. The reason is that the catalogs you compare may come from different versions of Puppet, so you need to go via an intermediate format.
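To make the idea of diffing via an intermediate format concrete, here is a minimal sketch in Python. It is not the tool described in this post: the dictionary layout (resources keyed by type and title) and the function name `diff_catalogs` are assumptions made purely for illustration. It reuses the three file resources from the example above.

```python
# Hypothetical intermediate format (an assumption, not the real tool's format):
# {(resource_type, title): {parameter: value}}

def diff_catalogs(old, new):
    """Return (only_in_old, only_in_new, changed) for two catalogs."""
    only_in_old = sorted(set(old) - set(new))
    only_in_new = sorted(set(new) - set(old))
    changed = {}
    for key in sorted(set(old) & set(new)):
        if old[key] != new[key]:
            # Keep both versions so the caller can print old vs. new.
            changed[key] = (old[key], new[key])
    return only_in_old, only_in_new, changed


md5_foo = "acbd18db4cc2f85cedef654fccc4a4d8"  # content checksum from the output above
titles = ["/tmp/test", "/tmp/test1", "/tmp/test2"]

old_catalog = {("file", t): {"content": md5_foo} for t in titles}
new_catalog = {("file", t): {"content": md5_foo, "mode": "400"} for t in titles}

removed, added, changed = diff_catalogs(old_catalog, new_catalog)
for (rtype, title), (before, after) in changed.items():
    print(f'{rtype}{{"{title}"}}: {before} -> {after}')
```

Because both catalogs are reduced to plain dictionaries first, the comparison does not care which Puppet version produced them. That is the point of going through an intermediate format.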

There’s one thing worth noting. I initially wrote it to help with a migration from 0.24.8 to 0.25.x or even 2.6.x, and in my initial tests this seemed fine. On more extensive testing with bigger catalogs, however, I noticed a number of strange things in the 0.24.x catalog format. First, it doesn’t contain all the properties for defined types. Second, it sets a whole lot of extra properties on resources, filling in blanks left by the user.

What this means is that if you diff a 0.24.x catalog against the same code on newer versions, you’ll likely see it complain that all your defined type resources are missing from the 0.24 catalog, and you might also get some false positives on resource diffs. I can’t do much about the missing resources, but I can clear up the false positives. I already handle the ones in my manifests, but there are no doubt more. If you let me know of them, I’ll see about working around them too.

The code for this can be found in my GitHub account. It’s still a bit of a work in progress as I haven’t actually done my migration yet, so subscribe to the repo; there are likely to be frequent changes still.

Edison gets rudimentary templating for kickstarts…

Edison now has basic support for templating in kickstart/FAI files. Using the templates: http://edison/api/kickstart/ returns the value from the AutoInstallFile field on the Configuration Item Profile when sent the X-RHN-Provisioning-Mac-0 header. The kickstart output is based upon the value in the ConfigurationItemProfile.AutoInstallFile field. There is now support for rudimentary templating: <<hostname>> is replaced by … Read more

Fix it or Kick It and the ten minute maxim

One of the things I brought up in my presentation to the Atlanta DevOps group was the concept of "Payment". One of the arguments that people like to trot out when you suggest an operational shift is that "We can't afford to change right now". My argument is that you CAN'T afford NOT to change. It's going to cost you more in the long run. The problem is that in many situations, the cost is detached from the original event.

Take testing. Let's assume you don't make unit testing an enforced part of your development cycle. There are tons of reasons people do this but much of it revolves around time. We don't have time to write tests. We don't have time to wait for tests to run. We've heard them all. Sure you get lucky. Maybe things go out the door with no discernible bugs. But what happens 3 weeks down the road when the same bug that you solved 6 weeks ago crops up again? It's hard to measure the cost when it's so far removed from the origination.

Configuration management is the same way. I'm not going to lie. Configuration management is a pain in the ass, especially if you didn't make it a core concept from inception. You have to think about your infrastructure a bit. You'll have to duplicate work initially (e.g. templating config files). It's not easy but it pays off in the long run. However, as with so many things, the cost is detached from the original purchase.

Fix it?

Walk with me into my imagination. A scary place where a server has started to misbehave. What's your initial thought? What's the first thing you do? You've seen this movie and done this interview:

  • log on to the box
  • perform troubleshooting
  • think
  • perform troubleshooting
  • call vendor support (if it's an option)
  • update trouble ticket system
  • wait
  • troubleshoot
  • run vendor diag tools

What's the cost of all that work? What's the cost of that downtime? Let's be generous. Let's assume this is a physical server and you paid for 24x7x4 hardware support and a big old RHEL subscription. How much time would you spend on each task? What's the turn around time to getting that server back into production?

Let's say that the problem was resolved WITHOUT needing replacement hardware but came in at the four hour mark. That's four hours that the server was costing you money instead of making you money. Assuming a standard SA salary of $75k/year in Georgia, that works out to $150. That's just doing a base salary conversion, not calculating all the other overhead associated with staffing an employee. What if that person consulted with someone else during that time, a coworker at the same rate, for two of those hours? That brings it to $225. Not too bad, right? Still a tangible cost. Maybe one you're willing to eat.
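The salary conversion can be spelled out as a quick calculation. This sketch assumes a 2,000-hour working year (50 weeks of 40 hours), which is my assumption rather than a figure from the presentation; it happens to reproduce the $150 and $225 totals.

```python
ANNUAL_SALARY = 75_000        # base SA salary from the text
WORK_HOURS_PER_YEAR = 2_000   # assumption: 50 weeks x 40 hours
HOURLY_RATE = ANNUAL_SALARY / WORK_HOURS_PER_YEAR  # 37.50 per hour


def labor_cost(hours_by_person):
    """Base-salary cost of an incident, given the hours each person spent."""
    return sum(hours_by_person) * HOURLY_RATE


admin_only = labor_cost([4])        # one admin, four hours
with_coworker = labor_cost([4, 2])  # plus a coworker for two of those hours

print(f"admin only:    ${admin_only:,.2f}")
print(f"with coworker: ${with_coworker:,.2f}")
```

Overhead such as benefits would push the hourly rate up, so treat these numbers as a floor.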

But let's assume the end result was to wipe and reinstall. Let's say it takes another hour to get back to operational status. Whoops. Forgot to make that tweak to Apache that we made a few weeks ago. Let's spend an hour troubleshooting that.

But we're just talking manpower at this point. This doesn't even take into account end-user productivity, loss of customers from degraded performance, or a host of other issues. God forbid that someone misses something that causes problems in other parts of the environment (like not setting the clock and inserting invalid timestamps into the database; forget for a moment that you shouldn't let your app server handle timestamps). Now there's cleanup. All told, your people spent 5 hours getting this server back into production while you've been running in a degraded state. What does that mean when our LOB is financial services and we have an SLA with attached penalties? I'm going to go easy on you and let you off with $10k per hour of degraded performance.

Get ready to credit someone $50k or worse cut a physical check.

Kick it!

Now I'm sure everyone is thinking about things like having enough capacity to maintain your SLA even with the loss of one or two nodes but be honest. How many companies actually let you do that? Companies will cut corners. They roll the dice or worse have a misunderstanding of HA versus capacity planning.

What you should have done from the start was kick the box. By kicking the box, I mean performing the equivalent of a kickstart or jumpstart. You should, at ANY time, be able to reinstall a box with no user interaction (other than the action of kicking it) and return it to service in 10 minutes. I'll give you 15 minutes for good measure and bad cabling. My RHEL/CentOS kickstarts are done in 6 minutes on my home network and most of that time is the physical hardware power cycling. With virtualization you don't even have a discernible bootup time.

Unit testing for servers

I'll go even farther. You should be wiping at least one of your core components every two weeks. Yes. Wiping. It should be a part of your deploy process, in fact. You should be absolutely sure that, should you ever need to reinstall under duress, you can get that server back into service in an acceptable amount of time. Screw the yearly DR tests. I'm giving you a world where you can perform twice-monthly DR tests as a matter of standard operation. All it takes is a little bit of up front planning.
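One way to treat a rebuild like a unit test is to time each drill and grade it against the budget. The sketch below is only an illustration: the function name `grade_rebuild` and the pass/warn/fail labels are made up, while the 10 minute target and 15 minute allowance come from the numbers given earlier in this post.

```python
from datetime import datetime, timedelta

TARGET = timedelta(minutes=10)     # the goal for returning a box to service
ALLOWANCE = timedelta(minutes=15)  # slack "for good measure and bad cabling"


def grade_rebuild(kicked_at, back_in_service_at):
    """Grade one rebuild drill by how long the box was out of service."""
    elapsed = back_in_service_at - kicked_at
    if elapsed <= TARGET:
        return "pass"
    if elapsed <= ALLOWANCE:
        return "warn"
    return "fail"


# Hypothetical drill: a kickstart that finishes in six minutes.
start = datetime(2010, 11, 14, 11, 30)
print(grade_rebuild(start, start + timedelta(minutes=6)))
```

Recording the result of every drill gives you a history to point at when someone asks whether a reinstall under duress would actually work.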

The 10 minute maxim

I have a general rule. Anything that has to be done in ten minutes can be afforded twenty minutes to think it through. Obviously, it's a general rule. The guy holding the gun might not give you twenty minutes. And twenty minutes isn't a hard number. The point is that nothing is generally so critical that it has to be SOLVED that instant. You can spend a little more time up front to do things right or you can spend a boatload of time on the backside trying to fix it.

Given the above scenario, you would think I'm being hypocritical or throwing out my own rule. I'm not. The above scenario should have never happened. This is a solved problem. You should have spent 20 minutes actually putting the config file you just changed into puppet instead of making undocumented ad-hoc changes. You should have spent an hour when bringing up the environment to stand up a CM tool instead of just installing the servers and doing everything manually. That's the 10 minute maxim. Take a little extra time now or take a lot of time later.

You decide how much you're willing to spend.

DevOps is not a technology problem. DevOps is a business problem.

Since Patrick Debois called for the first DevOps Days event and unleashed the term "DevOps" upon the world, there is no denying that DevOps has evolved into a global movement.

Of course, DevOps has its detractors. Negative opinions range from the misguided ("DevOps is a new name for a Sys Admin") to the dismissive ("DevOps is just some crazy Devs trying to get rid of Ops" or "DevOps is just some crazy Ops trying to act like Devs so they will be better liked") to the outright offended (whose arguments tend to defy logic).

I've spent the past nine months or so overcoming resistance to the DevOps movement in both public forums and inside client companies. During that time, I've begun to notice a common misconception that I think is fueling much of the negative initial reaction that some people have to DevOps ideas. I want to take a shot at clearing it up now:

DevOps is not a technology problem.

Technology plays a key part in enabling solutions to DevOps problems. However, DevOps itself is fundamentally a business problem.

What does the business have to do with DevOps?

The most fundamental business process in any company is getting an idea from inception to where it is making you money.

 

Within that business process there are all kinds of activities that need to happen, some technology-driven and some human-driven. This is where all of the different functions of IT come into play. Developers, QA, Architecture, Release Engineering, Security, Operations, etc. each do their part to fulfill that process.

But if you take away the context of the business process, what have you got? You've got a bunch of people and groups doing their own thing. You lose any real incentive to fight inefficiency, duplication of effort, conflicts, and disconnects between those groups. It's every person for themselves, literally.

But you know what else happens if you remove the context of the business process? Your job eventually goes away. Enabling the business is why we get paychecks and why we get to spend our time doing what we do.

If there is no business to enable or we don't do any business enabling, this all just turns into a hobby. And by definition, it's pretty difficult to get paid for a hobby. 

The whole point of DevOps is to enable your business to react to market forces as quickly, efficiently, and reliably as possible. Without the business, there is no other reason for us to be talking about DevOps problems, much less spending any time solving them.

  

 

Doesn't this sound a lot like the goals of Agile?

If the goals of DevOps sound similar to the goals of Agile, it's because they are. But Agile and DevOps are different things. You can be great at Agile Development but still have plenty of DevOps issues. On the flip side of that coin, you could do a great job removing many DevOps issues and not use Agile Development methodologies at all (although that is increasingly unlikely).

I like to describe Agile and DevOps as related ideas that share a common Lean ancestry but work on different planes. While Agile deep dives into improving one major IT function (delivering software), DevOps works on improving the interaction and flow across IT functions (stretching the length of the entire development to operations lifecycle).

 

But I thought DevOps was all about cool tools?

Technology is the great enabler for making almost any business process more efficient, scalable, and reliable. However, we have to remember that on their own, tools are just tools.

It's just as likely that you'll use a new tool to reinforce bad habits and old broken processes as you will to improve your organization. It's the desired effect on the business process you are supporting that dictates why and how a tool is best used.

When people are clear on what their DevOps problems are and what process improvements need to happen to remedy those DevOps problems, the tool discussion becomes rather straightforward (if not trivial). 

Since the nascent DevOps movement is mostly made up of technologists, it's easy to understand why there is such excitement to jump right into tooling discussions. But perhaps we need to do more to make sure that everyone is up to speed on why the tools are needed and what the desired business process improvements are before diving into our standard "Puppet vs. Chef" or "Files Centric vs. Package Centric" arguments.

 

If DevOps is about a business process then why is it called "DevOps"?

In my opinion, one fault of the early DevOps conversations is that it wasn't immediately clear just how big the scope of this problem truly is. Now that we have a year of perspective behind us, it turns out that we have been attacking one of the biggest problems in all of business: How to enable a business to react to market forces as quickly as possible.

But alas, the conversation had to start somewhere so it gravitated to the almost universal problem of conflict and disconnect between Developer culture and Operations culture. Every org chart is different, but it's fairly easy to cartoonishly divide the world into a Dev camp and Ops camp for the sake of having common reference points to discuss (even though we all know the world is much more complex and grey).

Within that cartoon Dev/Ops example, most of the early DevOps attention has been about improving deployment activity. Since change activity makes up the bulk of the work across an IT organization, that too was a logical and natural place to start.

Maybe Patrick should have called that first event "BizDevQASecurityOpsCloudUsers Days" or "SolvingABroaderProblemThanAgile Days"... But I doubt anyone would have shown up.