
Archive → April, 2011

Murder for fun and profit

Deploying web applications can be a real nightmare at times, especially when you have numerous SVN repositories of code which all link together when installed on the server to create your application. I’ve started using Murder to try and work around the headaches and apart from a very small issue (which I’ll discuss at the […]

cucumber-vhost gets a config file…

After reading the thread in the Devops-Toolchain Google Group (http://bit.ly/devops-vmth), I realised it was about time I dusted down Cucumber-Vhost and gave it a quick once-over. The main addition tonight is way overdue and is the simple addition of a configuration file.  I chose YAML for the config file because XML is not a human […]

Kanban for Sysadmin

This article was originally published in December 2009, in Jordan Sissel's SysAdvent

Unless you've been living in a remote cave for the last year, you've probably noticed that the world is changing. With the maturing of automation technologies like Puppet, the popular uptake of Cloud Computing, and the rise of Software as a Service, the walls between developers and sysadmins are beginning to be broken down. Increasingly we're beginning to hear phrases like 'Infrastructure is code', and terms like 'Devops'. This is all exciting. It also has an interesting knock-on effect. Most development environments these days are at least strongly influenced by, if not run entirely according to 'Agile' principles. Scrum in particular has experienced tremendous success, and adoption by non-development teams has been seen in many cases. On the whole the headline objectives of the Agile movement are to be embraced, but the thorny question of how to apply them to operations work has yet to be answered satisfactorily.

I've been managing systems teams in an Agile environment for a number of years, and after thought and experimentation, I can recommend using an approach borrowed from Lean systems management, called Kanban.

Operations teams need to deliver business value

As a technical manager, my top priority is to ensure that my teams deliver business value. This is especially important for Web 2.0 companies - the infrastructure is the platform - is the product - is the revenue. Especially in tough economic times it's vital to make sure that as sysadmins we are adding value to the business.

In practice, this means improving throughput - we need to be fixing problems more quickly, delivering improvements in security, performance and reliability, and removing obstacles to enable us to ship product more quickly. It also means building trust with the business - improving the predictability and reliability of delivery times. And, of course, it means improving quality - the quality of the service we provide, the quality of the staff we train, and the quality of life that we all enjoy - remember - happy people make money.

The development side of the business has understood this for a long time. Aided by Agile principles (and implemented using such approaches as Extreme Programming or Scrum) developers organise their work into iterations, at the end of which they will deliver a minimum marketable feature, which will add value to the business.

The approach may be summarised as moving from the historic model of software development as a large team taking a long time to build a large system, towards small teams, spending a small amount of time, building the smallest thing that will add value to the business, but integrating frequently to see the big picture.

Systems teams starting to work alongside such development teams are often tempted to try the same approach.

The trouble is, for a systems team, committing to a two-week plan, and setting aside time for planning and retrospective meetings, prioritisation and estimation sessions, just doesn't fit. Sysadmin work is frequently interrupt-driven; demands on time are uneven, often highly specialised, and require concentrated focus. Radical shifts in prioritisation are normal. It's not even possible to commit to much shorter sprints of a day, as sysadmin work also includes project and investigation activities that couldn't be delivered in such a short space of time.

Dan Ackerman recently carried out a survey in which he asked sysadmins their opinions and experience of using agile approaches in systems work[1]. The general feeling was that it helped encourage organisation, focus and coordination, but that it didn't seem to handle the reactive nature of systems work, and the prescription of regular meetings interrupted the flow of work. My own experience of sysadmins trying to work in iterations is that they frequently fail their iterations, because the world changed (sometimes several times) and the iteration no longer captured the most important things. A strict, iteration-based approach just doesn't work well for operations - we're solving different problems. When we contrast a highly interdependent systems team with a development team who work together for a focussed time, answering to themselves, it's clear that the same tools won't necessarily be appropriate.

What is Kanban, and how might it help?

Let's keep this really really simple. You might read other explanations making it much more complicated than necessary. A Kanban system is simply a system with two specific characteristics. Firstly, it is a pull-based system. Work is only ever pulled into the system, on the basis of some kind of signal. It is never pushed; it is accepted, when the time is right, and when there is capacity to do the work. Secondly, work in progress (WIP) is limited. At any given time there is a limit to the amount of work flowing through the system - once that limit is reached, no more work is pulled into the system. Once some of that work is complete, space becomes available and more work is pulled into the system.
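
To make those two characteristics concrete, here is a minimal sketch in Python (not any particular Kanban tool - the class and method names are purely illustrative) of a pull-based system with a WIP limit:

    class KanbanSystem:
        """Minimal sketch: work is pulled, never pushed, and WIP is capped."""

        def __init__(self, wip_limit):
            self.wip_limit = wip_limit
            self.backlog = []        # work waiting outside the system
            self.in_progress = []    # work currently flowing through the system

        def signal(self, card):
            """New work only ever joins the backlog; it is never pushed into WIP."""
            self.backlog.append(card)

        def pull(self):
            """Pull the next card only when there is spare capacity."""
            if self.backlog and len(self.in_progress) < self.wip_limit:
                card = self.backlog.pop(0)
                self.in_progress.append(card)
                return card
            return None              # at the limit: no more work enters the system

        def complete(self, card):
            """Completing a card frees capacity, so more work can be pulled."""
            self.in_progress.remove(card)

When complete() frees a slot, the next call to pull() succeeds again - which is exactly the 'space becomes available and more work is pulled into the system' behaviour described above.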

Kanban as a system is all about managing flow - getting a constant and predictable stream of work through, whilst improving efficiency and quality. This maps perfectly onto systems work - rather than viewing our work as a series of projects, with annoying interruptions, we view our work as a constant stream of work of varying kinds.

As sysadmins we are not generally delivering product, in the sense that a development team are. We're supporting those who do, addressing technical debt in the systems, and looking for opportunities to improve resilience, reliability and performance.

Supporting tools

Kanban is usually associated with some tools to make it easy to implement the basic philosophy. Again, keeping it simple, all we need is a stack of index cards and a board.

The word Kanban itself means 'Signal Card' - a kanban is a token which represents a piece of work which needs to be done. This maps conveniently onto the agile 'story card'. The board is a planning tool and an information radiator. Typically it is organised into the various stages on the journey that a piece of work goes through. This could be as simple as to-do, in-progress, and done, or could feature more intermediate steps.

The WIP limit controls the amount of work (or cards) that can be on any particular part of the board. The board makes visible exactly who is working on what, and how much capacity the team has. It provides information to the team, and to managers and other people, about the progress and priorities of the team.
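
A board like this is easy to represent as a simple data structure. The sketch below is only an illustration (the column names, cards and limits are invented), but it shows that the one thing the board really enforces is a per-column WIP limit on the movement of cards:

    # A hypothetical board: each column holds cards and may carry a WIP limit.
    board = {
        'to-do':       {'limit': None, 'cards': ['patch web servers', 'rotate backups']},
        'in-progress': {'limit': 2,    'cards': ['investigate disk alerts']},
        'done':        {'limit': None, 'cards': []},
    }

    def move(card, src, dst):
        column = board[dst]
        if column['limit'] is not None and len(column['cards']) >= column['limit']:
            return False                      # WIP limit reached: the card has to wait
        board[src]['cards'].remove(card)
        column['cards'].append(card)
        return True

    move('patch web servers', 'to-do', 'in-progress')   # succeeds: one slot is free
    move('rotate backups', 'to-do', 'in-progress')      # refused: limit of 2 reached

Everything else - who is working on what, and how much capacity remains - can be read straight off the same structure, which is why a physical board of cards is all most teams need.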

Kanban teams abandon the concept of iterations altogether. As Andrew Clay Shafer once said to me: "We will just work on the highest priority 'stuff', and kick-ass!"

[Image: The Radisson Edwardian]

How does Kanban help?

Kanban brings value to the business in three ways - it improves trust, it improves quality and it improves efficiency.

Trust is improved because the team very rapidly starts being able to deliver on the highest priority work. There's no iteration overhead, it is absolutely transparent what the team is working on, and, because the responsibility for prioritising the work to be done lies outside the technical team, the business soon begins to feel that the team really is working for them.

Quality is improved because the WIP limit makes problems visible very quickly. Let's consider two examples - suppose we have a team of four sysadmins:

The team decides to set a WIP limit of one. This means that the team as a whole will only ever work on one piece of work at a time. While that work is being done, everything else has to wait. The effect of this is that all four sysadmins will need to work on the same issue simultaneously. This will result in very high quality work, and the tasks themselves should get done fairly quickly, but it will also be wasteful. Work will start queueing up ahead of the 'in progress' section of the board, and the flow of work will be too slow. Also, it won't always be possible for all four people to work on the same thing, so for some of the time the other sysadmins will be doing nothing. This will be very obvious to anyone looking at the board. Fairly soon it will become apparent that the WIP limit of one is too low.

Suppose we now decide to increase the WIP limit to ten. The sysadmins go their own ways, each starting work on a card. Progress on each card will be slower, because there's only one person working on it, and the quality may not be as good, as individuals are more likely to make mistakes than pairs. The individual sysadmins also don't concentrate as well on their own, but work is still flowing through the system. Fairly soon, however, something will come up which makes progress difficult. At this stage a sysadmin will pick another card and work on that. Eventually two or three cards will be 'stuck' on the board, with no progress, while work flows around them owing to the large WIP limit. Eventually we might hit a big, system-wide problem that halts progress on all work, and perhaps even impacts other teams. It turns out that this problem was the reason why work stopped on those earlier tasks. The problem gets fixed, but the impact on the team's productivity is significant, and the business has been impacted too. Had the WIP limit been lower, the team would have been forced to react sooner.

The board also makes it very clear to the team, and to anyone following the team, what kind of work patterns are building up. As an example, if the team's working cadence seems to be characterised by a large number of interrupts, especially for repeatable work, or to put out fires, that's a sign that the team is paying interest on technical debt. The team can then make a strong case for tackling that debt, and the WIP limit protects the team as they do so.

Efficiency is improved simply because this method of working has been shown to be the best way to get a lot of work through a system. Kanban has its origins in Toyota's lean processes, and has been explored and used in dozens of different kinds of work environment. Again, the effects of the WIP limit, and the visibility of their impact on the board, make it very easy to optimise the system and reduce the cycle time - that is, the time it takes to complete a piece of work once it enters the system.

Another benefit of Kanban boards is that they encourage self-management. At any time, any team member can look at the board and see at once what is being worked on, what should be worked on next and, with a little experience, where the problems are. If there's one thing sysadmins hate, it's being micro-managed. As long as there is commitment to respect the board, a sysops team will self-organise very well around it. Happy teams produce better quality work, at a faster pace.

How do I get started?

If you think this sounds interesting, here are some suggestions for getting started.

  • Have a chat to the business - your manager and any internal stakeholders. Explain to them that you want to introduce some work practices that will improve quality and efficiency, but which will mean that you will be limiting the amount of work you do - i.e. you will have to start saying no. Try the puppy dog close: "Let's try this for a month - if you don't feel it's working out, we'll go back to the way we work now".

  • Get the team together, buy them pizza and beer, and try playing some Kanban games. There are a number of ways of doing this, but basically you need to come up with a scenario in which the team has to produce things, but the work is going to be limited and only accepted when there is capacity. Speak to me if you want some more detailed ideas - there are a few decent resources out there.

  • Get the team together for a white-board session. Try to get a sense of the kinds of phases your work goes through. How much emergency support work is there? How much general user support? How much project work? Draw up a first cut of a Kanban board, and imagine some scenarios. The key thing is to be creative. You can make work flow left to right, or top to bottom. You can use coloured cards or plain cards - it doesn't matter. The point of the board is to show what work is being done, by whom, and to make explicit what the WIP limits are.

  • Set up your Kanban board somewhere highly visible and easy to get to. You could use a whiteboard and magnets, a cork board and pins, or just stick cards to a wall with Blu-Tack. You can draw lines with a ruler, or you can use insulating tape to give bold, straight dividers between sections. Make it big, and clear.

  • Agree your WIP limit amongst yourselves - it doesn't matter what it is - just pick a sensible number, and be prepared to tweak it based on experience.

  • Gather your current work backlog together and put each piece of work on a card. If you can, sit with the various stakeholders for whom the work is being done, so you can get a good idea of what the acceptance criteria are, and their relative importance. You'll end up with a huge stack of cards - I keep them in a card box, next to the board.

  • Get your manager and any stakeholders together, and have a prioritisation session. Explain that there's a work in progress limit, but that work will get done quickly. Your team will work on whatever is agreed to be the highest priority. Then stick the highest priority cards to the left of (or above) the board. I like to have a 'Next Please' section on the board, with its own WIP limit. Cards can be added to or removed from this section by anyone, and the team will pull from it when capacity becomes available.

  • Write up a team charter - decide on the rules. You might agree not to work on other people's cards without asking first. You might agree times of the day you'll work. I suggest two very important rules - once a card goes onto the in progress section of the board, it never comes off again, until it's done. And nobody works on anything that isn't on the board. Write the charter up, and get the team to sign it.

  • Have a daily standup meeting at the start of the day. At this meeting, unlike a traditional scrum or XP standup, we don't need to ask who is working on what, or what they're going to work on next - that's already on the board. Instead, talk about how much more is needed to complete the work, and discuss any problems or impediments that have come up. This is a good time for the team to write up cards for work they feel needs to be done to make their systems more reliable, or to make their lives easier. I recommend trying to get agreement from the business to always ensure one such card is in the 'Next Please' section.

  • Set up a ticketing system. I've used RT and Eventum. The idea is to reduce the number of interrupts, and to make it easy to track whatever work is being carried out. We have a rule of thumb that everything needs a ticket. Work that can be carried out within about ten minutes can just be done, at the discretion of the sysadmin. Anything that's going to take longer needs to go on the board. We have a dedicated 'Support' section on our board, with a WIP limit. If there are more support requests than slots on the board, it's up to the requestors to agree amongst themselves which has the greatest business value (or cost).

  • Have a regular retrospective. I find fortnightly is enough. Set aside an hour or so, buy the team lunch, and talk about how the previous fortnight has been. Try to identify areas for improvement. I recommend using 'SWOT' (strengths, weaknesses, opportunities, threats) as a template for discussion. Also try to get into the habit of asking 'Five Whys' - keep asking why until you really get to the root cause. Also try to ensure you fix things 'Three ways'. These habits are part of a practice called 'Kaizen' - continuous improvement. They feed into your Kanban process, and make everyone's life easier, and improve the quality of the systems you're supporting.

The use of Kanban in development and operations teams is an exciting new development, but one which people are finding fits very well with a devops kind of approach to systems and development work. If you want to find out more, I recommend the following resources:

  • http://limitedwipsociety.org - the home of Kanban for software development; a central place where ideas, resources and experiences are shared.
  • http://finance.groups.yahoo.com/group/kanbandev - the mailing list for people deploying Kanban in a software environment - full of very bright and experienced people
  • http://www.agileweboperations.com - excellent blog covering all aspects of agile operations from a devops perspective

[1] http://www.agileweboperations.com/what-do-sysadmins-really-think-about-agile/

Amazon Web Services, Hosting in the Cloud and Configuration Management

Amazon is probably the biggest cloud provider in the industry – they certainly have the most features and are adding more at an amazing rate.

Amongst the long list of services provided under the AWS (Amazon Web Services) banner are:

  • Elastic Compute Cloud (EC2) – scalable virtual servers based on the Xen Hypervisor.
  • Simple Storage Service (S3) – scalable cloud storage.
  • Elastic Load Balancing (ELB) – high availability load balancing and traffic distribution.
  • Elastic IP Addresses – static IP addresses that can be re-assigned between EC2 instances.
  • Elastic Block Store (EBS) – persistent storage volumes for EC2.
  • Relational Database Service (RDS) – scalable, MySQL-compatible database services.
  • CloudFront – a Content Delivery Network (CDN) for serving content from S3.
  • Simple Email Service (SES) – for sending bulk e-mail.
  • Route 53 – a highly available and scalable Domain Name System (DNS) service.
  • CloudWatch – monitoring of resources such as EC2 instances.

Amazon provides these services in 5 different regions:

  • US East (Northern Virginia)
  • US West (Northern California)
  • Europe (Ireland)
  • Asia Pacific – Tokyo
  • Asia Pacific – Singapore

Each region has its own pricing and features available.

Within each region, Amazon provides multiple “Availability Zones”. These zones are completely isolated from each other – probably in separate data centers – and Amazon describes them as follows:

Q: How isolated are Availability Zones from one another?
Each availability zone runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone.
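
As a concrete illustration, the boto library (a Python interface to AWS) can list the Availability Zones within a region in a few lines of code. This is only a sketch – the region name is an example, and credentials are assumed to come from your boto configuration or environment:

    import boto.ec2

    # Zones only make sense within a region, so connect to one region first.
    conn = boto.ec2.connect_to_region('us-east-1')

    for zone in conn.get_all_zones():
        print('%s: %s' % (zone.name, zone.state))   # e.g. us-east-1a: available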

However, unless you have been offline for the past few days, you will no doubt have heard about the extended outage Amazon has been having in their US East region. The outage started on Thursday, 21st April 2011, taking down some big-name sites such as Reddit, Quora, Foursquare and Heroku, and the problems are still ongoing now, nearly two days later – with Reddit and Quora still running in an impaired state.

I have to confess, my first reaction was one of surprise that such big names didn’t have more redundancy in place – however, once more information came to light, it became apparent that the outage was affecting multiple availability zones – something Amazon seems to imply above shouldn’t happen.

You may well ask why such sites are not split across regions to give more isolation against such outages. The answer lies in the implementation of the zones and regions in AWS. Although isolated, the zones within a single region are close enough together that low cost, low latency links can be provided between them. Once you start trying to run services across regions, all inter-region communication goes over the normal internet and is therefore comparatively slow, expensive and unreliable, so it becomes much more difficult and expensive to keep data reliably synchronised. This, coupled with Amazon’s claims above about the isolation between zones and with accepted best practices, has led to the common setup being to split services over multiple availability zones within the same region – and what makes this outage worse is that US East is the most popular region, due to it being a convenient location for sites targeting both the US and Europe.

On the back of this, many people are giving both Amazon and cloud hosting a good bashing, both in blog posts and on Twitter.

Where Amazon has let everyone down in this instance is that they allowed a problem (which in this case is largely centered around EBS) to affect multiple availability zones, thus screwing everyone who had either not implemented redundancy or had followed Amazon’s own guidelines and assurances of isolation. I also believe that their communication has been poor; had customers been aware it would take so long to get back online, they may have been in a position to look at other measures to recover much sooner.

In reality though, both Amazon and cloud computing have less to do with this problem – and more specifically with the blame being associated with it – than the current bashing suggests. At the end of the day, we work in an industry that is susceptible to failure. Whether you are hosting on bare metal or in the cloud, you will experience failure sooner or later, and you need to take that into account in the design of any infrastructure. Failure will happen – it’s all about mitigating the risk of that failure through measures like backups and redundancy. There is a trade-off between the cost, time and complexity of implementing multiple levels of redundancy versus the risk of failure and downtime. On each project or infrastructure setup, you need to work out where on this sliding scale you are.

In my opinion, cloud computing provides us with an easy way out of such problems. Cloud computing gives us the ability to quickly spin up new services and server instances within minutes, pay by the hour for them and destroy them when they are no longer required. Gone are the days of having to order servers or upgrades and wait in a queue for a data center technician to deal with hardware; it was the norm to incur large setup costs and/or get locked into contracts. In the cloud, instances can be resized, provisioned or destroyed in minutes, and often without human intervention, as most cloud computing providers also provide an API so users can manage their services programmatically. Under load, instances can be upgraded or additional instances brought online, and in quiet periods instances can be downgraded or destroyed, yielding a significant cost saving. Another huge bonus is that instances can be spun up for development, testing or to perform an intensive task, and thrown away afterwards.
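
As an example of that kind of automation, the whole lifecycle of an instance can be scripted with the boto library. This is a sketch rather than a recipe – the region, AMI ID and key pair name below are placeholders:

    import time
    import boto.ec2

    conn = boto.ec2.connect_to_region('eu-west-1')    # credentials from boto config/env

    # Launch a throwaway instance (AMI and key pair are placeholders).
    reservation = conn.run_instances('ami-00000000',
                                     instance_type='m1.small',
                                     key_name='my-keypair')
    instance = reservation.instances[0]

    # Wait until it is running, then use it for whatever work is needed.
    while instance.state != 'running':
        time.sleep(10)
        instance.update()
    print('Instance %s is up at %s' % (instance.id, instance.public_dns_name))

    # ... do the work ...

    # Destroy the instance when it is no longer needed, so you stop paying for it.
    conn.terminate_instances(instance_ids=[instance.id])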

Being able to spin new instances up in minutes is, however, less effective if you have to spend hours installing and configuring each instance before it can perform its task. This is especially true if more time is wasted chasing and debugging problems because something was set up differently or missed during the setup procedure. This is where configuration management tools and the ‘infrastructure as code’ principles come in. Tools such as Puppet and Chef were created to let you describe your infrastructure and configuration in code and have machines or instances provisioned or updated automatically.

Sure, with virtual machines and cloud computing things have become a little easier thanks to re-usable machine images: you can set up a certain type of system once and re-use the image for any subsequent systems of the same type. This approach is greatly limiting, though – it is time-consuming to update the image later with small changes, awkward to cope with small variations between systems, and almost impossible to keep track of which changes have been made to which instances.

Configuration Management tools like Puppet and Chef manage system configuration centrally and can:

  • Be used to provision new machines automatically.
  • Roll out a configuration change across a number of servers.
  • Deal with small variations between systems or different types of systems (web, database, app, dns, mail, development etc).
  • Ensure all systems are in a consistent state.
  • Ensure consistency and repeatability.
  • Easily allow the use of source code control (version control) systems to keep a history of changes.
  • Easily allow the provisioning of development and staging environments which mimic production.
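
To give a flavour of what describing configuration in code looks like, here is a minimal, purely illustrative Chef recipe for a web server; the cookbook, package and template names are hypothetical:

    # cookbooks/webserver/recipes/default.rb -- hypothetical example

    # Install the web server package.
    package 'nginx'

    # Manage the main config file from a template kept in version control.
    template '/etc/nginx/nginx.conf' do
      source   'nginx.conf.erb'
      owner    'root'
      group    'root'
      mode     '0644'
      notifies :reload, 'service[nginx]'
    end

    # Make sure the service is enabled and running.
    service 'nginx' do
      action [:enable, :start]
    end

Run against a brand new instance, this provisions it from scratch; run repeatedly against an existing one, it simply converges the machine back to the described state – which is where the consistency and repeatability in the list above come from.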

As time permits, I’ll publish some follow-up posts which go into Puppet and Chef in more detail and look at how they can be used. I’ll also be publishing a review of James Turnbull’s new book, Pro Puppet, which is due to go to print at the end of the month.

Highway to the Availability Zone

by @mattokeefe

With apologies to Kenny Loggins’ Danger Zone (lyrics)…


Revvin’ up your VM
Listen to her disk roar
EBS under tension
Beggin’ you to touch and go

Highway to the Availability Zone
Right into the Danger Zone

Headin’ into Cloud
Spreadin’ out her apps tonight
She got you jumpin’ off the deck
And shovin’ into oversubscription

Highway to the Availability Zone
I’ll take you
Right into the Danger Zone

AWS will never say hello to you
Until you get it on the scale of Netflix
You’ll never know what you can do
Until you deploy across three Zones

Out along the edge
Always where I burn to be
The further on the edge
Higher Akamai profitability

Highway to the Availability Zone
Gonna take you
Right into the Danger Zone
Highway to the Availability Zone

Seriously though… as far as today’s AWS EC2 outage goes: nothing to see here, move along. Many enterprises run the same risk with internal IT if they are not redundant across active/active data centers or lack proven and regularly tested failover capability.

The good news with the Cloud is that everything has an API, so you can automate your Disaster Recovery / Business Continuity process. Regularly snapshot your EBS volumes if on EC2, and recover from S3 (designed to provide 99.999999999% durability) in another Zone/Region. Tools like Cfengine, Puppet and Chef can help you recreate your entire infrastructure from source control in minutes rather than hours.
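
As a rough illustration of what “everything has an API” buys you, the following Ruby/Fog sketch snapshots every EBS volume visible in a region; run it from cron and the snapshots happen without anyone thinking about them. Attribute names are from memory, so check them against your Fog version:

    require 'fog'

    compute = Fog::Compute.new(
      :provider              => 'AWS',
      :aws_access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
      :aws_secret_access_key => ENV['AWS_SECRET_ACCESS_KEY'],
      :region                => 'us-east-1'
    )

    # Snapshot every volume in the region; put the date in the description
    # so old snapshots are easy to find and expire later.
    compute.volumes.each do |volume|
      compute.snapshots.create(
        :volume_id   => volume.id,
        :description => "nightly #{volume.id} #{Time.now.utc.strftime('%Y-%m-%d')}"
      )
    end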

Also consider cloud management solutions such as enStratus, RightScale et al to abstract cloud provider details and provide multi-cloud redundancy, including your own private internal clouds if you choose to create one or more. Or, roll your own solution using jclouds or something similar.

Remember, *you* own your availability.


Today’s EC2 / EBS Outage: Lessons learned

Today Britain woke to the news that Amazon Web Services had suffered a major outage in its US East facility. This affected Heroku, Reddit, Foursquare, Quora and many more well-known internet services hosted on EC2. The cause of the outage appears to have been a case of so-called 'auto-immune disease': Amazon's automated processes began remirroring a large number of EBS volumes, which had the knock-on effect of significantly degrading EBS (and thus RDS) performance and availability across multiple availability zones. Naturally the nay-sayers were out in force, decrying cloud-based architectures as doomed to failure from the very start. As the dust starts to settle, we attempt to distill some lessons from the outage.

Expect downtime

The first and most obvious point to make is that downtime is inevitable. Clouds fail. Datacenters fail. Disasters happen. The people trying to make some causal relationship between deploying to the cloud and general failure are missing the point.

What matters is how you respond to downtime. At Atalanta Systems we challenge our clients to switch off machines at random. If their architecture isn't built to withstand failure, we've failed in helping them. Incidentally we've been doing this for years, long before anyone ever mentioned 'chaos monkeys'.

Especially in a cloudy world, expect failure - EC2 instances can and will randomly crash. Expect this, and you won't be disappointed. From day one, expect hardware problems, expect network problems, expect your availability zone to break.

Now, of course, there's a big difference between switching off a few machines or pulling a few cables and losing a whole datacenter. However, we have to expect downtime, and we have to be ready for it. Here are a few suggestions:

Use Amazon's built-in availability mechanisms

Don't treat AWS like a traditional datacenter. Amazon provides up to four availability zones per region, and a range of free and paid-for tools for using them. Techniques for taking advantage of these features range from as simple as using elastic IP addresses and remapping manually to a different zone, to using multi-availability zone RDS instances to replicate database updates across zones.
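
The simplest of those techniques - remapping an elastic IP to a standby instance in another zone - is a one-line API call once you have a connection. A hedged Fog sketch, with a placeholder instance ID and a documentation-range address:

    require 'fog'

    compute = Fog::Compute.new(
      :provider              => 'AWS',
      :aws_access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
      :aws_secret_access_key => ENV['AWS_SECRET_ACCESS_KEY'],
      :region                => 'us-east-1'
    )

    # Re-point the public elastic IP at a warm standby in another availability zone.
    compute.associate_address('i-00000000', '203.0.113.10')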

Make use of autoscaling groups, and deploy in more than two availability zones. Latency between zones is minimal, autoscaling groups can span availability zones, and they can be configured to trigger based on utilisation. People maintaining that it costs twice as much to run a highly available infrastructure in AWS simply haven't read the documentation. Take care, though, to avoid the classic capacity fallacy: three web servers each at 60% utilisation look comfortable, but if one fails the other two must absorb its load and jump to roughly 90%, leaving no headroom and a good chance that the next spike takes them down as well.

Size your infrastructure to include headroom for load spikes, and to be able to sustain a complete AZ failure. For any business for which downtime can be measured in tens of pounds per minute (which covers even small startups), it's cheaper to build in the availability than to suffer the outage.

The problem with today's outage is that it appears to have impacted multiple availability zones. The full explanation for this has not yet been forthcoming, but it does serve to highlight that if availability really matters to you, you need to consider using multiple regions. Amazon has regions on the US East coast and West coast, in Western Europe, and two in Asia. Backing up to S3 from one region enables restore into another. CloudWatch triggers can be used to launch new instances in a different region, or even a full stack via CloudFormation. We have clients doing this on the East and West coast, without spending outrageous amounts of money.

The bottom line is that one of the key benefits of using AWS is the geographic spread it enables, together with its monitoring and scaling and balancing capabilities. Look into using these - if you're not at least exploring these areas, you're doing the equivalent of buying an iPhone and only ever using it for text messages.

Think about your use of EBS

It's not the first time there have been problems with EBS - only last month, Reddit was down for most of the day because of EBS-related issues. Here are a few things to consider when thinking about using EBS in your setup:

  • EBS is not a SAN

EBS is network-accessible block storage. It's more like a NetApp than a fibre-based storage array. Treat it as such. Don't expect to be able to use EBS effectively if your network is saturated. Also be aware that EBS (and the whole of AWS) is built on commodity hardware, and as such is not going to behave in the same way as a NetApp. You're going to struggle to get the kind of performance you'd get from a commercial SAN or NAS with battery-backed cache, but EBS is considerably cheaper.

  • EBS is multi-tenant

Remember that you're sharing disk space and IO with other people. Design with this in mind. Deploy large volumes, even if you don't need the space, to minimise contention. Consider using lots of volumes and building up your own RAID 10 or RAID 6 from EBS volumes. Think of it as a way to get as many spindles as you can, spread across as many disk-providers as possible. Avoid wherever possible using a single EBS volume - as Reddit found to their cost last month, this is not the right way to use EBS.
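
A hedged sketch of the 'many volumes' approach with Fog: create several volumes in the instance's zone and attach them, then assemble the RAID set with mdadm inside the instance. The instance ID, device names and sizes are illustrative only:

    require 'fog'

    compute = Fog::Compute.new(
      :provider              => 'AWS',
      :aws_access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
      :aws_secret_access_key => ENV['AWS_SECRET_ACCESS_KEY'],
      :region                => 'us-east-1'
    )

    instance_id = 'i-00000000'                            # placeholder instance
    devices     = %w[/dev/sdf /dev/sdg /dev/sdh /dev/sdi]

    # Four 100GB volumes in the instance's zone, one per device;
    # mdadm on the instance then builds RAID 10 across them.
    devices.each do |device|
      volume = compute.volumes.create(:availability_zone => 'us-east-1a', :size => 100)
      volume.wait_for { ready? }
      compute.attach_volume(instance_id, volume.id, device)
    end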

  • Don't use EBS snapshots as a backup

EBS snapshots are a very handy feature, but they are not backups. Although they are available across the availability zones in a given region, you can't move them between regions. If you want backups of your EBS-backed volumes, by all means use a snapshot as part of your backup strategy, but then actually do a backup - either to S3 (we use duplicity) or to another machine in a different region (we back up to EBS-backed volumes in US-EAST). Don't be afraid of bandwidth charges - run the numbers on the AWS Simple Monthly Calculator - they're not as terrifying as you might have feared.
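
For the 'then actually do a backup' step, here is a minimal sketch that pushes a compressed dump to an S3 bucket in a different region using Fog's storage API; the bucket and file names are placeholders, and a tool like duplicity adds encryption and incrementals on top of this:

    require 'fog'

    # Deliberately a different region from the volumes being backed up.
    storage = Fog::Storage.new(
      :provider              => 'AWS',
      :aws_access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
      :aws_secret_access_key => ENV['AWS_SECRET_ACCESS_KEY'],
      :region                => 'us-west-1'
    )

    bucket = storage.directories.get('example-db-backups') ||
             storage.directories.create(:key => 'example-db-backups')

    # Upload the nightly dump; restores in the other region pull from here.
    bucket.files.create(
      :key  => "mysql/dump-#{Time.now.utc.strftime('%Y-%m-%d')}.sql.gz",
      :body => File.open('/var/backups/dump.sql.gz')
    )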

  • Consider not using EBS at all

In many cases, EBS volumes are not needed. Instance storage scales to 1.7TB, and although ephemeral, doesn't seem to have the kinds of problems many have been experiencing with EBS. If this fits your architecture, give it some thought.

Consider building towards a vendor-neutral architecture

We're big fans of AWS. But today raises questions about the wisdom of tying your infrastructure to one cloud provider. Heroku is an interesting example. Heroku's infrastructure piggy-backs on top of AWS, which meant that many applications were unavailable. Worse, access to the Heroku API was affected, and so users were stuck.

Architecting across multiple vendors is difficult, but not impossible. Cloud abstraction tools like Fog, and configuration management frameworks such as Chef make the task easier.

Patterns within the application architecture can also be used. If a decision has been made to use an AWS-specific tool or API, consider writing a lightweight wrapper around the AWS service, and try to build in and test an alternative provider's API, or your own implementation - or at least provide the capability of plugging one in. This prevents lock-in, and makes it much easier to deploy your systems to a different cloud should the requirement arise.
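
As an illustration of that wrapper pattern, a deliberately tiny and entirely hypothetical Ruby example: application code depends on a small BlobStore interface, and the AWS-specific details live in a single adapter that can be swapped for another provider's implementation:

    # Hypothetical example of wrapping a provider-specific storage API.

    class BlobStore
      def initialize(backend)
        @backend = backend
      end

      def put(key, io)
        @backend.put(key, io)
      end

      def get(key)
        @backend.get(key)
      end
    end

    # One adapter per provider; only this class knows about Fog/S3.
    class FogBlobBackend
      def initialize(directory)
        @directory = directory      # a Fog::Storage directory (bucket)
      end

      def put(key, io)
        @directory.files.create(:key => key, :body => io)
      end

      def get(key)
        file = @directory.files.get(key)
        file && file.body
      end
    end

Swapping providers then means writing one new backend and testing it against the same interface, rather than hunting AWS calls through the whole codebase.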

This said, I happen to hold to the view that for a smaller investment, if a client is already committed to using AWS, they can probably make use of Amazon's five regions, and design their systems around the ability to move between regions in the very rare case where multiple availability zones are impacted.

Have a DR plan, and practice it

Part of planning for failure is to know what to do when disaster strikes. When you've been paged at 3am and told that the whole site is down, and your hosting provider has no estimated time to recovery, the last thing you want to do is think. You should be on autopilot - everyone knows what to do, it's written down, it's been rehearsed, as much of it is automated as possible.

I encourage my engineers to write the plan down, somewhere accessible (and not only on the wiki that just went down). Have fire drills - pick a day, and run through the process of bringing up the DR systems, and recovering from backup. Follow the process - and improve it if you can.

Testing restores is the critical part of the process. Know how long it takes to restore your systems. If you have vast datasets that take hours to import, at least you know this in advance, and when and if you need to put the recovery plan into action, you can set expectations. Remember, though, your backups mean nothing if you haven't verified you can restore them. Make it a habit. When you need to do it for real, you'll be grateful you drilled yourself and your team.

Infrastructure as code is hugely relevant

One of the great enablers of the infrastructure as code paradigm is the ability to rebuild the business from nothing more than a source code repository, some new compute resource (virtual or physical) and an application data backup. In the case of multi-region failover, you might find that your strategy is to keep a database running, but deploy a stack, provisioned with your configuration management tool, on demand. We've tested this with CloudFormation and Chef and can bring up a simple site in five or ten minutes, and a multi-tier architecture with dozens of nodes within 30 minutes. The bottleneck is almost always the data restore - so work out ways to reduce the time taken to do this, and practice, practice, practice.
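
The shape of that 'stack on demand' approach, heavily simplified: launch instances through the API with a short bootstrap script in user data that installs the configuration management client and points it at your repository. The AMI, key pair, roles and bootstrap scripts below are all placeholders:

    require 'fog'

    compute = Fog::Compute.new(
      :provider              => 'AWS',
      :aws_access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
      :aws_secret_access_key => ENV['AWS_SECRET_ACCESS_KEY'],
      :region                => 'us-west-1'             # the region we are failing over to
    )

    # Each bootstrap script installs chef-client (or puppet) and runs it
    # against the repository, so the instance converges to its role unattended.
    %w[web web app db].each do |role|
      compute.servers.create(
        :image_id  => 'ami-00000000',
        :flavor_id => 'm1.large',
        :key_name  => 'dr-keypair',
        :user_data => File.read("bootstrap-#{role}.sh")
      )
    end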

Many people reading this will be in a position where they already have an infrastructure in place that either isn't managed with a framework such as Chef, or is only partially built. If you take nothing else from today's issues, take an action to prioritise getting to the stage where you can rebuild your whole infrastructure from a git repo and a backup. The cloud is great for this - you can practice spinning your systems up in a different region, or a different zone, as many times as you like, until you're happy with it.

The cloud (and AWS) is still great

Sadly today has brought out the worst kinds of smugness and schadenfreude from people using other cloud providers, or traditional infrastructures. These people have very short memories. Joyent, Rackspace, Savvis, all these providers have had large and public outages. As we've already said, outages are part of life - get used to it.

Some commentators have suggested that AWS has inherent weaknesses because it offers platform services beyond the basic resource provision that a simpler provider such as Linode offers. Linode is a great provider, and we've used them for years. However, I'm not sure it's as simple as that. If you've decided to deploy your application in the cloud, and you need flexible, scalable, persistent storage, or a highly available relational database, or an API-driven SMTP service, you have a choice. You can spend your time, and your developers' time, building your own and making it enterprise ready, or you can trust some of the best architects in the world to build one for you. Sometimes making your own is a better choice, but you don't get it for free: you'll be paying more for the extra machines to support it, and the staff to administer it. Personally, I'm unconvinced that trying to build and manage these ancillary systems delivers value for the organisation.

Yes, today's outage is hugely visible. Yes, it's had a massive impact on some businesses. That doesn't make the cloud bad, or dangerous. Quora made a great point by serving a maintenance page with a cute YouTube video and the following error message: “We’d point fingers, but we wouldn’t be where we are today without EC2.”

Using the cloud as part of your IT strategy is about much more than reliability. Not that EC2's reliability is bad - EC2 offers a 99.95% SLA, which is equivalent to the best managed hosting providers. The US East region that suffered so much today had a 100% record between 2009 and 2010. It should, of course, be noted that, strictly speaking, today's issues were with EBS, which is not covered by an SLA. Be wary of SLAs and figures - they can be misleading.

Making use of the cloud is about flexibility and control and scalability. It's about a different way of thinking about provisioning infrastructure that encourages better business agility, and caters for unpredictable business growth. Yes you might get better availability from traditional hardware in a managed hosting facility, but even then outages happen, and more often than not these outages can take many hours to recover from.

The cloud is about being able to spin up complete systems in minutes. The cloud is about being able to triple the size of your infrastructure in days, when your product turns out to be much more popular than you imagined. Similarly, it's about being able to shrink to something tiny, and still survive, if you misjudge the market. The cloud is about the ability to change how your infrastructure works, quickly, without worrying about sunk cost in switches or routers that you thought you might need. The cloud is about the ease with which we can provide a development environment that mirrors production, within 30 minutes, and then throw it away again. The cloud is about being able to add capacity for a big launch, and then take it away again with a mere API call. I could go on...

One, albeit major, outage in one region of one cloud vendor doesn't mean the cloud was a big con, a waste of time, a marketing person's wet dream. The emperor isn't naked, and the nay-sayers are simply enjoying their day of 'I told you so'. The cloud is here to stay, and brings with it huge benefits to the IT industry. However, it does require a different approach to building systems. The cloud is not dead - it's still great.

Summary

Today has been a tough day for businesses affected by the EC2 outage. We can take the following high-level lessons away from today:

  • Expect, and design for downtime
  • Have a DR plan, and practice it until it's second nature
  • Make it your priority to build your infrastructure as code, and to be able to rebuild it from scratch, from nothing more than a source code repository and a backup
  • The cloud is still great
