By James Turnbull (@kartar), VP of Technology Operations, Puppet Labs Inc.
A sysadmin’s time is too valuable to waste resolving conflicts between operations and development teams, working through problems that stronger collaboration would solve, or performing routine tasks that can – and should – be automated. Working more collaboratively and freed from repetitive tasks, IT can – and will – play a strategic role in any business.
At Puppet Labs, we believe DevOps is the right approach for solving some of the cultural and operational challenges many IT organizations face. But without empirical data, a lot of the evidence for DevOps success has been anecdotal.
To find out whether DevOps-attuned organizations really do get better results, Puppet Labs partnered with IT Revolution Press to survey a broad spectrum of IT operations people, software developers and QA engineers.
The data gathered in the 2013 State of DevOps Report proves that DevOps concepts can make companies of any size more agile and more effective. We also found that the longer a team has embraced DevOps, the better the results. That success – along with growing awareness of DevOps – is driving faster adoption of DevOps concepts.
DevOps is everywhere
Our survey tapped just over 4,000 people living in approximately 90 countries. They work for a wide variety of organizations: startups, small to medium-sized companies, and huge corporations.
Most of our survey respondents – about 80 percent – are hands-on: sysadmins, developers or engineers. Break this down further, and we see about 70 percent of these hands-on folks are actually in IT Ops, with the remaining 30 percent in development and engineering.
DevOps orgs ship faster, with fewer failures
DevOps ideas enable IT and engineering to move much faster than teams working in more traditional ways. Survey results showed:
- More frequent and faster deployments. High-performing organizations deploy code 30 times faster than their peers. Rather than deploying every week or month, these organizations deploy multiple times per day. Change lead time is much shorter, too. Rather than requiring lead time of weeks or months, teams that embrace DevOps can go from change order to deploy in just a few minutes. That means deployments can be completed up to 8,000 times faster.
- Far fewer outages. Change failure drops by 50 percent, and service is restored 12 times faster.
Organizations that have been working with DevOps the longest report the most frequent deployments, with the highest success rates. To cite just a few high-profile examples, Google, Amazon, Twitter and Etsy are all known for deploying frequently, without disrupting service to their customers.
Version control + automated code deployment = higher productivity, lower costs & quicker wins
Survey respondents who reported the highest levels of performance rely on version control and automation:
- 89 percent use version control systems for infrastructure management
- 82 percent automate their code deployments
Version control allows you to quickly pinpoint the cause of failures and resolve issues fast. Automating your code deployment eliminates configuration drift as you change environments. You save time and reduce errors by replacing manual workflows with a consistent and repeatable process. Management can rely on that consistency, and you free your technical teams to work on the innovations that give your company its competitive edge.
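To make the drift argument concrete, here is a minimal, hypothetical sketch of the idea: when configuration is rendered from a version-controlled template, the same inputs always produce the same output, and a checksum makes drift detection trivial. All names and the template format are illustrative, not any particular tool's API.

```python
import hashlib

def render_config(template: str, env: dict) -> str:
    """Render one config from a version-controlled template.

    The same template plus the same inputs always yields the same
    output, which is what eliminates drift between environments.
    """
    return template.format(**env)

def config_fingerprint(rendered: str) -> str:
    # A checksum makes drift easy to detect: compare the fingerprint
    # on the server against the one derived from version control.
    return hashlib.sha256(rendered.encode()).hexdigest()[:12]

template = "listen_port={port}\nlog_level={log_level}\n"
staging = render_config(template, {"port": 8080, "log_level": "debug"})
prod = render_config(template, {"port": 80, "log_level": "info"})

# Re-rendering is repeatable: identical inputs, identical fingerprint.
assert config_fingerprint(staging) == config_fingerprint(
    render_config(template, {"port": 8080, "log_level": "debug"}))
```

The repeatability is the whole point: a manual workflow can produce a different result every time, while a rendered template cannot.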
What are DevOps skills?
More recruiters are including the term DevOps in job descriptions. We found a 75 percent uptick in the 12-month period from January 2012 to January 2013. Mentions of DevOps as a job skill increased 50 percent during the same period.
In order of importance, here are the skills associated with DevOps:
- Coding & scripting. Demonstrates the increasing importance of traditional developer skills to IT operations.
- People skills. Acknowledges the importance of communication and collaboration in DevOps environments.
- Process re-engineering skills. Reflects the holistic view of IT and development as a single system, rather than as two different functions.
Interestingly, experience with specific tools was the lowest priority when seeking people for DevOps teams. This makes sense to us: It’s easier for people to learn new tools than to acquire the other skills.
It makes sense on a business level, too. After all, the tools a business needs will change as technology, markets and the business itself shift and evolve. What doesn’t change, however, is the need for agility, collaboration and creativity in the face of new business challenges.
About the author:
A former IT executive in the banking industry and author of five technology books, James has been involved in IT Operations for 20 years and is an advocate of open source technology. He joined Puppet Labs in March 2010.
So what does DevOps mean exactly? What is the Dev and what is the Ops in DevOps? The role of operations can mean a lot of things, and even different things to different people. DevOps is becoming more and more popular, but many people are confused about who does what. So let's make a list of the responsibilities operations traditionally has, then figure out what developers should be doing, and which responsibilities, if any, should be shared.
Traditional Operations duties
- IT buying
- Installation of server hardware and OS
- Configuration of servers, networks, storage, etc…
- Monitoring of servers
- Respond to outages
- IT security
- Managing phone systems, network
- Change control
- Backup and disaster recovery planning
- Manage Active Directory
- Asset tracking
Shared Development & Operations duties
- Software deployments
- Application support
Some of these traditional responsibilities have changed in the last few years. Virtualization and the cloud have greatly simplified buying decisions, installation, and configuration. For example, nobody cares what kind of server we are going to buy anymore for a specific application or project. We buy great big ones, virtualize them, and just carve out what we need and change it on the fly. Cloud hosting simplifies this even more by eliminating the need to buy servers at all.
So what part of the “Ops” duties should developers be responsible for?
- Be involved in selecting the application stack
- Configure and deploy virtual or cloud servers (potentially)
- Deploy their applications
- Monitor application and system health
- Respond to application problems as they arise
Developers who take ownership of these responsibilities can ultimately deploy and support their applications more rapidly. DevOps processes and tools eliminate the walls between the teams and enable more agility for the business. This philosophy can make developers responsible for the entire application stack, from the OS level up, in a more self-service mode.
So what does the operations team do then?
- Manage the hardware infrastructure
- Configure and monitor networking
- Enforce policies around backup, DR, security, compliance, change control, etc
- Assist in monitoring the systems
- Manage Active Directory
- Asset tracking
- Other non-production, application-related tasks
Depending on the company size the workload of these tasks will vary greatly. In large enterprise companies these operations tasks become complex enough to require specialization and dedicated personnel for these responsibilities. For small to midsize companies the IT manager and 1-2 system administrators can typically handle these tasks.
DevOps is evolving into letting the operations team focus on the infrastructure and IT policies while empowering the developers to exercise tremendous ownership from the OS level and up. With a solid infrastructure developers can own the application stack, build it, deploy it, and cover much if not all of its support. This enables development teams to be more self-service and independent of a busy centralized operations team. DevOps enables more agility, better efficiency, and ultimately a higher level of service to their customers.
About the author: Matt Watson is the Founder & CEO of Stackify. He has a lot of experience managing high growth and complex technology projects. He is focused on changing the way developers support their production applications with DevOps.
A discussion of process-based, package-based, declarative, imperative and generic approaches to application release automation.
Application Release Automation is a relatively new, but rapidly maturing, area of IT. As with all new areas, there is plenty of confusion around what Application Release Automation really is and the best way to go about it. There are those who come at it with a very developer-centric mind-set, those who embrace the modern DevOps concept, and even those who attempt to apply server-based automation tools to the application space.
Having worked with many companies of various sizes, technologies, cultures and mind-sets, both as they select an ARA (Application Release Automation) tool and as they move on to implement their chosen tool, I have had many opportunities to assess the various approaches. In this short blog I will discuss the pros and cons of each approach.
Package-based automation is a technique that was originally designed for automating the server layer. Due to its success there, some have attempted to adapt it to automate the application layer as well. Packages encapsulate all the changes that need to be performed on a single server, and can include the pre-requisite checks that need to take place as well as the post-deployment verifications. When patching a server this makes complete sense: there are no dependencies between the patched server and all the others in the same data centre, so applying all the required changes for that patch (or patches) in a bundle in one go is possible. The package can then be applied to all appropriate servers without modification. At this layer there is little difference between one Windows Server 2008 and the next, even though the applications on top may be completely different.
The benefit of this packaging approach is easy rollback: if required, the package can be rolled back to the original server state. On the other hand, it treats each server as an island with no dependency on any other server, and it assumes that all changes on that server can be done in one go. This type of automation is offered by products like BMC BladeLogic and IBM Tivoli Provisioning Manager (TPM).
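The package mechanics described above can be sketched in a few lines. This is a hedged, purely illustrative model, not any vendor's API: a package carries its pre-check and its changes, and a snapshot of the prior state gives the easy rollback.

```python
# Illustrative sketch of package-based automation: each package
# bundles a pre-requisite check and a set of changes for one server,
# and the prior state is snapshotted so rollback is trivial.

class Server:
    def __init__(self):
        self.state = {"app": "1.0", "config": "old"}

def apply_package(server, changes, precheck):
    if not precheck(server):
        raise RuntimeError("pre-requisite check failed")
    snapshot = dict(server.state)      # capture state for rollback
    server.state.update(changes)       # all changes applied in one go
    return snapshot

def rollback(server, snapshot):
    server.state = dict(snapshot)      # restore the original state

s = Server()
snap = apply_package(s, {"app": "1.1", "config": "new"},
                     precheck=lambda srv: srv.state["app"] == "1.0")
assert s.state["app"] == "1.1"
rollback(s, snap)
assert s.state == {"app": "1.0", "config": "old"}
```

Note the limitation the article points out: nothing in this model can express a dependency on another server; each server is its own island.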
Declarative-based automation comes from a similar mindset to package-based but takes a different route to the solution. It too originally came out of the need to automate the server layer, followed by an attempt to apply it to the application layer. With declarative-based automation, the desired state of the server is defined down to every individual configuration item (registry key, dll, config file entry, etc.). Most declarative-based tools require you to describe the desired state by writing what is effectively a piece of 'code'. Some solutions, for example Puppet, offer a simplified proprietary DSL (Domain Specific Language), but the DSL does not let you do everything, so Ruby is kept as a fallback. The downside is that the user has to learn at least one programming language (or you have to employ people with that knowledge already), so the automation is not readily open to non-developers. This approach also shares the downside of package-based automation, in that it assumes each server can be configured independently and all in one go. But it shares the benefits too, in that automatic rollback is conceptually a lot easier.
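The declarative idea can be shown with a tiny, hedged sketch: you state the desired end state, and an engine computes only the changes needed to reach it. This illustrates the concept behind tools like Puppet, not their actual syntax or resource model.

```python
# Conceptual sketch of declarative automation: compare current state
# against desired state and derive the minimal set of changes.

def reconcile(current: dict, desired: dict) -> dict:
    """Return only the items that must change to reach `desired`."""
    return {k: v for k, v in desired.items() if current.get(k) != v}

current = {"pkg:nginx": "absent", "file:/etc/motd": "old banner"}
desired = {"pkg:nginx": "installed", "file:/etc/motd": "old banner"}

changes = reconcile(current, desired)
assert changes == {"pkg:nginx": "installed"}   # motd already correct

current.update(changes)                        # apply, then re-check
assert reconcile(current, desired) == {}       # converged: rerun is a no-op
```

The second assertion is the key property: because you declare the end state rather than the steps, re-running the automation on an already-correct server changes nothing.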
Imperative-based automation is more familiar, as the structure of the language is closer to traditional programming languages (such as Java, C++, Perl, etc.). In this approach a programming language is used to describe what needs to be done to the target servers in a series of steps executed in a specific sequence. Chef is an example of an imperative automation tool (its recipes are written in a Ruby-based language). As with declarative-based automation, the code created (or recipe, as it is called in Chef) is still very much focused on making changes to a single isolated server, and the assumption is that those same changes will be applied to multiple servers of the same type. There is limited understanding of making dependent changes across multiple servers, because that was not required at the server layer; it only becomes important when you move up to the application layer. And of course, the current offerings still require you to be familiar with, or learn, a programming language to use them.
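For contrast with the declarative sketch, the imperative style is an explicit, ordered list of steps run against one server. The step names below are invented; the point is only that sequence, not end state, is what you author.

```python
# Minimal sketch of imperative automation: an explicit, ordered list
# of steps executed against a single server. Order matters here,
# unlike in the declarative model where only the end state is stated.

def run_steps(server_log, steps):
    for step in steps:
        server_log.append(step())

log = []
run_steps(log, [
    lambda: "stop service",
    lambda: "copy new binaries",
    lambda: "update config",
    lambda: "start service",
])
assert log == ["stop service", "copy new binaries",
               "update config", "start service"]
```

Note that, as the article says, nothing here coordinates with other servers: the recipe is a sequence for one machine, replayed identically on every machine of the same type.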
Generic and Custom-Built Approaches
Often, people try to apply generic approaches (such as PowerShell, DOS batch scripts, Perl, etc.) to the task of automation, or even write their own automation tool using a compiled language such as Java or C++. As any developer will tell you, they can go and write something that will deploy your application, using their preferred language rather than the language supported by an automation platform. And they are right: a development team can indeed write a fully capable deployment tool. The question is whether it really benefits the company to take up development time building and maintaining a deployment tool rather than focusing on the development of its own applications. Even then, they will have many issues to face in enabling parallel execution, reporting and auditing, access and permissions control, and, importantly, synchronising activity across multiple servers. The original intention of these approaches was once again focused on a specific server and not on the cross-server nature of application deployment.
Process-based automation is a different approach, created more recently to address the needs of application release automation. ARA platforms such as DeployIt and UrbanDeploy and, of course, our own tool Nolio all take this approach. These tools seek to support existing application deployment processes, the ones that operators would normally step through manually. The focus is on processes, and the tools allow an operator to define them in a visual way, with an understanding of the cross-server nature of application deployments. Let's say that you need to do something on an application's web server, then the application server, then update tables in the database, then make a second change to the application server and finally another change to the web server. With a server-centric approach (such as those discussed above) it is very hard to orchestrate activities across multiple different servers, to synchronise the changes so they happen at the right time relative to each other, and even to pass information between those servers. With a process-based system this is straightforward: you define the different server types, drop the relevant steps onto each one and draw links to define the synchronisation points (i.e. only run the database steps once the application server steps have completed).
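The web/app/database example above can be modelled as a small dependency graph: each step belongs to a server type, and the synchronisation links decide the cross-server order. This is a hedged conceptual sketch of the process idea, not Nolio's or any product's actual interface.

```python
# Sketch of process-based orchestration: steps are tagged with a
# server type ("web:", "app:", "db:") and linked by dependencies.
# A topological pass yields a valid cross-server execution order.

steps = {
    "web:disable":      [],
    "app:stop":         ["web:disable"],
    "db:update_tables": ["app:stop"],
    "app:deploy":       ["db:update_tables"],
    "web:enable":       ["app:deploy"],
}

def execution_order(graph):
    order, done = [], set()
    def visit(step):
        for dep in graph[step]:       # run prerequisites first
            if dep not in done:
                visit(dep)
        if step not in done:
            done.add(step)
            order.append(step)
    for s in graph:
        visit(s)
    return order

order = execution_order(steps)
# The database change only runs after the app server has stopped,
# and re-enabling the web server is the final step.
assert order.index("db:update_tables") > order.index("app:stop")
assert order.index("web:enable") == len(order) - 1
```

Drawing a link in a visual process designer is, in effect, adding one of these dependency edges.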
Diagram 1 – Screenshot of a Nolio deployment process, including activity on 4 different server types.
The issues mentioned above, such as parallel execution and cross-server synchronisation, should already have been dealt with by the ARA platform, rather than having to be built on top of a more generic platform. As you can see from the diagram above, with an ARA platform cross-server synchronisation is simply a case of drawing a dependency link from an action on one server type to an action on another. Now the "Stop Application" action on the "Application Server Type" will not run until "Prepare for package distribution" on the "Repository node" has completed.
In addition to this, process-based automation brings the following benefits:
- There is no need to modify the deployment process to fit an automation tool's inability to synchronise activity across multiple servers, so consistency between manual and automated approaches is maintained.
- The defined process can serve as documentation for how the deployment should be done, as it is inherently very readable.
- The audit trail is detailed and relevant, because the process is in line with how the deployment would be done manually.
- Diagnosing issues is much easier, because the process follows the expected steps.
You can read more articles on Application Release Automation on Nolio’s Blog Site.
This post is based on a new talk by @jesserobbins at QConSF 2012 (slides). Jesse is a firefighter, the former Master of Disaster at Amazon, and the Founding CEO of Opscode, the company behind Chef.
Operations at web scale is the ability to consistently create and deploy reliable software to an unreliable platform that scales horizontally. Jesse created the Velocity conference to explore how to do this, learning from companies that do it well. Google, Amazon, Microsoft, Yahoo built their own automation and deployment tools. When Jesse left Amazon he was stunned at the lack of mature tooling elsewhere. Many companies considered their tools to be “secret sauce” that gave them a competitive advantage. Opscode was founded to provide Cloud infrastructure automation. Jesse’s experience helping other companies down this road led to a set of culture hacks that will help you adopt Continuous Delivery.
Continuous Delivery is the end state of thinking and approaching a wide array of problems in a new way. Big changes to software systems that build up over long periods of time suck. A long time and lots of code changes lead to breakage that is hard to solve. The Continuous Deployment way means small amounts of code deployed frequently. Awesome in theory, but it requires organizational change. The effort is worth it however as the benefits include faster time to value, higher availability, happier teams and more cool stuff. Given this, it is surprising that Continuous Delivery has taken so long to be accepted.
Teams that do Continuous Delivery are much happier. Seeing your code live is very gratifying. You have the freedom to experiment with new things because you aren’t stuck dealing with large releases and the challenge of getting everything right in one go.
Learning about Continuous Delivery is very exciting, but the reality is that back at the office things are challenging. Organizational change is hard. Let’s consider a roadmap for cultural change. The first problem is “it worked fine in test, it’s Ops’ problem now.”
Ops likes to punish dev for this.
Tools are not enough (even really great tools like Chef!). In order to succeed you have to convince people that you can be trusted and that you want to work together. The reason for this is well understood; see, for example, Conway's law. Teams need to work together continuously, not just at deploy time.
The choice: discourage change in the interest of stability, or allow change to happen as often as it needs to. Framing this as a question is more effective than just making a statement.
Common Attributes of Web Scale Cultures
- Infrastructure as Code. This is the most important entry point, providing full-stack automation. Commodity hardware can be used with this approach, as reliability is provided in the software stack. Datacenters must have APIs; you can’t rely on humans to take action. All services including things like DNS have to follow this model. Infrastructure becomes a product, and the app dev team is the customer.
- Applications as Services. This means SOA with things like loose coupling and versioned APIs. You must also design for failure, and this is where a lot of teams struggle. Database/storage abstraction is important as well. Complexity is pushed up the stack. Deep instrumentation is critical for both infrastructure and apps.
- Dev / Ops as Teams. Shared metrics and monitoring, incident management. Sometimes it is good to rotate devs through the on-call duties so everyone gets experience. Tight integration means a set of tools that integrates tightly with all of the teams. This leads to Continuous Integration, which leads to Continuous Delivery. The Site Reliability Engineer role is important in this model so you have people that understand the system from top to bottom. Finally, thorough testing is important e.g. GameDay.
None of this is new; consider Theory of Constraints, Lean/JIT, Six Sigma, Toyota Production System, Agile, etc. You need to recognize it has to be a cultural change to make it work however. Every org will say “we can’t do it that way because…” They’re trying to think about where they are and extrapolate to this new state. It’s like an elephant (Enterprises) trying to fly. You have to give them a way to think about a way of making incremental evolutionary changes toward the goal.
Cultural change takes a long time. This is the hardest thing. Jesse’s Rule: Don’t Fight Stupid, Make More Awesome! Pick your battles and do these 5 things:
- Start small and build on trust and safety. The machinery will resist you if you try sweeping change.
- Create champions. Attack the least contentious thing first.
- Use metrics to build confidence. Create something that you can point to to get people excited. Time to value is a good one.
- Celebrate successes. This builds excitement, even for trivial accomplishments. The idea is to create arbitrary points where you can look back and see progress.
- Exploit Compelling Events. When something breaks it is a chance to do something different. “Currency to Make Change” is made available, as John Allspaw puts it.
- Small change isn’t a threat and it’s easy to ignore. Too big of a change will meet resistance, so start small.
- Just call it an experiment. Don’t present the change as an all or nothing commitment.
- Get executive sponsors, starting with your boss
- Give everyone else the credit. When people around you succeed, celebrate it.
- Give “Special Status”. This is magic. Special badges, SRE bomber jackets at Google… these things are cool and you’re giving people something they want.
- Have people with “Special Status” talk about the new awesome. Make them evangelists and create mentor programs to build an internal structure of advocates.
- Find KPIs that support change. Hacking metrics is important to drive change. Having KPIs around things like time to value is compelling. Relate shipping code to revenue.
- Track and use KPIs ruthlessly. First you show value, then you show the cost laggards pay for not making the change. This is the carrot-and-stick approach.
- Tell your story with data. Hans Rosling has a great TED talk on this topic. This is the most powerful hack. Include stories about what your competitors are doing. There’s no other way to make this work.
- Tell a powerful story
- Always be positive about people and how they overcame a problem. This is especially important with Ops people who tend to be grumpy.
- Never focus on the people who created the problem. Focus instead on the problem itself.
- Leave room for people to come to your side. Otherwise you’ll make enemies. Don’t fight stupid.
- Just wait, one will come. Things are never stable. Exploit challenges like compliance or moving to Cloud.
- Don’t say “I told you so”, instead ask “what do we do now?” Make it safe for people to decide to change.
Remember, don’t fight stupid, make more awesome!
During this talk, Chris Pinkham (former VP of IT Infrastructure for Amazon and current CEO of Silicon Valley startup Nimbula) shared his thoughts on the evolution of cloud computing and how its growth is changing the way we think about technology, infrastructure, and business. Chris is one of the world’s leading experts on cloud computing and is largely credited with bringing Amazon’s Elastic Compute Cloud (EC2) to life during his tenure at Amazon. Many experts consider EC2 the largest public cloud on the planet. It runs on an estimated 450,000 servers and hosts notable customers such as Reddit, Quora, Netflix, foursquare, and iSeatz. Last year, EC2 was believed to have generated more than $1.2 billion in revenue for Amazon.
Chris Pinkham was born in Singapore, and raised and educated in Britain and South Africa. He has co-authored two patent applications: “Managing Communications Between Computing Nodes” and “Managing Execution of Programs by Multiple Computing Systems”. Chris created and ran the first commercial ISP in South Africa, Internet Africa, which he sold to UUNET in 1996. The company, now owned by MTN, remains one of the largest ISPs on the African continent. Later, Chris joined Amazon.com as Vice President, IT Infrastructure, where he was responsible for the company’s global infrastructure engineering and operations. While in this role, he conceived, proposed and, together with Willem Van Biljon, built Amazon’s Elastic Compute Cloud (EC2), the highly successful public cloud service.
In 2006, Chris left Amazon Web Services and has since started a new venture with his longtime friend Willem. The company, Nimbula, is focused on cloud computing software and is funded by Sequoia Capital and Accel Partners.
In the world of DevOps, there are a few guys who need no introduction. One of them is @botchagalupe. Instead of live blogging a talk he did today at Build a Cloud Day Chicago, I thought I’d just post the video here. Enjoy!
Puppet Labs is winding down the whirlwind tour of Puppet Camps for the first half of 2012. Prior to 2012, the most camps we had ever done in a year was two. This year, we’ve had 12 already, with roughly 100 people at each of them. The last one of this leg is in the Windy City: Chicago, IL.
Puppet Camps have evolved from the main annual get-together for the Puppet faithful to regional events with hundreds of people at each. We’ve had amazing talks on scaling Puppet and Puppet in the cloud, launched PuppetDB, discussed roadmaps, and of course debated practices surrounding operations in countless ways.
At Boston camp last month, we had talks on Cloud usage with Puppet, cross-host configuration modelling prototypes, hiring in a DevOps world, using Puppet with Jenkins to deploy large developer workloads and a bunch of hallway time to answer questions. Of course there were some after-hours activities as well.
Chicago is proving no different. In this case, we have some interesting things happening. I’ll be there to give a ‘lay of the land’ talk about what’s happening inside Puppet Labs from a development perspective and what the community has been excited and concerned about. When I talk, I’m a ‘no question off limits’ kind of speaker, so the talks sometimes take a left turn here and there, but I tend to enjoy that.
Giving his first presentation at a Puppet Camp this year is Ryan Coleman, our newest Product Owner at Puppet Labs. He’s in charge of the product and experience with Puppet Modules and our Forge. Ryan has a long road of improvements ahead of him, but is filled with passion and ideas. You may get to have a real impact on the direction of this space just by spending five minutes with him. Ryan came to this role from Professional Services, where he worked with customers on Puppet installation, taught classes and wrote educational materials.
As we head into PuppetConf (our big conference, September 27-28 in San Francisco), you might get some preview material at the Chicago Puppet Camp. We look forward to seeing you there!
Time to throw some more buzzwords at you. Nothing makes people’s eyes roll back more quickly than saying DevOps and Cloud in the same sentence. This is going to be all about Cloud and DevOps.
The theory of DevOps is predicated on the idea that all elements of a technology infrastructure can be controlled through code. Without cloud, that can’t be entirely true: someone has to rack the servers in the data center, and so on. In pure cloud operations we reach a sort of nirvana where everything relating to technology is controllable purely through code.
A key consequence of DevOps plus Cloud, following from the statements above, is that everything becomes repeatable. At some level, that is the point of DevOps: you take the repeatable elements of Dev and apply them to Ops. Starting a server becomes a repeatable, testable process.
Scalability is another thing you get with DevOps + Cloud: it lets you increase the server-to-admin ratio. No longer is provisioning a server a long set of steps stored in a black binder somewhere; it is kept as actual software. This reduces errors tremendously.
DevOps + Cloud is self-healing. What I mean by that is that when your entire infrastructure is governed by code, that code can become aware of anything going wrong within its domain. You can detect VM failures and automagically bring up a replacement VM, for example, and you know the replacement will work the way you designed it. No more 3:00 AM pager alerts for many of your VM failures, because this is all handled automatically. Some organizations even go so far as to run “chaos monkeys”: agents whose whole job is to wreak havoc, just to ensure and prove that the system is self-healing.
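The self-healing loop reduces to a simple control pattern: observe fleet health, compare against desired capacity, and launch replacements for dead instances. This hedged sketch is purely illustrative; a real implementation would call a cloud provider's API rather than build dicts.

```python
# Toy self-healing loop: drop unhealthy VMs and launch replacements
# until the fleet is back at the desired capacity.

import itertools

_ids = itertools.count(1)

def launch_vm():
    # Stand-in for a cloud API call that boots a fresh instance.
    return {"id": next(_ids), "healthy": True}

def heal(fleet, desired_count):
    """Return a fleet with failed VMs replaced, at full capacity."""
    alive = [vm for vm in fleet if vm["healthy"]]
    while len(alive) < desired_count:
        alive.append(launch_vm())      # replacement, no 3 AM page
    return alive

fleet = [launch_vm(), launch_vm(), launch_vm()]
fleet[1]["healthy"] = False            # simulate a VM failure

fleet = heal(fleet, desired_count=3)
assert len(fleet) == 3
assert all(vm["healthy"] for vm in fleet)
```

Run this loop on a schedule (or from a monitoring trigger) and VM failure becomes an event the system absorbs rather than a page a human answers.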
Continuous integration and deployment. This means that your environments are never static. There is no fixed production environment with change management processes that govern how you get code into that environment. The walls between dev and ops fall down. When you fully go DevOps and Cloud, you are running CI and delivering code continuously, not just at the application level but at the infrastructure level.
DevOps needs these buzzwords. The reason is that DevOps only succeeds insofar as you are able to manage things through code. Anywhere you drop back to human behavior, you get away from the value proposition DevOps represents. What cloud and NoSQL bring to the mix is the ability, or ease, to treat every element of your infrastructure as a programmable component that can be managed through DevOps processes.
I will start off with NoSQL. I am not saying that you can’t automate RDBMS systems; it is just quite a bit easier to automate NoSQL systems because of their properties. It will make more sense in a bit. Let’s talk CAP theorem. CAP theorem is an overarching theorem about how you can manage consistency, availability and partition tolerance.
- Consistency – how consistent is your data for all readers in a system? Can one reader see a different value than another at times, and for how long?
- Availability – how tolerant is the entire system of the failure of a single node?
- Partition tolerance – can the system continue to operate despite message loss between partitions?
Now, all that was an oversimplification, but let’s just go with it for now. CAP theorem says that you can’t have more than two of the aforementioned properties at once. Relational DBs are essentially focused on consistency. It is very hard (effectively impossible) to have a relational system with nodes in Chicago and London that is perfectly available and consistent. These distributed setups present some of the largest failure scenarios we see in traditional systems. These failures are the types of things we would want to handle automatically with a self-healing system. Due to the properties of RDBMS systems, as we will see, this can be very difficult.
The deployment of a relational DB is fairly easy; this can be automated well enough. What gets hard is scaling your reads based on autoscaling metrics. Let’s say I have my reads spread across my slaves and I want the number of slaves to go up and down based on demand. The problem is that each slave brought up needs to pull a ton of data in order to meet the consistency requirements of the system. This is not very efficient.
NoSQL systems go for something called eventual consistency. Most data that you are dealing with on a day-to-day basis can be eventually consistent. If, for example, I update my Facebook status, delete it, and then add another, it is OK if some people saw the first update and others only saw the second. Lots of data is this way. If you rely on eventual consistency, things become a lot easier.
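The status-update example can be made concrete with a toy model of eventual consistency: two replicas receive the same updates in different orders, yet converge once every update has reached every node. This is a sketch of the concept only (using last-writer-wins on a version number), not how any particular NoSQL store resolves conflicts.

```python
# Toy eventual consistency: replicas apply updates in different
# orders but converge, because each update carries a version and
# the highest version wins (last-writer-wins).

def apply_update(replica, key, value, version):
    cur = replica.get(key)
    if cur is None or version > cur[1]:
        replica[key] = (value, version)

a, b = {}, {}
updates = [("status", "first post", 1), ("status", "second post", 2)]

for key, value, ver in updates:            # replica A sees 1 then 2
    apply_update(a, key, value, ver)
for key, value, ver in reversed(updates):  # replica B sees 2 then 1
    apply_update(b, key, value, ver)

# Different delivery orders, same final state: that is "eventual".
assert a == b == {"status": ("second post", 2)}
```

In the interim, a reader of replica B could have seen the newer status while a reader of replica A still saw the older one, which is exactly the trade the Facebook-status example says is acceptable.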
NoSQL systems by design can deal with node failures. On the SQL side, if the master fails your application can’t do any writes until you solve the problem by bringing up a new master or promoting a slave. Many NoSQL systems have a peer to peer relationship among their nodes. This lack of differentiation makes the system much more resistant to failure and much easier to reason about. Automating the recovery of these NoSQL systems is far easier than the RDBMS, as you can imagine. At the end of the day, tools with these properties – being easy to reason about and simple to automate – are exactly the types of tools that we should prefer in our CloudDevOps type environments.
Cloud, for the purposes of this discussion, is sort of the pinnacle of SOA in that it makes everything controllable through an API. If it has no API, it is not Cloud. If you buy into this, then you agree that everything that is Cloud is ultimately programmable.
Virtualization is the foundation of Cloud, but virtualization is not Cloud by itself. It certainly enables many of the things we talk about when we talk Cloud, but it is not necessarily sufficient to be a cloud. Google App Engine is a cloud that does not incorporate virtualization. One of the reasons that virtualization is great is that you can automate the procurement of new boxes.
The Cloud platform has 2 key components that turn virtualization into Cloud. One of them is locational transparency. When you go into a vSphere console you are very aware of where the VM sits. With a cloud platform you essentially stop caring about that stuff. You no longer have to care about where things lie, which means that topology changes become much easier to handle in the scaling and failure cases. Programming languages like Erlang have made heavy use of this property for years and have proven it highly effective. Let’s talk now about configuration management.
Configuration management is one of the most fundamental elements allowing DevOps in the cloud. It allows you to have different VMs that have just enough OS that they can be provisioned, automatically through virtualization, and then through configuration management can be assigned to a distinct purpose within the cloud. The CM system handles turning the lightly provisioned VM into the type of server that it is intended to be. Orchestration is also a key part of management.
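The CM idea described here – turning a just-enough-OS node into a specific server type, repeatably – can be sketched roughly like this (the roles, package names, and data shapes are invented for illustration, not any real CM tool's API):

```python
# Hypothetical sketch of CM convergence: a role declares desired state,
# and applying it makes a lightly provisioned node match that state.
ROLES = {
    "webserver": {"packages": {"nginx"}, "services": {"nginx"}},
    "db":        {"packages": {"postgresql"}, "services": {"postgresql"}},
}

def converge(node, role):
    """Bring `node` (a dict standing in for a VM) to the role's state."""
    desired = ROLES[role]
    node.setdefault("packages", set()).update(desired["packages"])
    node.setdefault("services", set()).update(desired["services"])
    node["role"] = role
    return node

vm = {"os": "linux"}       # just-enough-OS VM from the virtualization layer
converge(vm, "webserver")  # CM assigns it a distinct purpose
converge(vm, "webserver")  # applying it again changes nothing (idempotent)
```

The second `converge` call being a no-op is the important property: real CM systems describe the end state, not a one-shot script, so they can run repeatedly and self-correct drift.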
Orchestration is the understanding of the application-level view of the infrastructure. It is knowing which apps need to communicate and interact in what ways. The orchestration system, armed with this knowledge, can then provision and manage nodes in a way consistent with that knowledge. This part of the system also does policy enforcement, to please all the governance folks. Moving to an even higher level of abstraction and opinionatedness, we will talk about Platform Clouds (PaaS) in the next section.
Platform Clouds (PaaS)
Platform clouds are a bit different. They are not VM focused but instead focus on providing resources and containers for automated, scalable applications. Some examples of the types of resources provided by a PaaS system are database platforms, including both RDBMS and NoSQL resource types. The platform allows you to spin up instances of these sorts of resources through an API, on demand. Amazon SimpleDB is an example of this. Messaging components – things that manage communication between components – are yet another example of the type of service these systems provide. The management of these platforms, and of the resources provisioned within them, is handled by the cloud orchestration and management systems. The systems are also highly effective at providing containers for user-created resources.
The real power here is that you can package up your application – say a WAR file or a Ruby on Rails app – and just hand it off to the cloud, which has specific containers for encapsulating your code and managing things like security, fault tolerance, and scalability. You don’t have to think about it.
One caveat to be aware of, though, is vendor lock-in. Moving off one of these platforms, should you rely heavily on its services and resources, can be very difficult and require a lot of refactoring.
Cloud and DevOps
Cloud with your DevOps offers some fantastic properties: the ability to leverage, in your infrastructure, all the advancements made in software development around repeatability and testability; the ability to scale up as needed in real time (autoscaling); and, among other things, the power of self-healing systems. DevOps is better with Cloud.
Email: george.reese at enstratus dot com
This post was live blogged by @martinjlogan so expect errors.
This talk is about how to overcome organizational hurdles and get DevOps humming in your org. This illustrates how we did it at DRW Trading.
DRW needed to adjust. The problem was that we were not exposing people to problems upfront. Everyone was only exposed to their local problems and only optimized locally. We looked, and continue to look, at DevOps as our tool to change this.
[Seth is talking a bit about the lessons that were learned at DRW that can really be applied at all levels in the org.]
The first thing you need to do if you are introducing DevOps to your org is define what DevOps is to you. Gartner has an interesting definition – not sure it reflects our opinions, but at least they are trying to figure it out. At DRW we use the words “agile operations” and DevOps interchangeably. We are integrating IT operations with agile and lean principles: fast iterative work, embedding people on teams, and moving people as close to the value they are delivering as possible. DevOps is not a job, it is a way of working. You can have people in embedded positions using these practices as easily as folks on shared teams.
The next thing you need to do is focus on the problem that you are trying to solve. This is obvious but not all that simple. Here is an example. We had a complaint from our high frequency trading folks last year saying that servers were not available fast enough. It took on average 35 days for us to get a server purchased and ready to run. Dan North and I were reading the book “The Goal” – a book I highly recommend, a really good read. It talks about the theory of constraints and applying lean principles to repeatable processes. We applied a technique called value stream mapping to our server delivery process. People complained that I [Seth] was a bottleneck because I had to approve all server purchases. It turned out I only take 2 hours to do that. The real problem lay elsewhere. The value stream mapping allowed us to see where our bottlenecks were, so that we could focus on the real bottlenecks and not waste cycles on less productive areas. We zeroed in accurately and reduced the time from 35 days to 12.
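Value stream mapping in miniature: the 35-day total and the roughly 2-hour approval step come from the talk, while the other step durations below are invented purely to show how the map points at the real bottleneck rather than the visible one:

```python
# Hypothetical value stream map for server delivery. Only the 35-day
# total and the ~2-hour approval step are from the talk; the rest of
# the numbers are illustrative.
HOURS_PER_DAY = 24
steps = {
    "approval":           2,                      # Seth's sign-off: not the bottleneck
    "procurement":        14 * HOURS_PER_DAY,     # hypothetical
    "delivery":           12 * HOURS_PER_DAY,     # hypothetical
    "rack_and_provision": 9 * HOURS_PER_DAY - 2,  # hypothetical remainder
}
total_days = sum(steps.values()) / HOURS_PER_DAY  # should come to 35
bottleneck = max(steps, key=steps.get)            # where to actually focus
```

The intuition-driven complaint ("approval is slow") accounts for 2 hours of 840; the map sends the improvement effort to the longest step instead.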
The third cultural lesson, and an important one, is: keep your specialists. One of the worst things that can happen is that you introduce a lot of general operators and then the network team, for example, says “wow, you totally devalued me,” and they quit. You lose a lot of expertise that turns out to be quite useful. Keep your specialists in the center. You want to highlight the tough problems to the specialists and leverage them for solving those problems. Introducing DevOps can actually open the floodgates for more work for the people in the center. We endeavored to distribute Unix system management to reduce the amount of work for the Unix team itself. This got people all across the org a bit closer to what was going on in this domain. What actually happened is that the Unix team was hit harder than ever. As we got people closer to the problem, the demand that we had not been able to notice previously increased quite a bit. This is a good problem to have, because you start to understand more of what you are trying to do and you get more opportunities to innovate around it.
If you look at a traditional org, oftentimes these specialist teams spend time justifying their own existence. They invent their own projects and do things no one needs. These days at DRW we find that we have long shopping lists of deep Unix things that we actually need. The Unix specialists are now constantly working on key useful features. We are always looking for more expert Unix admins.
The last lesson learned, a painful one, is that people have to buy in. The CIO can’t just walk in and say you have to start doing DevOps. You can’t force it. We made a mistake recently; we learned from it and turned it into a success. A few months ago we were looking at source control usage. The infrastructure teams were not leveraging this stuff enough for my taste, among other things. I said we need to get these guys pairing with a software engineer. I forced it. It went along these lines: the person doing the pairing was not teaching the person they were pairing with. They were instead just focused on solving the problem of the moment. The person being paired with was not bought in to even doing the pairing in the first place. People resented the whole arrangement.
We took a hard retrospective look at this and, in the end, we practiced iterative agile management and changed course. I worked with Dan North, who came from a software engineering background and who also had a lot of DevOps practice. A key thing about Dan is that he loves to teach and coach other people. The fact that he loved coaching was a huge help. Dan sat with folks on the networking team and got buy-in from them. He got them invested in the changes we wanted to make. The head of the networking team is now learning Python and using version control. Now the network team is standing up self-service applications that are adding huge value for the rest of the organization and making us much more efficient.
Some lessons learned from the technology
Ok, so Seth has covered a lot of the cultural bits and pieces. Now I [Chris Read] will talk about the technical lessons, or at least lessons stemming from technical issues. What follows are a few examples that have reinforced some of the cultural things we have done. The first one is the story of the lost packet. This happened within the first month or 2 of my joining. We had an exchange sending out market data, through a few hops, to a server that every now and again loses market data. We know this because we can see gaps in the sequence numbers.
The first thing we did was check the exchange to see if it was actually mis-sequencing the data. Nope, that was not the problem. So then the dev team went down to check the server itself. The Unix team looks at the machine, the IP stack, the interfaces, etc. – they declared the machine fine. Next the network guys jump in and see that everything is fine there. The server, however, was still missing data. So we jump in and look at the routers. Guess what – everything looks fine. This is where I [Chris Read] got involved. This problem is what you call the call center conundrum: people focus on small parts of the infrastructure, and with the knowledge they have, things look fine. I got in, and luckily in previous lives I have been a network admin and a unix admin. I dug in and could see that the whole network up to the machine was built with high availability pairs. I dug into these pairs. The first ones looked good. I looked into more, and finally got down to one little pair at the bottom – and there was a different config on one of the machines. A single-line problem. Fixing this solved it. It was only through having a holistic view of the system, and having the trust of the org to get onto all of these machines, that I was able to find the problem.
The next story is called “monitoring giants”. This also happened quite early in my time at DRW, and it taught me a very interesting lesson. I had been in London for 6 weeks and lots of folks were talking about monitoring. We needed more monitoring. I set up a basic Zenoss install and other such things. I came to Chicago and my goal was to show the folks here how monitoring was done, meaning to inspire the Chicago folks. I went to show them things about monitoring and was met with a fairly negative response. The guys perceived my work as a challenge on their domain. My whole point in putting this together was lost. I learned the lesson of starting to work with folks early on and being careful about how you present things. It was also a lesson on change. It is only in the last couple of months that I have learned how difficult change can be for a lot of people. You have to take this into account when pushing change. Another bit of this lesson is that you need to make your intentions obvious – over-communicate.
We actually think it is ok to recreate the wheel if you are going to innovate. What is not ok is to recreate it without telling the folks that currently own it. – Seth Thompson.
The next lesson is about DNS. This one was quite surprising to me. It is all about unintended consequences. Our DNS services used to handle a very low number of requests. As we started introducing DevOps there was a major ramp-up in DNS requests per second. We were not actually monitoring it, though. All of a sudden people started noticing latency. People started to say “hey, why is the Internet slow?”. Network people looked at all kinds of things, and then the problem seemed to solve itself. We let it go. Then a few weeks later: outage! The head of our Windows team noticed that one host was doing 112k lookups per second. Some developers had written a monitoring script that did a DNS lookup in a tight loop. We have now added all this to our monitoring suite. Because the Windows team had been taught about network monitoring and log file analysis – because they had been exposed – they were able to catch and fix this problem themselves.
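A hedged sketch of the kind of fix that prevents this failure mode: cache the resolution for a TTL instead of resolving on every loop iteration (the resolver callback, hostname, and class here are placeholders, not the actual script from the story):

```python
import time

class CachingResolver:
    """Cache a lookup result for `ttl` seconds instead of resolving on
    every call - the fix for a monitoring loop that hammers DNS."""
    def __init__(self, resolve, ttl=60.0):
        self.resolve, self.ttl = resolve, ttl
        self.cache = {}    # name -> (address, expiry)
        self.lookups = 0   # how many real resolutions happened

    def lookup(self, name, now=None):
        now = time.monotonic() if now is None else now
        cached = self.cache.get(name)
        if cached and cached[1] > now:
            return cached[0]           # fresh cache hit: no DNS traffic
        self.lookups += 1
        addr = self.resolve(name)      # real resolution (e.g. gethostbyname)
        self.cache[name] = (addr, now + self.ttl)
        return addr

# Stand-in resolver; a real one would call socket.gethostbyname.
resolver = CachingResolver(lambda name: "10.0.0.1", ttl=60.0)
for _ in range(112_000):               # the old tight loop...
    resolver.lookup("api.internal", now=0.0)
```

With the cache in place, the 112k-iteration loop produces a single real lookup instead of 112k per second against the DNS servers.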
Quick summary of the lessons
Communication is key. You must spend time with the people you are asking to change the way they work.
Get buy-in, don’t push. As soon as you push something onto someone, they will push back. Something will break, someone will get hurt. You need to develop a pull – they must pull change from you; they must want it.
Keep iterating. Keep getting better and make room for failure. If people are afraid of mistakes, they won’t iterate.
Finally, change is hard. Change is hard, but it is the only constant. As you are developing you will constantly change. Make sure that your organization and your people are geared toward healthy attitudes about change.
Question: Can you talk a little bit more about buy-in?
Answer: One of the most important things about getting buy-in is to prove your changes out for them. Try things on a smaller scale – prototypes of process or technology – get a success, and hold it up as an example of why it should be scaled out further.
The way we do things in production is not always the right way to do things. Coming here to a conference like Camp DevOps and listening to folks like Jez Humble is kind of like coming to Church and reupping your faith in what’s right!
Handcrafted is great for a lot of things: furniture, clothes, and shoes. The imperfections give a thing character. Handcrafted, however, has no place in the datacenter. Services are like appliances. Imagine that you run a laundromat. Would you rather have a dozen different machines that all need to be repaired in different ways by different people, or one industrial-strength uniform design for each unit?
In Groupon’s infancy, in order to get started quickly, we outsourced all operations. We have gotten to the scale, though, where the expertise of those we outsourced to is not sufficient for our current needs. As a result we have now brought it in house. Given this, we needed a way to manage our infrastructure efficiently, and with minimal errors, under constant change.
In Sept 2010 we had about 100 servers in one datacenter. Many of them were handcrafted. That was ok, though, because someone else worked on them. Today we have over 1000 servers in 6 locations. As the service has grown we have felt the pain of a shaky foundation under our platform. That is the driver behind developing this project – Roller. Roller really embodies the DevOps mindset.
The DevOps mindset is typified by folks who love developing software and who are also interested in Linux kernels and such – and vice versa. I am one such person. At Groupon I work across many different areas. I started my career at Yahoo in 1999. I also co-founded a company called Dippidy. I left there and worked for Symantec. Each time I have worn a different hat. Enough about me, though – let’s dig into Roller.
I won’t be giving a philosophical talk but will instead get you into the nuts and bolts of Roller [I will summarize this in this live blog – see the slides for more details]. If you have any preconceived notions about how host config and management should be done, please try to forget them, as this project is quite different. This project is on track to be open sourced by Groupon sometime in the first half of next year.
So, what does Roller do? It installs software. Really, what is a server? It is a computer that has some software on it. Roller installs this software. It facilitates versioning your servers in a super clean way. It allows for perfect consistency across your data center. It handles everything from basic system utils like strace to application deployment. They are all the same – just files on a disk.
You are probably asking yourself why this guy is up here reinventing the wheel. Why do this? We already have Chef and Puppet, why bother? Well, we wanted this to be very lightweight. Some existing solutions require message queues, relational DBs, and strange languages that are not already on the system. We also needed to deal with platform-specific differences. We have 4 different varieties of Linux. The big thing, though, is we wanted a system that was dead simple and auditable. A lot of the systems now give you tons of power: inheritance hierarchies like webserver -> apache webserver -> some config of that server, etc. That looks great from a programmer-brain perspective, but in production this complexity can cause unwanted side effects and problems. We wanted to build a system that was gated on a source code repo commit. We wanted any change in the system to go through git or some other VCS.
There are 4 parts to Roller:
1. The configuration repository.
2. The config server.
3. Packages.
4. The “roll” program.
The configuration repository is a collection of yaml files. The config server sits in each data center. It is a Ruby on Rails web app with no database; it provides views of the data stored in the config repository. Packages are precompiled chunks of software – for instance an Apache package, or some other application. A package is just a tarball. Packages are stored on and distributed from the config server. Config servers in different datacenters use S3 to distribute packages. We put a package on one config server and it is worldwide in about a minute. Finally we have “roll”. This is what we execute on a host – a blank machine, perhaps – to turn it into a specific appliance.
The config repo contains simple files that have within them information about datacenters. It also contains host classes – basically configurations of particular host types. Host classes are just like defining a class in Ruby or some other language that supports classes. The config repo is basically a tree of a fixed depth – 2.5 levels and no deeper. The leaf nodes are the host files, contained in the hosts directory within the config repo; these define configuration at the host level. A host file names a hostclass and pins a version for that particular host; the hostclass itself does not contain a version.
We have spoken a lot about these yaml files. They are the world for Roller. Now, to make use of them we need the config server. The config server is a Rails app that gives us views of the config repo data. We get to see the yaml config, we can see which hosts are using a particular bit of config, and we can diff configs to see what changed. A nice thing about this is that you can just run curl commands to investigate the system.
I can also use curl to investigate host classes. Config server just pulls these things out from a git repo and sends them back. This creates a nice http bridge into our running system. This has a lot of value. We will see this with roll.
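The curl-style investigation might look roughly like this; the hostname and endpoint paths below are our guesses, since the talk only says the views are plain HTTP over the config repo:

```python
# Hypothetical sketch of the "curl the config server" idea. The real
# endpoint paths are not given in the talk, so these URLs are invented;
# the point is that every view of the config repo is plain HTTP.
from urllib.parse import urljoin

CONFIG_SERVER = "http://config.dc1.example.com/"

def host_url(hostname):
    """View of one host's leaf-node yaml file."""
    return urljoin(CONFIG_SERVER, f"hosts/{hostname}.yaml")

def hostclass_url(name, version):
    """View of a versioned hostclass definition."""
    return urljoin(CONFIG_SERVER, f"hostclasses/{name}/{version}.yaml")

# With a real server you would fetch these with curl or urllib.request:
#   curl http://config.dc1.example.com/hosts/web042.yaml
url = host_url("web042")
```

Because everything is just HTTP GETs over files in git, any tool that speaks HTTP – curl, a browser, a script – can inspect the running system's configuration.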
Roller’s “roll” program executes on a host. It runs HTTP fetches, just like you would do with curl, fetching the host and hostclass yaml files from the config server. It then downloads any packages that it does not have. It then prepares a new /usr/local directory candidate. It generates configs. It stops services, moves the new /usr/local into place, then starts the services. Basically each time it nukes the host, starting from a known base state. Roller essentially owns /usr/local. This is kind of a nuclear solution. We are not quite re-imaging the whole host, but it is still fairly brutal.
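The roll cycle as just described can be sketched as follows; the function names are ours, not Roller's, and each step only logs itself so the ordering (services stopped just around the directory swap) is visible:

```python
# Sketch of the roll cycle described in the talk. Each step appends to
# a log; the callables stand in for the real fetch/build/swap work.
def roll(fetch_config, download_missing, build_candidate,
         stop_services, swap_usr_local, start_services):
    log = []
    cfg = fetch_config();   log.append("fetched host + hostclass yaml")
    download_missing(cfg);  log.append("downloaded missing packages")
    build_candidate(cfg);   log.append("built /usr/local candidate")
    stop_services();        log.append("stopped services")
    swap_usr_local();       log.append("swapped /usr/local into place")
    start_services();       log.append("started services")
    return log

noop = lambda *args: {}  # placeholder for each real step
log = roll(noop, noop, noop, noop, noop, noop)
```

The design point this makes concrete: all the slow work (fetching, downloading, building the candidate directory) happens while services are still up, which is why the cycle takes 10 to 30 seconds but the services themselves are only down for a few.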
This whole cycle typically takes 10 to 30 seconds; the actual services are normally down for just a few seconds.
This is a Roller package that is included in every hostclass. It adds a cron entry when installed on a host. Every x minutes it trues up its users with the config repository via a config host. This is how we do basic user management. You can use this to manage your own profiles and user directories, to get your .profile or emacs config or whatever you want on all the hosts you have access to.
Wrapping it up
On Twitter: @thenobot
steinkamp.us/campdevops_notes.pdf is where you can get notes on the presentation.