By James Turnbull (@kartar), VP of Technology Operations, Puppet Labs Inc.
A sysadmin’s time is too valuable to waste resolving conflicts between operations and development teams, working through problems that stronger collaboration would solve, or performing routine tasks that can – and should – be automated. Working more collaboratively and freed from repetitive tasks, IT can – and will – play a strategic role in any business.
At Puppet Labs, we believe DevOps is the right approach for solving some of the cultural and operational challenges many IT organizations face. But without empirical data, a lot of the evidence for DevOps success has been anecdotal.
To find out whether DevOps-attuned organizations really do get better results, Puppet Labs partnered with IT Revolution Press to survey a broad spectrum of IT operations people, software developers and QA engineers.
The data gathered in the 2013 State of DevOps Report proves that DevOps concepts can make companies of any size more agile and more effective. We also found that the longer a team has embraced DevOps, the better the results. That success – along with growing awareness of DevOps – is driving faster adoption of DevOps concepts.
DevOps is everywhere
Our survey tapped just over 4,000 people living in approximately 90 countries. They work for a wide variety of organizations: startups, small to medium-sized companies, and huge corporations.
Most of our survey respondents – about 80 percent – are hands-on: sysadmins, developers or engineers. Break this down further, and we see more than 70 percent of these hands-on folks are actually in IT Ops, with the rest in development and engineering.
DevOps orgs ship faster, with fewer failures
DevOps ideas enable IT and engineering to move much faster than teams working in more traditional ways. Survey results showed:
- More frequent and faster deployments. High-performing organizations deploy code 30 times faster than their peers. Rather than deploying every week or month, these organizations deploy multiple times per day. Change lead time is much shorter, too. Rather than requiring lead time of weeks or months, teams that embrace DevOps can go from change order to deploy in just a few minutes. That means deployments can be completed up to 8,000 times faster.
- Far fewer outages. Change failure drops by 50 percent, and service is restored 12 times faster.
Organizations that have been working with DevOps the longest report the most frequent deployments, with the highest success rates. To cite just a few high-profile examples, Google, Amazon, Twitter and Etsy are all known for deploying frequently, without disrupting service to their customers.
Version control + automated code deployment = higher productivity, lower costs & quicker wins
Survey respondents who reported the highest levels of performance rely on version control and automation:
- 89 percent use version control systems for infrastructure management
- 82 percent automate their code deployments
Version control allows you to quickly pinpoint the cause of failures and resolve issues fast. Automating your code deployment eliminates configuration drift as you change environments. You save time and reduce errors by replacing manual workflows with a consistent and repeatable process. Management can rely on that consistency, and you free your technical teams to work on the innovations that give your company its competitive edge.
What are DevOps skills?
More recruiters are including the term DevOps in job descriptions. We found a 75 percent uptick in the 12-month period from January 2012 to January 2013. Mentions of DevOps as a job skill increased 50 percent during the same period.
In order of importance, here are the skills associated with DevOps:
- Coding & scripting. Demonstrates the increasing importance of traditional developer skills to IT operations.
- People skills. Acknowledges the importance of communication and collaboration in DevOps environments.
- Process re-engineering skills. Reflects the holistic view of IT and development as a single system, rather than as two different functions.
Interestingly, experience with specific tools was the lowest priority when seeking people for DevOps teams. This makes sense to us: It’s easier for people to learn new tools than to acquire the other skills.
It makes sense on a business level, too. After all, the tools a business needs will change as technology, markets and the business itself shift and evolve. What doesn’t change, however, is the need for agility, collaboration and creativity in the face of new business challenges.
About the author:
A former IT executive in the banking industry and author of five technology books, James has been involved in IT Operations for 20 years and is an advocate of open source technology. He joined Puppet Labs in March 2010.
This is a Big Data talk with Monitoring as the context. The problem domain includes operational management (performance, errors, anomaly detection), triaging (Root Cause Analysis), and business monitoring (customer behavior, click stream analytics). Customers of Monitoring include dev, Ops, infosec, management, research, and the business team. How much data? In 2009 it was tens of terabytes per day, now more than 500 TB/day. Drivers of this volume are business growth, SOA (many small pieces log more data), business insights, and Ops automation.
The second aspect is Data Quality. There are logs, metrics, and events with decreasing entropy in that order. Logs are free-form whereas events are well defined. Veracity increases in that order. Logs might be inaccurate.
There are tens of thousands of servers in multiple datacenters generating logs, metrics and events that feed into a data distribution system. The data is distributed to OLAP, Hadoop, and HBase for storage. Some of the data is processed in real time, while other activities, such as OLAP metric extraction, are not.
How do you make logs less “wild”? Typically there are no schemas, types, or governance. At eBay they impose a log format as a requirement. The log entry types include open and close markers for transactions, with times for transaction begin and end, a status code, and arbitrary key-value data. Transactions can be nested. Another type is the atomic transaction, and there are also types for events and heartbeats. They generate 150TB of logs per day.
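As a sketch, a typed log format along these lines might look like the following. The field names and helper functions here are invented for illustration, not eBay’s actual schema:

```python
import json
import time

def open_txn(name, **kv):
    """Emit an 'open' entry marking the start of a (possibly nested) transaction."""
    return {"type": "open", "txn": name, "ts": time.time(), "data": kv}

def close_txn(name, status, **kv):
    """Emit a 'close' entry carrying the transaction's status code."""
    return {"type": "close", "txn": name, "ts": time.time(), "status": status, "data": kv}

def atomic_txn(name, status, duration_ms, **kv):
    """Emit a single entry for a transaction with no nested children."""
    return {"type": "atomic", "txn": name, "ts": time.time(),
            "status": status, "duration_ms": duration_ms, "data": kv}

def heartbeat(host):
    return {"type": "heartbeat", "host": host, "ts": time.time()}

# A nested transaction: a page render that wraps a database call.
entries = [
    open_txn("render_page", url="/item/123"),
    atomic_txn("db_query", status=0, duration_ms=12, table="items"),
    close_txn("render_page", status=200),
]
for e in entries:
    print(json.dumps(e))
```

The win over free-form logs is that downstream consumers can parse, type-check, and aggregate entries without per-application regexes.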
Large Scale Data Distribution
The hardest part of distributing such large amounts of data is fault handling. It is necessary to be able to buffer data temporarily, and handle large spikes. Their solution is similar to Scribe and Flume except the unit of work is a log entry with multiple lines. The lines must be processed in correct order. The Fault Domain Manager copies the data into downstream domains. It uses a system of queues to handle the temporary unavailability of a destination domain such as Hadoop or Messaging. Queues can indicate the pressure in the system being produced by the tens of thousands of publisher clients. The queues are implemented as circular buffers so that they can start dropping data if the pressure is too great. There are different policies such as drop head and drop tail that are applied depending on the domain’s requirements.
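A minimal sketch of such a bounded queue with per-domain drop policies (the class and policy names are simplified stand-ins, not the actual Fault Domain Manager):

```python
from collections import deque

class BoundedQueue:
    """Circular buffer that sheds load under pressure. Per-domain policy:
    'drop_head' discards the oldest entries (keeps the freshest data),
    'drop_tail' discards new arrivals (keeps the earliest, in order)."""
    def __init__(self, capacity, policy="drop_head"):
        self.q = deque()
        self.capacity = capacity
        self.policy = policy
        self.dropped = 0  # a simple pressure signal for the system

    def push(self, item):
        if len(self.q) < self.capacity:
            self.q.append(item)
        elif self.policy == "drop_head":
            self.q.popleft()      # evict oldest to make room
            self.q.append(item)
            self.dropped += 1
        else:                     # drop_tail: refuse the new item
            self.dropped += 1

    def pop(self):
        return self.q.popleft() if self.q else None

q = BoundedQueue(capacity=3, policy="drop_head")
for n in range(5):
    q.push(n)
print(list(q.q), q.dropped)  # [2, 3, 4] 2
```

Which policy fits depends on the destination domain: a metrics store may prefer the freshest points (drop head), while an ordered log consumer may prefer an unbroken prefix (drop tail).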
The raw log data is a great source of metrics and events. The client does not need to know ahead of time what is of interest. The heart of the system that does this is Distributed OLAP. There are multiple dimensions such as machine name, cluster name, datacenter, transaction name, etc. The system maintains counters in memory on hierarchically described data. Traditional OLAP systems cannot keep up with the amount of data, so they partition across layers consisting of publishers, buses, aggregators, combiners, and query servers. The aggregators produce OLAP cubes, multidimensional structures of counters; the combiner then merges them into one gigantic cube that is made available for queries.
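A toy version of the hierarchical counter idea: maintain one counter per combination of concrete and wildcarded dimensions, so a single counter table can answer queries at any level of the hierarchy. Dimension names and events here are illustrative:

```python
from collections import Counter
from itertools import product

def roll_up(event, dims):
    """Yield every combination of kept/wildcarded dimensions for one event,
    so one counter dict can answer queries at any level of the hierarchy."""
    values = [(d, event[d]) for d in dims]
    for mask in product([True, False], repeat=len(dims)):
        yield tuple((d, v) if keep else (d, "*")
                    for keep, (d, v) in zip(mask, values))

dims = ("datacenter", "cluster", "txn")
cube = Counter()
events = [
    {"datacenter": "dc1", "cluster": "web", "txn": "search"},
    {"datacenter": "dc1", "cluster": "web", "txn": "view_item"},
    {"datacenter": "dc2", "cluster": "web", "txn": "search"},
]
for ev in events:
    for cell in roll_up(ev, dims):
        cube[cell] += 1

# Query: total 'search' transactions across all datacenters and clusters.
print(cube[(("datacenter", "*"), ("cluster", "*"), ("txn", "search"))])  # 2
```

The cost is that each event touches 2^d cells for d dimensions, which is exactly why the real system partitions the work across aggregator and combiner layers.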
Time Series Storage
RRD was a remarkable invention when it came out, but it can’t deal with data at this scale. One solution is to use a column-oriented database such as HBase or Cassandra. However, you don’t know what your row size should be, and handling very large rows is problematic. OpenTSDB, on the other hand, uses fixed row sizes based on time intervals. At eBay’s scale of millions of metrics per second, you need to down-sample based on metric frequency. To solve this, they introduced a concept of multiple row spans for different resolutions.
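The fixed-row idea can be sketched as a time-bucketed row key, with coarser spans for down-sampled resolutions. The key format below is illustrative, not OpenTSDB’s actual binary layout:

```python
def row_key(metric, ts, span_seconds):
    """OpenTSDB-style row key: every datapoint for one metric within a
    fixed time span lands in the same row, so row sizes stay bounded."""
    base = ts - (ts % span_seconds)
    return f"{metric}:{base}"

# Coarser spans for down-sampled resolutions: raw points share hour-long
# rows, 5-minute rollups share day-long rows (spans chosen for illustration).
SPANS = {"raw": 3600, "5m_rollup": 86400}

print(row_key("cpu.user", 1_700_003_210, SPANS["raw"]))  # cpu.user:1700002800
```

Because the bucket boundary is deterministic, all writers and readers agree on which row holds a given point without any coordination.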
- Entropy is important to look at; remove it as early as possible
- Data distribution needs to be flexible and elastic
- Storage should be optimized for access patterns
Q. What are the outcomes in terms of value gained?
A. Insights into availability of the site are important as they release code every day. Business insights into customer behavior are great too.
Q. How do they scale their infrastructure and do deployments?
A. Each layer is horizontally scalable but they’re still working on auto-scaling at this time. eBay is looking to leverage cloud automation to address this.
Q. What is the smallest element that you cannot divide?
A. Logs must be processed atomically. It is hard to parallelize metric families.
Q. How do you deal with security challenges?
A. Their security team applies governance. Also there is a secure channel that is encrypted for when you absolutely need to log sensitive data.
Chuck tries to avoid the “D” “O” word… DevOps. But he was impressed by a John Allspaw presentation at Velocity 09, “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr”. This led him to set up a bootcamp session at Facebook and this post is based on what he tells new developers.
Developers want to get code out as fast as possible. Release Engineers don’t want anything to break. So there’s a need for a process. “Can I get my rev out?” “No. Go away”. That doesn’t work. They’re all working to make change. Facebook operates at ludicrous speed. They’re at massive scale. No other company on earth moves as fast at their scale.
Chuck has two things at his disposal: tools and culture. He latched on to the culture thing after Allspaw’s talk. The first thing that he tells developers is that they will shepherd their changes out to the world. If they write code and throw it over the wall, it will affect Chuck’s Mom directly. You have to deal with dirty work and it is your operational duty from check-in to trunk to in-front-of-my-Mom. There is no QA group at Facebook to find your bugs before they’re released.
How do you do this? You have to know when and how a push is done. All systems at Facebook follow the same path, and they push every day.
How does Facebook push?
Chuck doesn’t care what your source control system is. He hates them all. They push from trunk. On Sunday at 6p they take trunk and cut a branch called latest. Then they test for two days before shipping. This is the old school part. Tuesday they ship, then Wed-Fri they cherry pick more changes. 50-300 cherry picks per day are shipped.
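The weekly cadence could be sketched like this; the command strings and the helper are illustrative placeholders, not Facebook’s actual tooling:

```python
from datetime import date, timedelta

def weekly_cut_plan(today):
    """Sketch of the cadence Chuck describes: cut 'latest' from trunk on
    Sunday, ship Tuesday, then cherry-pick Wednesday through Friday."""
    # Back up to the most recent Sunday (weekday(): Mon=0 .. Sun=6).
    sunday = today - timedelta(days=(today.weekday() + 1) % 7)
    plan = [(sunday, "git checkout -b latest trunk  # cut at 6pm")]
    plan.append((sunday + timedelta(days=2), "ship latest  # Tuesday release"))
    for offset in (3, 4, 5):  # Wed-Fri: daily cherry-pick pushes
        plan.append((sunday + timedelta(days=offset),
                     "git cherry-pick <approved-revs> && ship latest"))
    return plan

for day, cmd in weekly_cut_plan(date(2012, 3, 1)):
    print(day.strftime("%a"), cmd)
```

The point of the shape is that the big, well-tested weekly release absorbs most of the risk, while the daily pushes stay small.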
But Chuck wanted more. “Ship early and ship twice as often” was a post he wrote on the Facebook engineering blog. (check out the funny comments). They started releasing 2x/day in August. This wasn’t as crazy as some people thought, because the changes were smaller with the same number of cherry picks per day.
About 800 developers check in per week. It keeps growing as they hire more. There’s about 10k commits per month to a 10M LOC codebase. But the rate of cherry picks per day has remained pretty stable. There is a cadence for how things go out. So you should put most of your effort into the big weekly release. Then lots of stuff crowds in on Wed as fixes come in. Be careful on Friday. At Google they had “no push Fridays”. Don’t check in your code and leave. Sunday and Monday are their biggest days, as everyone uploads and views all the photos from everyone else’s drunken weekend.
Give people an out. If you can’t remember how to do a release, don’t do anything. Just check into trunk and you can avoid the operational burden of showing up for a daily release.
Remember that you’re not the only team shipping on a given day. Coordinate changes for large things so you can see what’s planned company-wide. Facebook uses Facebook groups for this.
You should always be testing. People say it but don’t mean it, but Facebook takes it very seriously. Employees never go to the real facebook.com because they are redirected to http://www.latest.facebook.com. This is their production Facebook plus all pending changes, so the whole company is seeing what will go out. Dogfooding is important. If there’s a fatal error, you get directed to the bug report page.
File bugs when you can reproduce them. Make it easy and low friction for internal users to report an issue. The internal Facebook includes some extra chrome with a button that captures session state, then routes a bug report to the right people.
When Chuck does a push, there’s another step: a developer’s changes are not merged unless they’ve shown up. You have to reply to a message to confirm that you’re online and ready to support the push. So the actual build comes from http://www.inyour.facebook.com, which has fewer changes than latest.
Facebook.com is not to be used as a sandbox. Developers have to resist the urge to test in prod. If you have a billion users, don’t figure things out in prod. Facebook has a separate complete and robust sandbox system.
On-call duties are serious. They make sure that they have engineers assigned as point of contact across the whole system. Facebook has a tool that allows quick lookup of on-call people. No engineer escapes this.
Facebook does everything in IRC. It scales well with up to 1000 people in a channel. Easy questions are answered by a bot. There is a command to lookup the status of any rev. They also have a browser shortcut as well. Bots are your friends and they track you like a dog. A bot will ask a developer to confirm that they want a change to go out.
Where are we?
Facebook has a dashboard with nice graphs showing the status of each daily push. There is also a test console. When Chuck does the final merge, he kicks off a system test immediately. They have about 3500 unit test suites and he can run one on each machine. He reruns the tests after every cherry pick.
There are thousands and thousands of web servers. There’s good data in the error logs but they had to write a custom log aggregator to deal with the volume. At Facebook you can click on a logged error and see the call stack. Click on a function and it expands to show the git blame and tell you who to assign a bug to. Chuck can also use Scuba, their analysis system, which can show trends and correlate to other events. Hover over any error, and you get a sparkline that shows a quick view of the trend.
This is one of Facebook’s main strategic advantages and is key to their environment. It is a feature flag manager controlled by a console: you can turn new features on selectively and restrict the set of users who see the change. Once they turned on “fax your photo” for only Techcrunch as a joke.
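A minimal sketch of such a feature gate, with an explicit allowlist plus a stable-hash percentage rollout. The class and its API are invented for illustration, not the real tool:

```python
import hashlib

class FeatureGate:
    """Toy feature-gating console: a feature can be off, on for an explicit
    allowlist, or on for a percentage of users chosen by stable hash."""
    def __init__(self):
        self.features = {}

    def set(self, name, allow=None, percent=0):
        self.features[name] = {"allow": set(allow or []), "percent": percent}

    def enabled(self, name, user_id):
        f = self.features.get(name)
        if not f:
            return False
        if user_id in f["allow"]:
            return True
        # Stable hash so each user consistently lands in or out of the rollout.
        bucket = int(hashlib.md5(f"{name}:{user_id}".encode()).hexdigest(), 16) % 100
        return bucket < f["percent"]

gk = FeatureGate()
gk.set("fax_your_photo", allow={"techcrunch"})  # the joke launch
gk.set("new_feature", percent=5)                # slow rollout to ~5% of users
print(gk.enabled("fax_your_photo", "techcrunch"))  # True
print(gk.enabled("fax_your_photo", "alice"))       # False
```

Hashing on name plus user means each feature gets an independent 5% slice rather than the same users seeing every experiment.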
Chuck’s job is to manage risk. When he looks at the cherry pick dashboard it shows the size of the change, and the amount of discussion in the diff tool (how controversial is the change). If both are high he looks more closely. He can also see push karma rated up to five stars for each requestor. He has an unlike button to downgrade your karma. If you get down to two stars, Chuck will just stop taking your changes. You have to come and have a talk with him to get back on track.
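The risk heuristic might be sketched like this. The thresholds are invented; the point is combining change size, discussion volume, and requestor karma:

```python
def review_priority(change_lines, discussion_comments, karma_stars):
    """Sketch of the triage Chuck describes: big diffs with lots of
    back-and-forth in the diff tool get a closer look, and low-karma
    requestors are blocked outright. Thresholds are illustrative."""
    if karma_stars <= 2:
        return "blocked"   # come have a talk with the release engineer first
    if change_lines > 500 and discussion_comments > 20:
        return "inspect"   # large and controversial: manual scrutiny
    return "normal"

print(review_priority(800, 35, karma_stars=4))  # inspect
print(review_priority(40, 2, karma_stars=5))    # normal
print(review_priority(40, 2, karma_stars=2))    # blocked
```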
This is a great tool that does a full performance regression on every change. It will compare perf of trunk against the latest branch.
HipHop for PHP
This generates about 600 highly optimized C++ files that are then linked into a single binary. But sometimes they use interpreted PHP in dev, a mismatch they plan to solve with the PHP virtual machine that they plan to open source.
This is how they distribute the massive binary to many thousands of machines. Clients contact Open Tracker server for list of peers. There is rack affinity and Chuck can push in about 15 minutes.
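Rack affinity in peer selection could look roughly like this; the field names and ordering rule are illustrative, not the actual tracker logic:

```python
def pick_peers(me, peers, limit=3):
    """Sketch of rack-affine peer selection for a BitTorrent-style push:
    prefer peers in the same rack so most transfer traffic stays on the
    top-of-rack switch instead of crossing the core network."""
    same_rack = [p for p in peers if p["rack"] == me["rack"] and p != me]
    other = [p for p in peers if p["rack"] != me["rack"]]
    return (same_rack + other)[:limit]

peers = [
    {"host": "web01", "rack": "r1"},
    {"host": "web02", "rack": "r1"},
    {"host": "web03", "rack": "r2"},
    {"host": "web04", "rack": "r3"},
]
print([p["host"] for p in pick_peers(peers[0], peers)])  # ['web02', 'web03', 'web04']
```

Keeping most chunks rack-local is what makes pushing a massive binary to thousands of machines feasible in minutes.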
Tools alone won’t save you
The main point is that you cannot tool your way out of this. The people coming on board have to be brainwashed so they buy into the cultural part. You need the right company with support from the top all the way down.
This guest post is based on a presentation given by @mattkemp, @chicagobuss, and @codyaray at CloudConnect Chicago 2012
As a fast-growing tech company in a highly dynamic industry, BrightTag has made a concerted effort to stay true to our development philosophy. This includes fully embracing open source tools, designing for scale from the outset and maintaining an obsessive focus on performance and code quality (read our full Code to Code By for more on this topic).
Our recent CloudConnect presentation, Automating Cloud Applications Using Open Source, highlights much of what we learned in building BrightTag ONE, an integration platform that makes data collection and distribution easier. Understanding many of you are also building large, distributed systems, we wanted to share some of what we’ve learned so you, too, can more easily automate your life in the cloud.
BrightTag utilizes cloud providers to meet the elastic demands of our clients. We also make use of many off-the-shelf open source components in our system including Cassandra, HAProxy and Redis. However, while each component or tool is designed to solve a specific pain point, gaps exist when it comes to a holistic approach to managing the cloud-based software lifecycle. The six major categories below explain how we addressed common challenges that we faced and it’s our hope that these experiences help other growing companies grow fast too.
Service Oriented Architecture
Cloud-based architecture can greatly improve scalability and reliability. At BrightTag, we use a service oriented architecture to take advantage of the cloud’s elasticity. By breaking a monolithic application into simpler reusable components that can communicate, we achieve horizontal scalability, improve redundancy, and increase system stability by designing for failure. Load balancers and virtual IP addresses tie the services together, enabling easy elasticity of individual components; and because all services are over HTTP, we’re able to use standard tools such as load balancer health checks without extra effort.
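Because every service speaks HTTP, a single uniform health check resource is enough for the load balancer. A minimal sketch, where the path and ephemeral port are assumptions rather than BrightTag’s actual configuration:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

class Health(BaseHTTPRequestHandler):
    """Every service exposes the same health resource, so load balancer
    health checks need no per-service logic."""
    def do_GET(self):
        if self.path == "/healthcheck":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Health)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/healthcheck"
body = urllib.request.urlopen(url).read()
print(body)  # b'OK'
server.shutdown()
```

An HAProxy `option httpchk` against the same path then marks the instance up or down automatically.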
Most web services require some data to be available in all regions, but traditional relational databases don’t handle partitioning well. BrightTag uses Cassandra for eventually consistent cross-region data replication. Cassandra handles all the communication details and provides a linearly scalable distributed database with no single point of failure.
In other cases, a message-oriented architecture is more fitting, so we designed a cross-region messaging system called Hiveway that connects message queues across regions by sending compressed messages over secure HTTP. Hiveway provides a standard RESTful interface to more traditional message queues like RabbitMQ or Redis, allowing greater interoperability and cross-region communication.
Zero Downtime Builds
Whether you have a website or a SaaS system, everyone knows uptime is critical to the bottom line. To achieve 99.995% uptime, BrightTag uses a combination of Puppet, Fabric and bash to perform zero downtime builds. Puppet provides a rock-solid foundation for our systems. We then use Fabric to push out changes on demand. We use a combination of HAProxy and built-in health checks to make sure that our services are always available.
Whether you use a dedicated DNS server or /etc/hosts files, keeping a flexible environment functioning properly means updating your records, and knowing where your instances are, on a regular and automatic basis. To accomplish this, we use a tool called Zerg, a Flask web app that leverages libcloud to abstract the specific cloud provider API away from the common operations we need to perform regularly in all our environments.
HAProxy Config Generation
Zerg allows us to do more than just generate lists of instances with their IP addresses. We can also abstractly define our services in terms of their ports and health check resource URLs, giving us the power to build entire load balancer configurations filled in with dynamic information from the cloud API where instances are available. We use this plus some carefully designed workflow patterns with Puppet and git to manage load balancer configuration in a semi-automated way. This approach maximizes safety while maintaining an easy process for scaling our services independently – regardless of the hosting provider.
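A simplified sketch of rendering HAProxy backend stanzas from abstract service definitions plus a live instance list, in the spirit of Zerg. Service names, ports, and check paths are invented for illustration:

```python
SERVICES = {
    "api":   {"port": 8080, "check": "/healthcheck"},
    "stats": {"port": 9090, "check": "/status"},
}
# Instances as a cloud inventory tool might report them (hypothetical data).
INSTANCES = {"api": ["10.0.1.5", "10.0.1.6"], "stats": ["10.0.2.7"]}

def haproxy_backends(services, instances):
    """Render HAProxy backend stanzas: the abstract service definition
    supplies ports and health checks, the cloud API supplies addresses."""
    out = []
    for name, svc in sorted(services.items()):
        out.append(f"backend {name}")
        out.append(f"    option httpchk GET {svc['check']}")
        for i, ip in enumerate(instances.get(name, [])):
            out.append(f"    server {name}{i} {ip}:{svc['port']} check")
    return "\n".join(out)

print(haproxy_backends(SERVICES, INSTANCES))
```

Regenerating the config from live inventory, then reviewing the diff through git before reload, is what keeps the process semi-automated yet safe.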
Application and OS level monitoring is important to gain an understanding of your system. At BrightTag, we collect and store metrics in Graphite on a per-region basis. We also expose a metrics service per-region that can perform aggregation and rollup. On top of this, we utilize dashboards to provide visibility across all regions. Finally, in addition to visualizations of metrics, we use open source tools such as Nagios and Tattle to provide alerting on metrics we’ve identified as key signals.
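For reference, shipping a point to Graphite is just its plaintext protocol, one line per datapoint over TCP. A small sketch, where the host name and metric path are assumptions:

```python
import socket
import time

def send_metric(path, value, host="graphite.internal", port=2003, sock=None):
    """Graphite's plaintext protocol is one line per point:
    '<metric.path> <value> <unix-timestamp>\n' sent to the Carbon port."""
    line = f"{path} {value} {int(time.time())}\n"
    if sock is None:
        sock = socket.create_connection((host, port))
    sock.sendall(line.encode())
    return line

class _FakeSock:
    """Stand-in socket so the example runs without a Graphite server."""
    def __init__(self):
        self.sent = b""
    def sendall(self, data):
        self.sent += data

s = _FakeSock()
send_metric("us-east.api.requests", 42, sock=s)
print(s.sent.decode().split()[0])  # us-east.api.requests
```

Embedding the region in the metric path is one simple way to get the per-region rollups described above.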
There is obviously a lot more to discuss when it comes to how we automate our life in the cloud at BrightTag. We plan to post more updates in the near future to share what we’ve learned in the hopes that it will help save you time and headaches living in the cloud. In the meantime, check out our slides from CloudConnect 2012.
This post was live blogged by @martinjlogan so expect errors.
This talk is about how to overcome organizational hurdles and get DevOps humming in your org. This illustrates how we did it at DRW Trading.
DRW needed to adjust. The problem was that we are not exposing people to problems upfront. Everyone was only exposed to their local problems and only optimized locally. We looked and continue to look at DevOps as our tool to change this.
[Seth is talking a bit about the lessons that were learned at DRW that can really be applied at all levels in the org.]
The first thing you need to do if you are introducing DevOps to your org is define what DevOps is to you. Gartner has an interesting definition; not sure if it reflects our opinions, but at least they are trying to figure it out. At DRW we use the words “agile operations” and DevOps interchangeably. We are integrating IT operations with agile and lean principles: fast iterative work, embedding people on teams, and moving people as close to the value they are delivering as possible. DevOps is not a job, it is a way of working. You can have people in embedded positions using these practices as easily as folks on shared teams.
The next thing you need to do is focus on the problem that you are trying to solve. This is obvious but not all that simple. Here is an example. We had a complaint from our high frequency trading folks last year saying that servers were not available fast enough. It took on average 35 days for us to get a server purchased and ready to run. Dan North and I were reading the book “The Goal” – a book I highly recommend. It is a really good read. The book talks about the theory of constraints and applying lean principles to repeatable processes. We applied a technique called value stream mapping to our server delivery process. People complained that I [Seth] was a bottleneck because I had to approve all server purchases. Turned out I only take 2 hours to do that. The real problem lay elsewhere. The value stream mapping allowed us to see where our bottlenecks were so that we could focus in on our real bottlenecks and not waste cycles on less productive areas. We zeroed in accurately and reduced the time from 35 to 12 days.
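The value stream mapping exercise boils down to simple arithmetic over the delivery steps. With invented numbers that mirror the story, the loudest complaint (approval) is clearly not the constraint:

```python
# Hypothetical durations for each step of server delivery, in days.
# The approval step people blamed is the 2-hour one.
steps = {
    "purchase approval": 0.1,
    "vendor order + shipping": 14,
    "rack and cable": 6,
    "os install + config": 10,
    "handoff to team": 4.9,
}

total = sum(steps.values())
bottleneck = max(steps, key=steps.get)
print(f"total lead time: {round(total, 1)} days; biggest constraint: {bottleneck}")
```

Per the theory of constraints, effort spent anywhere other than the biggest step barely moves the total, which is why mapping the whole stream before optimizing matters.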
The third cultural lesson, and an important one, is to keep your specialists. One of the worst things that can happen is that you introduce a lot of generalist operators and then the network team, for example, says wow, you totally devalued me, and they quit. You lose a lot of expertise that turns out to be quite useful. Keep your specialists in the center. You want to highlight the tough problems to the specialists and leverage them for solving those problems. Introducing DevOps can actually open the floodgates for more work for the people in the center. We endeavored to distribute Unix system management to reduce the amount of work for the Unix team itself. This got people all across the org a bit closer to what was going on in this domain. What actually happened is that the Unix team was hit harder than ever: as we got people closer to the problem, demand that we had not seen or been able to notice previously increased quite a bit. This is a good problem to have because you start to understand more of what you are trying to do and you get more opportunities to innovate around it.
If you are looking at a traditional org oftentimes these specialist teams are spending time justifying their own existence. They invent their own projects and they do things no one needs. These days at DRW we find that we have long shopping lists of deep unix things that we actually need. The Unix specialists are now constantly working on key useful features. We are always looking for more expert unix admins.
The last lesson learned, a painful lesson, is that “people have to buy in”. The CIO can’t just walk in and say you have to start doing DevOps. You can’t force it. We made a mistake recently and we learned from it and turned it into a success. A few months ago we were looking at source control usage. The infrastructure teams were not leveraging this stuff enough for my taste among other things. I said, we need to get these guys pairing with a software engineer. I forced it. It went along these lines: the person doing the pairing was not teaching the person they were pairing with. They were instead just focused on solving the problem of the moment. The person being paired with was not bought in to even doing the pairing in the first place. People resented this whole arrangement.
We took a hard retrospective look at this and in the end we practiced iterative agile management and changed course. I worked with Dan North who came from a software engineering background and who also had a lot of DevOps practice. A key thing about Dan is that he loves to teach and coach other people. The fact that he loved coaching was a huge help. Dan sat with folks on the networking team and got buy-in from them. He got them invested in the changes we wanted to make. The head of the networking team now is learning python and using version control. Now the network team is standing up self service applications that are adding huge value for the rest of the organization and making us much more efficient.
Some lessons learned from the technology
Ok, so Seth has covered a lot of the cultural bits and pieces. Now I [Chris Read] will talk about the technical lessons or at least lessons stemming from technical issues. To follow are a few examples that have reinforced some of the cultural things we have done. The first one is the story of the lost packet. This happened within the first month or 2 of me joining. We had an exchange sending out market data, through a few hops, to a server that every now and again loses market data. We know this because we can see gaps in the sequence numbers.
The first thing we would do is check the exchange to see if it was actually mis-sequencing the data. Nope, that was not the problem. So then the dev team went down to check the server itself. The Unix team looks at the machine, the IP stack, the interfaces, etc. and declares the machine fine. Next the network guys jump in and see that everything is fine there. The server however was still missing data. So we jump in and look at the routers. Guess what, everything looks fine. This is where I [Chris Read] got involved. This problem is what you call the call center conundrum: people focus on small parts of the infrastructure, and with the knowledge that they have, things look fine. Luckily in previous lives I have been a network admin and a unix admin. I dig in and I can see that the whole network up to the machine was built with high availability pairs. I dig into these pairs. The first ones looked good. I look into more and then finally get down to one little pair at the bottom, and there was a different config on one of the machines. A single line problem. Solving this fixed it. It was only through having a holistic view of the system and having the trust of the org to get onto all of these machines that I was able to find the problem.
The next story is called “monitoring giants”. This also happened quite early in my dealings at DRW, and it taught me a very interesting lesson. I had been in London for 6 weeks and lots of folks were talking about monitoring. We needed more monitoring. I set up a basic Zenoss install and other such things. I came to Chicago and my goal was to show the folks here how monitoring was done, meaning to inspire the Chicago folks. I went to show them things about monitoring and was met with a fairly negative response. The guys perceived my work as a challenge to their domain. My whole point in putting this together was lost. I learned the lesson of starting to work with folks early on and being careful about how you present things. It was also a lesson on change. It is only in the last couple of months that I have learned how difficult change can be for a lot of people. You have to take this into account when pushing change. Another bit of this lesson is that you need to make your intentions obvious: over-communicate.
We actually think it is ok to recreate the wheel if you are going to innovate. What is not ok is to recreate it without telling the folks that currently own it. – Seth Thompson.
The next lesson is about DNS. This one was quite surprising to me. It is all about unintended consequences. Our DNS services used to handle a very low number of requests. As we started introducing DevOps there was a major ramp up in requests to DNS per second. We were not actually monitoring it though. All of a sudden people started noticing latency. People started to say “hey, why is the Internet slow?”. Network people looked at all kinds of things and then the problem seemed to solve itself. We let it go. Then a few weeks later, outage! The head of our Windows team noticed that one host was doing 112k lookups per second. Some developers wrote a monitoring script that did a DNS lookup in a tight loop. We have now added all this to our monitoring suite. Because the windows team had been taught about network monitoring and log file analysis, because they had been exposed, they were able to catch and fix this problem themselves.
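One common fix for the lookup-in-a-tight-loop pattern is a TTL cache in front of the resolver. A sketch, where the resolver is a stand-in callable rather than DRW’s actual setup:

```python
import time

def make_cached_resolver(resolve, ttl=30):
    """Wrap a resolver with a TTL cache so a tight loop hits the cache
    instead of the DNS servers. 'resolve' is any name -> address callable."""
    cache = {}
    def cached(name, now=None):
        now = time.monotonic() if now is None else now
        hit = cache.get(name)
        if hit and now - hit[1] < ttl:
            return hit[0]          # still fresh: no upstream query
        addr = resolve(name)
        cache[name] = (addr, now)
        return addr
    return cached

calls = []
def fake_resolve(name):
    calls.append(name)
    return "10.0.0.1"

lookup = make_cached_resolver(fake_resolve, ttl=30)
for _ in range(100_000):           # the tight loop from the outage
    lookup("internal.example.com", now=0)
print(len(calls))  # 1 upstream query instead of 100,000
```

The monitoring lesson still stands: queries per second on DNS itself is exactly the metric that would have caught the 112k lookups per second host before the outage.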
Quick summary of the lessons
Communication is very key. You must spend time with the people you are asking to change the way they are working.
Get buy-in, don’t push. As soon as you push something onto someone, they are going to push back. Something will break, someone will get hurt. You need to develop a pull: they must pull change from you, they must want it.
Keep iterating. Keep getting better and make room for failure. If people are afraid of mistakes they won’t iterate.
Finally, change is hard. Change is hard, but it is the only constant. As you are developing you will constantly change. Make sure that your organization and your people are geared toward healthy attitudes about change.
Question: Can you talk a little bit more about buy-in?
Answer: One of the most important things about getting buy-in is to prove your changes out for people. Try things on a smaller scale, with prototypes of process or technology, get a success, and hold it up as an example of why it should be scaled out further.