Devops from a sysadmin perspective
This year's LISA (Large Installation System Administration) 2011 conference has "devops" as its theme.
The LISA crowd has been practicing automation for a long time, and many of them just look at devops as something they have always been doing.
So they asked me to write an article for USENIX ;login: magazine to explain devops from a sysadmin perspective. As the article requires a subscription, I'm re-posting it here for others to enjoy :)
Introduction
While there is no one true definition of devops (similar to cloud), four of its key points revolve around Culture, Automation, Measurement and Sharing (CAMS). In this article we will show how this affects the traditional thinking of the sysadmin.
As a sysadmin you are probably familiar with the Automation and Measurement part: it has been good and professional practice to script/automate work to make things faster and repeatable. Gathering metrics and doing monitoring is an integral part of the job to make sure things are running smoothly.
The pain
For many years, operations (of which the sysadmin is usually part) has been seen as an endpoint in the software delivery process: developers code new functionality during a project in isolation from operations, and once the software is considered finished, it is handed to the operations department to run.
During deployment a lot of issues tend to surface: typical examples are development and test environments that are not representative of the production environment, or too little thought having been given to backup and restore strategies. Often it is too late in the project to change much of the architecture and structure of the code, which gives way to many fixes and ad-hoc solutions. This friction has created disrespect between the two groups: developers feel that operations knows nothing about software, and operations feels that developers know nothing about running servers. Management tends to keep the two groups isolated from each other, keeping the interaction at the minimum required. The result is a 'wall of confusion'.
Culture of collaboration
Historically two drivers have fuelled devops. The first was Agile Development, which in many companies led to many more deployments than operations was used to. The second was Cloud and large scale web operations, where the scale required a much closer collaboration between development and operations.
When things really go wrong, organizations often create a multi-disciplined task force to tackle production problems. Truth is that in today's IT, environments have become so complex that they can't be understood by one person or even one group. Therefore instead of separating developers and operations as we used to do, we need to bring them together more closely: we need more practice, and the motto should be "if it's hard do it more often".
Devops recognizes that software only provides value if it's running in production, and that running a server without software does not provide value either. Development and operations are both working to serve the customer, not to run their own department.
Although many sysadmins have been collaborating with other departments, it has never been seen as a strategic advantage. The cultural part of devops seeks to promote this constant collaboration across silos, in order to better meet business demands. It aims for 'friction-less' IT and promotes a cross-departmental, cross-disciplinary approach.
Good places to get started with collaboration are those where the discussion often escalates: deployment, packaging, testing, monitoring, building environments. These can be seen as boundary objects: places of which every silo has its own understanding. These are exactly the places where technical debt accumulates, so they are likely to contain real pain issues.
Culture of sharing
Silos exist in many forms in the organization, not only between developers and operations. In some organizations there are even silos inside operations: network, security, storage and server teams avoid collaboration and each work in their own world. This has been referred to as the Ops-Ops problem. So in geek-speak devops is actually a wildcard for devops* collaboration.
Devops doesn't mean all sysadmins now need to know how to code software, or that all developers need to know how to install a server. By collaborating constantly, both groups can learn from each other, but can also rely on each other to do the work. A similar approach has been promoted by Agile between developers and testers. Devops can be seen as an extension of this, bringing system administrators into the Agile equation.
Starting the conversation sometimes takes courage, but think about the benefits: you get to learn the application as it grows, and you can actively shape it by providing your input during the process. A sysadmin has a lot to offer to developers. For instance, you know what production looks like, so you can build representative environments for test and dev. You can be involved in load testing and failover testing. You can set up a monitoring system that developers can use to see what's wrong. Or you can give access to production logs so developers can understand real-world usage.
A great way to share information and knowledge is by pairing with a developer or a colleague: while you are deploying code, the developer can comment on what the impact is on the code and you can ask questions directly. This interaction is of great value in understanding both worlds better.
Revisiting Automation
As stated in the Agile Manifesto, devops values "Individuals and interactions over processes and tools". The great thing about tools is that they are concrete and can have a direct benefit, as opposed to culture. It was hard to grasp the impact of Virtualization and Cloud unless you started doing it. Tools can shape the way we work and consequently change our behavior.
A good example is Configuration Management and Infrastructure as Code. A lot of people rave about its flexibility and power for automation. If you look beyond the effect of saving time, you will find that it also has a great sharing aspect: it has created a 'shared' language that allows you to exchange the way you manage systems with colleagues, and even outside your company by publishing recipes/cookbooks on GitHub. Because we now use concepts such as version control and testing, we have a common problem space with developers. And most importantly, the automation frees us from the trivial stuff and allows us to discuss and focus on the stuff that really matters.
Revisiting Metrics
The effects of collaboration can't be measured by counting the number of interactions; after all, more interaction doesn't mean a better party. It's similar to a black hole: you have to look at the objects nearby. So how do you see that things are improving? As an engineer you collect metrics about the number of incidents, failed deploys, successful deploys and tickets. Instead of keeping this information in its own silo, you radiate it to the other parts of the company so they can learn from it. Celebrate both successes and failures and learn from them. Do post-mortems with all parties involved and act on the findings. Again this changes the focus of metrics and monitoring from only fixing things fast to providing feedback to the whole organization. Aim to optimize the whole instead of only your own part.
The secret sauce
Several of the 'new' companies have been front-runners in these practices, such as Amazon with its two-pizza team approach and Flickr with its 10 deploys a day. But more traditional companies like National Instruments are also seeing the value of this culture of collaboration. They see collaboration as the 'secret sauce' that will set them apart from their competition. Why? Because it recognizes the individual not as a resource but as resourceful in tackling the challenges that exist in this complex world called IT.
Links index:
- Patrick Debois's Devopsdays Melbourne Keynote
- John Willis, What devops means to me
- Damon Edwards, what is devops
- Israel Gat, boundary objects in devops
- Agile Manifesto
- Ernest Mueller, Originality and Operations
- Cliff Stoll, The Cuckoo's Egg
- Andrew Shafer, Israel Gat, Patrick Debois, Velocity Conference 2011, Devops Metrics
- Amazon Architecture
- John Allspaw, 10 deploys per day - dev and ops cooperation at flickr
- Jesse Robbins, Operations is a competitive advantage
Enter Stage Right: DevOps
I have worked in environments where reacting was the "normal" way to approach issues and being proactive was just a waste of time... *facepalm* The seemingly simple idea of changing the workplace culture from reactive to proactive was hard to pitch, let alone implement on my own.
18 months ago, I knew there was a better way of doing things and I was determined to find something. I had been watching developers implement Agile Development processes with great envy and I was starting to see concepts that would have great outcomes if adopted in a traditional operations environment.
Finally having something tangible to explore, I started looking into Agile Operations, which led me to DevOps. Lurking in the shadows for a while, I digested as much as I could of the DevOps movement. I thought it was time to talk to some of the developers and get them on board.
6 months on and here we are...
So what is DevOps to me?
Ideology
DevOps is a way of thinking and acting. It is not a toolkit, but something we can all practice every day.
Adam Jacob is spot on with his Velocity 2010 talk: DevOps is an inclusive movement, and I believe this includes developers, operations and management.
Business Problem
Although it has stemmed from the trenches of development and operations theatres, DevOps is very much a business problem.
It needs to be supported by the business to be successful, otherwise we end up with "Black Ops" style DevOps and no one benefits.
Stakeholders may point out that DevOps addresses some of these business problems with technical solutions, but DevOps is still a business problem.
Not A Role
You can't be a DevOp, but you can practice DevOps in your role.
That said, I believe that in the next couple of years we will see businesses creating roles that combine work traditionally done by either developers or system administrators. These roles will be filled by the masters of DevOps-fu, but not by a "DevOp".
Catalyst For Change
The IT industry is due for a much needed shake up and DevOps has provided a catalyst for this change.
CAMS
CAMS to me is the "Howto guide" of DevOps.
Damon and John have hit the nail on the head with this one:
- Culture
- Automation
- Metrics
- Sharing
I have found CAMS to be the easiest way for people new to the DevOps scene to understand what it's all about.
So What Is Next?
Spread the word! Get DevOps out there.
The response to DevOps from sysadmins has been amazing, but we need to get other key influencers from within business on board.
We need to start promoting/evangelising DevOps to developers, managers, directors, VPs and C-level executives too.
Sysadmins
- Get together and piggyback on a local developer user group meetup (Java, Ruby, Python etc.)
- Talk to developers at work and ask them what you can do to help.
- Gatecrash a user group that is traditionally full of operations folk (Linux, Puppet, Chef etc.)
- Ask your operations team what you could do to make their lives easier.
- Run an internal DevOps meetup at your workplace. Invite your manager.
- Find a DevOps meet up in your area.
Further Reading
10+ Deploys Per Day: Dev and Ops Cooperation at Flickr - John Allspaw and Paul Hammond.
What DevOps Means To Me - John Willis
Velocity 2010 - Adam Jacob on DevOps - Adam Jacob
Effective adhoc commands in clusters
Last night I had a bit of a mental dump on Twitter about structured and unstructured data when communicating with a cluster of servers. Twitter fails at this kind of thing, so I figured I'd follow up with a blog post.
I started off asking for a list of tools in the cluster admin space and got some great pointers which I am reproducing here:
fabric, cap, func, clusterssh, sshpt, pssh, massh, clustershell, controltier, rash (related), dsh, chef knife ssh, pdsh+dshbak and of course mcollective. I was also sent a list of ssh related tools which is awesome.
The point I feel needs to be made is that in general these tools just run commands on remote servers. They are not aware of the command's output structure, of what denotes pass or fail in the context of the command, and so on. Basically the commands people run have for ages been designed to be looked at by human eyes and parsed by a human mind. Yes, they are easy to pipe and grep and chop up, but ultimately they were always designed to be run on one server at a time.
The parallel ssh'ers run these commands in parallel and you tend to get a mash of output. STDOUT and STDERR are mixed, and often output from different machines is multiplexed together, so you get a stream of text that is hard to decipher even for 2 machines, not to mention 200 at once.
Take as an example a simple yum command to install a package:
% yum install zsh
Loaded plugins: fastestmirror, priorities, protectbase, security
Loading mirror speeds from cached hostfile
372 packages excluded due to repository priority protections
0 packages excluded due to repository protections
Setting up Install Process
Package zsh-4.2.6-3.el5.i386 already installed and latest version
Nothing to do
When run on one machine you pretty much immediately know what's going on: the package was already there, so nothing got done. Now let's see cap invoke:
# cap invoke COMMAND="yum -y install zsh"
* executing `invoke'
* executing "yum -y install zsh"
servers: ["web1", "web2", "web3"]
[web2] executing command
[web1] executing command
[web3] executing command
** [out :: web2] Loaded plugins: fastestmirror, priorities, protectbase, security
** [out :: web2] Loading mirror speeds from cached hostfile
** [out :: web3] Loaded plugins: fastestmirror, priorities, protectbase
** [out :: web3] Loading mirror speeds from cached hostfile
** [out :: web3] 495 packages excluded due to repository priority protections
** [out :: web2] 495 packages excluded due to repository priority protections
** [out :: web3] 0 packages excluded due to repository protections
** [out :: web3] Setting up Install Process
** [out :: web2] 0 packages excluded due to repository protections
** [out :: web2] Setting up Install Process
** [out :: web1] Loaded plugins: fastestmirror, priorities, protectbase
** [out :: web3] Package zsh-4.2.6-3.el5.x86_64 already installed and latest version
** [out :: web3] Nothing to do
** [out :: web1] Loading mirror speeds from cached hostfile
** [out :: web1] Install 1 Package(s)
** [out :: web2] Package zsh-4.2.6-3.el5.x86_64 already installed and latest version
** [out :: web2] Nothing to do
** [out :: web1] 548 packages excluded due to repository priority protections
** [out :: web1] 0 packages excluded due to repository protections
** [out :: web1] Setting up Install Process
** [out :: web1] Resolving Dependencies
** [out :: web1] --> Running transaction check
** [out :: web1] ---> Package zsh.x86_64 0:4.2.6-3.el5 set to be updated
** [out :: web1] --> Finished Dependency Resolution
** [out :: web1]
** [out :: web1] Dependencies Resolved
** [out :: web1]
** [out :: web1] ================================================================================
** [out :: web1] Package Arch Version Repository Size
** [out :: web1] ================================================================================
** [out :: web1] Installing:
** [out :: web1] zsh x86_64 4.2.6-3.el5 centos-base 1.7 M
** [out :: web1]
** [out :: web1] Transaction Summary
** [out :: web1] ================================================================================
** [out :: web1] Install 1 Package(s)
** [out :: web1] Upgrade 0 Package(s)
** [out :: web1]
** [out :: web1] Total download size: 1.7 M
** [out :: web1] Downloading Packages:
** [out :: web1] Running rpm_check_debug
** [out :: web1] Running Transaction Test
** [out :: web1] Finished Transaction Test
** [out :: web1] Transaction Test Succeeded
** [out :: web1] Running Transaction
** [out :: web1] Installing : zsh 1/1
** [out :: web1]
** [out :: web1]
** [out :: web1] Installed:
** [out :: web1] zsh.x86_64 0:4.2.6-3.el5
** [out :: web1]
** [out :: web1] Complete!
command finished
zlib(finalizer): the stream was freed prematurely.
zlib(finalizer): the stream was freed prematurely.
zlib(finalizer): the stream was freed prematurely.
Most of this scrolled off my screen and at the end all I had was the last bit of output. I could scroll up and still figure out what was going on: 2 of the 3 already had the package installed, one got it. Now imagine the output of 100 or 500 of these machines all mixed together. Just parsing this output would be prone to human error and you're likely to miss that something failed.
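To make the problem concrete, here is a minimal Python sketch of what a script has to do to recover a per-host result from a transcript like the one above. The regular expression is modelled only on the `** [out :: host]` prefix visible in that transcript, and the `classify` heuristic is a hypothetical guess based on yum's human-oriented wording, so treat it as an illustration of how fragile this kind of parsing is rather than as a recommended tool.

```python
import re
from collections import defaultdict

# Modelled on the "** [out :: web2] ..." prefix seen in the transcript above.
LINE_RE = re.compile(r"^\s*\*+\s*\[(\w+) :: ([^\]]+)\]\s?(.*)$")

def group_by_host(raw_output):
    """Return {host: [lines]} from interleaved, host-prefixed output."""
    per_host = defaultdict(list)
    for line in raw_output.splitlines():
        match = LINE_RE.match(line)
        if match:  # lines without a host prefix are ignored
            _stream, host, text = match.groups()
            per_host[host].append(text)
    return dict(per_host)

def classify(lines):
    """Hypothetical heuristic: guess the outcome from yum's free-form text."""
    if any(line.startswith("Complete!") for line in lines):
        return "installed"
    if any("Nothing to do" in line for line in lines):
        return "already installed"
    return "unknown"

SAMPLE = """\
 ** [out :: web2] Loaded plugins: fastestmirror, priorities, protectbase, security
 ** [out :: web3] Nothing to do
 ** [out :: web1] Install 1 Package(s)
 ** [out :: web2] Nothing to do
 ** [out :: web1] Complete!
"""

summary = {host: classify(lines) for host, lines in group_by_host(SAMPLE).items()}
for host in sorted(summary):
    print(host, summary[host])
```

Every string matched in `classify` is tied to one version of one package manager, which is exactly the kind of guesswork a structured response would make unnecessary.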
So here is my point: your cluster management tool needs to provide an API around everyday tasks like package management, process listing and so on. It should return structured data, and you can use that structured data to create tools better suited to working with a large number of machines. Because the output is standardized, it can provide generic tools that just do the right thing out of the box.
With the package example above knowing that all 500 machines had spewed out a bunch of stuff while installing isn’t important, you just want to know the result in a nice way. Here’s what mcollective does:
$ mc-package install zsh
* [ ============================================================> ] 3 / 3
web2.my.net version = zsh-4.2.6-3.el5
web3.my.net version = zsh-4.2.6-3.el5
web1.my.net version = zsh-4.2.6-3.el5
---- package agent summary ----
Nodes: 3 / 3
Versions: 3 * 4.2.6-3.el5
Elapsed Time: 16.33 s
In the case of a package you just want to know the version after the event and a summary of status. Just by looking at the stats I know the desired result was achieved; had different versions been listed, I could very quickly identify the problem hosts.
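A summary like this is cheap to produce once every host answers with the same fields. The sketch below is not mcollective's code; it is a small Python illustration that assumes each reply is a dictionary with `sender`, `statuscode`, `statusmsg` and `data` keys (the names that appear in the client-side response quoted in this post), that a `statuscode` of 0 means success, and that the package agent reports a `version` field. The last two points are assumptions made for the example.

```python
from collections import Counter

def summarize(replies):
    """Split replies into a version histogram and a list of failed hosts."""
    versions = Counter()
    failed = []
    for reply in replies:
        if reply["statuscode"] == 0:  # assumed: 0 means the request succeeded
            versions[reply["data"]["version"]] += 1  # "version" is an assumed field
        else:
            failed.append((reply["sender"], reply["statusmsg"]))
    return versions, failed

# Hand-written sample replies mirroring the three hosts in the output above.
replies = [
    {"sender": "web1.my.net", "statuscode": 0, "statusmsg": "OK",
     "data": {"version": "4.2.6-3.el5"}},
    {"sender": "web2.my.net", "statuscode": 0, "statusmsg": "OK",
     "data": {"version": "4.2.6-3.el5"}},
    {"sender": "web3.my.net", "statuscode": 0, "statusmsg": "OK",
     "data": {"version": "4.2.6-3.el5"}},
]

versions, failed = summarize(replies)
print(f"Nodes: {len(replies) - len(failed)} / {len(replies)}")
for version, count in versions.most_common():
    print(f"Versions: {count} * {version}")
for host, message in failed:
    print(f"FAILED {host}: {message}")
```

The same function works unchanged for 3 or 500 replies, and a host on a different version shows up as a second line in the histogram instead of being buried in scrollback.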
Here’s another example – NRPE this time:
% mc-rpc nrpe runcommand command=check_disks
* [ ============================================================> ] 47 / 47
dev1.my.net Request Aborted
CRITICAL
Exit Code: 2
Performance Data: /=4111MB;3706;3924;0;4361 /boot=26MB;83;88;0;98 /dev/shm=0MB;217;230;0;256
Output: DISK CRITICAL - free space: / 24 MB (0% inode=86%);
Finished processing 47 / 47 hosts in 766.11 ms
Notice that I didn't use an NRPE-specific mc- command here. I just used the generic rpc caller, and the caller knows that I am only interested in seeing the results of machines that are in WARNING or CRITICAL state. If you run this on your console, the 'Request Aborted' is red and the 'CRITICAL' is yellow, immediately pulling your eye to the important information. Also note how the result shows human-friendly field names like 'Performance Data'.
The formatting, the highlighting, the knowledge to show only failing resources and the human-friendly headings all happen automatically. No programming of a client-side UI is required; you get this for free simply because mcollective focuses on putting structure around outputs.
Here’s the earlier package install example with the standard rpc caller not with a specialized package frontend:
% mc-rpc package install package=zsh
Determining the amount of hosts matching filter for 2 seconds .... 47
* [ ============================================================> ] 47 / 47
Finished processing 47 / 47 hosts in 2346.05 ms
Everything worked: all 47 machines have the package installed and your desired action was taken. So there is no point in spamming you with pages of junk; who cares to see all the Yum output? Had an install failed, you'd have had a usable error message just for the host that failed. The output would be equally usable on one or a thousand hosts, with very little margin for human error in knowing the result of your request.
This happens because mcollective has a standard structure for responses. Each response has an absolute success value that tells you whether the request failed or not, and by using this you can get generic CLI, web and other tools that display large amounts of data from a network of hosts in a way that is appropriate and context aware.
For reference here’s the response as received on the client:
{:sender=>"dev1.my.net",
 :statuscode=>1,
 :statusmsg=>"CRITICAL",
 :data=>
  {:perfdata=>
    " /=4111MB;3706;3924;0;4361 /boot=26MB;83;88;0;98 /dev/shm=0MB;217;230;0;256",
   :output=>"DISK CRITICAL - free space: / 24 MB (0% inode=86%);",
   :exitcode=>2}}
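To show how little client code a generic viewer needs once responses look like this, here is a rough Python sketch. The reply is the response above transcribed by hand into a Python dictionary; `DISPLAY_NAMES` is a hypothetical lookup table invented for this example, and treating a `statuscode` of 0 as success is an assumption. It is not how the mc-rpc client is implemented.

```python
# Hypothetical mapping from result keys to human-friendly headings.
DISPLAY_NAMES = {
    "exitcode": "Exit Code",
    "perfdata": "Performance Data",
    "output": "Output",
}

def render_failures(replies):
    """Return display lines for failing hosts only; successful hosts stay silent."""
    lines = []
    for reply in replies:
        if reply["statuscode"] == 0:  # assumed: 0 means success
            continue
        lines.append(f"{reply['sender']}  {reply['statusmsg']}")
        for key, value in reply["data"].items():
            label = DISPLAY_NAMES.get(key, key)  # fall back to the raw key
            lines.append(f"    {label}: {value}")
    return lines

critical = {
    "sender": "dev1.my.net",
    "statuscode": 1,
    "statusmsg": "CRITICAL",
    "data": {
        "perfdata": " /=4111MB;3706;3924;0;4361 /boot=26MB;83;88;0;98 /dev/shm=0MB;217;230;0;256",
        "output": "DISK CRITICAL - free space: / 24 MB (0% inode=86%);",
        "exitcode": 2,
    },
}
# A made-up healthy host, included to show that it produces no output.
healthy = {"sender": "web1.my.net", "statuscode": 0, "statusmsg": "OK",
           "data": {"exitcode": 0}}

report = render_failures([healthy, critical])
print("\n".join(report))
```

Nothing in `render_failures` knows about NRPE, which is the point: the same loop would display a failed package install or a failed service restart.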
Only by thinking about CLI and admin tasks in this way do I believe we can take the Unix utilities that we call on remote hosts and turn them into something appropriate for large scale parallel use, something that doesn't overwhelm the human at the other end with information. Additionally, since this is a computer-friendly API, it makes those tools usable in many other places such as code deployers, for example to make your continuous deployment a robust use of Unix tools via such an API.
There are many other advantages to this approach. Requests are authorized at a very fine level and are audited. API wrappers are versioned code that can be tested in development, which makes the margin for error much smaller than just running random Unix commands ad hoc. Finally, whether you're using the code ad hoc on a CLI as above or in your continuous deployer, you share the same code that you've already tested and trust.
20 DevOps guys you should follow
DevOps is an approach to bridge the gap between agile software development and operations. The DevOps tribe is a growing group of people practicing a new way of combining development and system administration for more speed, quality, revenues, and fun.
The DevOps Tribe
Here is a list of some of the most active guys in [...]
Translating Code Smells in Server Smells

At Xpdays Benelux 2009, I attended an interesting session called 'Developing a Sense of Smells' by Kevin Rutherford and Lindsay McEwan.
The exercise we did went as follows: suppose you are asked to do some work on code you have never seen before. How would you assess it, estimate the effort, and explain that effort to justify the price or number of days? The first round resulted in terms like 'look for design patterns', 'readability of the code', ...
Then they explained code smell patterns. For instance, a greedy function is one function that does way too much, making it difficult to change. A hidden secret is internal knowledge, such as the real interpretation of a CSV data format, that cannot be deduced by looking at the code. The code smells made it easier to express the problems.
The devops in me thinks that this can be translated to the sysadmin world. Here are some of the 'smells' I've come up with.
Private Playground
The sysadmin uses the system as his toy playground, doesn't clean up.
- /tmp & /var/tmp full of old install files
- / full of files
Greedy Server
One server that does every function
- combined mail and web and dns and fileshares
- all users on the same system
Root is the cause of all evil
- last shows all logins as root
- no sudo is activated
- no sshd keys for logins
- nfs share/root?
- Chmod 777
- most processes run as root
Cranky Crutches
Things that are needed to keep the system alive when failed
- /etc/ start but no stop scripts
- kill /stop/start in cron jobs
Nobody lives here anymore
Signs that not much maintenance is done any more
- Last update is a long time ago
- Older kernel versions
- Last login was more than x days ago
- Last reboot was a long time ago
Complexity Conspiracy
Loadbalancers, Cluster software, Dependencies
More is Less
- All packages installed
- All services running
- All ports are open
This is just the beginning, so if you have your own ideas/names, just leave a comment.
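Some of these smells lend themselves to a quick automated check. As a rough illustration, the Python sketch below looks for two of them under a given directory: old leftover files (the 'Private Playground' smell) and entries with mode 777 (from 'Root is the cause of all evil'). The function name `find_smells` and the 30-day threshold are made up for this example; it is a starting point under those assumptions, not a complete audit tool.

```python
import os
import stat
import time

def find_smells(root, max_age_days=30, now=None):
    """Walk `root` and report stale files and files with mode 777."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale, wide_open = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISLNK(info.st_mode):
                continue  # symlink permissions are not meaningful here
            if info.st_mtime < cutoff:
                stale.append(path)
            if stat.S_IMODE(info.st_mode) == 0o777:
                wide_open.append(path)
    return sorted(stale), sorted(wide_open)
```

Calling `find_smells("/tmp")` or `find_smells("/var/tmp")` returns two sorted lists of paths, one per smell, which could be fed into whatever reporting you already have.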
Xpdays Benelux 2009 – Continuous Integration for the World
Here are the slides of my presentation at Xpdays Benelux 2009. The presentation is about how we could bring sysadmins and developers closer together by using continuous integration in both worlds.
I would very much like to thank Gildas Le Nadan for helping with the first version of it at Xpdays France 2009. Also I want to thank the organizers for this great conference! And for giving me the chance to speak about this rather niche subject.