A Systems Policy

Recently I talked to a couple of friends, who all complained quite a bit about their operations or internal IT departments.

Most of these teams had to fight with some very basic things. They lacked a decent monitoring system, or had no monitoring at all. They didn’t deploy systems; they installed them by hand. Systems were not documented, and so on.

So here are some guidelines that I aspire to with my team. This is by no means a complete list of what you need to run successful operations, but it should give you a fair idea of what it takes.

Also, please note that you might want to adapt the policy to fit your own needs. I’m coming from the web industry, but we still run our own hardware, so this might not fit a typical cloud-based infrastructure particularly well.

Systems

A system is considered the lowest part of our infrastructure and services. All rules defined here should be considered in all other policies.

A system…

  • is documented at a central location.
  • is monitored and being graphed.
  • is backed up.
  • is updated regularly.
  • has a defined production level. (spare, pre-production, production)
  • has a defined owner and maintainer.
  • has a predefined maintenance level.
  • has a predefined availability.
  • has a physical location.
  • has a unique name, which is resolvable by DNS.
  • has only required software installed.
  • was installed with all currently available updates.
  • was inspected and approved by a second person before being released to production.
  • All parts are functional at all times. All faults get documented RFN and repaired as soon as possible.
  • There are always 2+ people informed about it.
  • Network access vectors are defined.
  • Configurations are not only available locally (including scripts).
  • Sensitive data gets protected.
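
Several of these rules can be checked mechanically against whatever inventory you keep. The following is a minimal sketch, assuming a hypothetical `SystemRecord` structure; the field names and the `policy_violations` helper are illustrative and not part of any particular tool.

```python
from dataclasses import dataclass, field

# Allowed production levels, taken from the policy list above.
PRODUCTION_LEVELS = {"spare", "pre-production", "production"}


@dataclass
class SystemRecord:
    """Hypothetical inventory entry; the field names are illustrative only."""
    hostname: str
    owner: str = ""
    maintainer: str = ""
    production_level: str = ""
    location: str = ""
    informed_people: list = field(default_factory=list)
    approved_by: str = ""


def policy_violations(record):
    """Return a human-readable list of policy rules the record fails."""
    problems = []
    if not record.owner or not record.maintainer:
        problems.append("owner and maintainer must be defined")
    if record.production_level not in PRODUCTION_LEVELS:
        problems.append("production level must be one of: "
                        + ", ".join(sorted(PRODUCTION_LEVELS)))
    if not record.location:
        problems.append("physical location must be recorded")
    if len(record.informed_people) < 2:
        problems.append("at least two people must be informed about the system")
    if record.production_level == "production" and not record.approved_by:
        problems.append("production systems need approval by a second person")
    return problems
```

Running such a check over every record before the peer review catches the purely administrative gaps, so the reviewer can concentrate on the system itself.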

Hardware

A piece of hardware can be anything from a big server to a small temperature sensor in your server room.

A piece of hardware…

  • has a maintenance contract or spare hardware available.
  • has got an inventory number.
  • is labeled (hostname + inventory).
  • is physically secure (environmental! and mechanical access control).
  • has got a bill, which is documented at a central location.
  • should have redundant power supplies.
  • should have some kind of out of band management solution (OOB).
  • has at least one power supply connected to an electrical circuit protected by an uninterruptible power supply (UPS).

All tools needed to open and repair any part of the system are available.

Servers

A server…

  • has at least two disks configured with RAID >= 1.
  • has at least two separate network interface cards (NICs).
  • has all RAID controllers equipped with battery-backed write caches (BBWC).
  • was dimensioned with adequate future-proof hardware.
  • has a lifetime of 2+ years.

Switches

A switch…

  • is manageable or configurable.
  • is supported by the configuration backup software in use (e.g. RANCID)
  • provides the following protocols: STP, SNMP, IPv6 support (mgmt+multicast), RADIUS for AAA
  • does not forward the default VLAN (1) on its uplink/trunk ports.
  • does have a description for every port in use (including hostname and interface, e.g.: server01#eth0, server01#oob, switch03#24)
  • does not have any enabled, unused ports: set them to disabled and remove any other configuration for this port.
  • blocks or does not forward any discovery protocols on its user ports.
  • is using AAA for authenticating users.
  • logs to a central syslog server.
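
The port rules lend themselves to an automated audit. Below is a minimal sketch that assumes the port data has already been exported into a plain dictionary; the input layout, the regular expression and the `ports_needing_attention` name are assumptions made for illustration, derived from the `server01#eth0` examples above.

```python
import re

# Assumed pattern: <hostname>#<interface>, e.g. "server01#eth0" or "switch03#24".
# The hostname part must end with a number, as required by the hostname rules.
PORT_DESCRIPTION = re.compile(r"^[a-z][a-z0-9-]*[0-9]+#[A-Za-z0-9/._-]+$")


def ports_needing_attention(ports):
    """Map port names to a finding for every port that breaks the policy.

    `ports` maps a port name to a dict with the keys
    'enabled' (bool) and 'description' (str).
    """
    findings = {}
    for name, port in ports.items():
        description = port.get("description", "")
        if port.get("enabled") and not PORT_DESCRIPTION.match(description):
            findings[name] = "enabled port without a hostname#interface description"
        elif not port.get("enabled") and description:
            findings[name] = "disabled port still carries configuration"
    return findings
```

An enabled port without a valid description is either unused or undocumented; in both cases somebody has to look at it.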

Operating Systems

An operating system (OS) is considered as everything running on a server or instance, to support a service or an application.

An Operating System…

  • uses OS-CHOICE-HERE/stable as default distribution on servers.
  • uses OS-CHOICE-HERE as default on clients.
  • reboots without any manual intervention.
  • provides access by SSH.
  • does not permit root login via SSH.
  • has a root password set.
  • has the current time, synchronized with a time server and uses TIMEZONE-CHOICE-HERE as time zone.
  • can resolve internal and internet names via DNS.
  • installs software by packages.
  • installs packages from a central internal repository and the official distribution repositories.
  • software installed by packages should conform to the FHS.
  • software not installed by packages should be installed by a reproducible deployment process.
  • has sane defaults set for user and process environments (locales, shells, screen, some handy tools, etc.).
  • should not provide typical compiler tools (gcc, build-essential).
  • provides a manageable AAA concept (e.g. automated provisioning and de-provisioning of staff users).
  • sends mails destined for root to a central location.
  • provides a local mailer.
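
Rules like "does not permit root login via SSH" are easy to verify from the configuration itself. Here is a rough sketch, assuming an OpenSSH-style `sshd_config` passed in as text; it only looks at global settings, ignores `Match` blocks and included files, and the function name is made up for this example.

```python
def root_login_disabled(sshd_config_text):
    """Return True if the first global PermitRootLogin directive is 'no'.

    sshd uses the first value it obtains for a keyword, so later
    duplicates are ignored here as well.
    """
    for raw_line in sshd_config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        keyword = parts[0].lower()  # keywords are case-insensitive
        if keyword == "match":
            break  # only global settings are inspected in this sketch
        if keyword == "permitrootlogin":
            value = parts[1].strip().lower() if len(parts) > 1 else ""
            return value == "no"
    # Not set explicitly: treat the built-in default as not compliant.
    return False
```

Treating a missing directive as a failure is a deliberate choice: the policy should be visible in the configuration, not depend on a default that can change between versions.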

Hostnames

Hostnames exist to identify every part of your infrastructure uniquely. They are used to refer to systems in your configurations and in discussions. You should think about a naming convention, but here are some rough guidelines.

Hostnames …

  • have to be unique.
  • have to end with a number, which should never be reused and always be incremented.
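
One way to guarantee that numbers are never reused is to derive the next hostname from every name that was ever assigned, including retired hosts. A small sketch, assuming a `prefix + number` scheme with two-digit padding; both the scheme and the `next_hostname` helper are assumptions, not part of the policy.

```python
import re


def next_hostname(prefix, ever_used):
    """Return the next free hostname for `prefix`.

    `ever_used` must contain every hostname ever assigned, retired
    hosts included, so that a number is never handed out twice.
    """
    pattern = re.compile(re.escape(prefix) + r"(\d+)")
    highest = 0
    for name in ever_used:
        match = pattern.fullmatch(name)
        if match:
            highest = max(highest, int(match.group(1)))
    return f"{prefix}{highest + 1:02d}"
```

For example, with `web01` and `web03` on record, the next web server becomes `web04`, even though `web02` was decommissioned long ago.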

Services

A service is considered as everything running on a server’s operating system, to provide continuous functionality (e.g. a script or an application).

A service…

  • logs only errors and auditing information. Application services may log more information as well (e.g. Apache access log).
  • has defined log retention times.
  • logs to syslog unless it’s not possible.
  • is authenticating only on secure connections.
  • has an adequate and future-proof dimensioned datastore.
  • was deployed in a reproducible way.

Networks

A network is considered any part of infrastructure, which is used to interconnect servers or systems. (Layer 1,2,3,4,…)

A Network…

  • has clear entry and routing points.
  • has a diagram which describes access vectors, the logical and physical setup.
  • is deployed in adequate and future-proof dimensions (vlans, ip addresses, bandwidth).
  • uses structured cabling.
  • there is no cross-cabling, except for very rare situations (e.g. HA cabling).
  • should not be used for multiple purposes; at the very least, one network should not combine more than one of the following classifications:

    Class     Description
    net       Internet/upstream network
    mgmt      Management network (monitoring, remote access)
    traffic   Site-local traffic network
    backup    Traffic network for backups
    voip      VoIP telephony network
    clients   A network with client workstations
    devel     A network with development machines
    staging   A network with staging equipment


  • OOBs are easy to reach, even in case of an outage.
  • VLAN-IDs are considered global, create a list.
  • All VLAN-IDs below 99 are switch-local.
  • VLANs have a name and a location.
  • All address space is considered global (vlans, ip- and mac addresses, including RFC1918)
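
The VLAN rules above translate into a very small registry. The sketch below is one possible shape for the "list" of global VLAN-IDs; the class and its method names are hypothetical, and the 1 to 4094 range is the usable ID range of 802.1Q.

```python
SWITCH_LOCAL_BELOW = 99  # per the policy: VLAN-IDs below 99 are switch-local


class VlanRegistry:
    """Hypothetical global list of VLAN-IDs with a name and a location."""

    def __init__(self):
        self._vlans = {}  # vlan_id -> (name, location)

    def register(self, vlan_id, name, location):
        if not 1 <= vlan_id <= 4094:
            raise ValueError(f"VLAN-ID {vlan_id} is outside the usable range 1-4094")
        if vlan_id < SWITCH_LOCAL_BELOW:
            raise ValueError(f"VLAN-ID {vlan_id} is switch-local and not tracked globally")
        if vlan_id in self._vlans:
            existing = self._vlans[vlan_id][0]
            raise ValueError(f"VLAN-ID {vlan_id} is already assigned to {existing}")
        if not name or not location:
            raise ValueError("a VLAN needs both a name and a location")
        self._vlans[vlan_id] = (name, location)

    def lookup(self, vlan_id):
        """Return (name, location) or None if the ID is not registered."""
        return self._vlans.get(vlan_id)
```

Whether this lives in code, a wiki table or a spreadsheet matters less than having exactly one place where IDs are assigned.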

To round off my article, here is an example checklist we use to peer review new systems:

Example Review Checklist

Every newly deployed host or instance should undergo a peer-review process. The checklist below will provide you with a couple of base acceptance criteria and is going to ensure a certain level of quality. Give it to any other sysadmin and ask him or her to check the system, before it’s put into production.

* DNS works (including reverse dns)               :
* SSH login works                                 :
* Host+services monitored                         :
* Host+services graphed                           :
* All filesystems backed up                       :
* Database dumps                                  :
* All Updates installed                           :
* Host in HostDoc                                 : 
* Puppet works                                    :
* Time is accurate                                :
* Root mails are being delivered                  :
* Firewall is active                              :
* No unneeded services are reachable (nmap)       :
* Network configuration works (+ipv6)             :
* Syslog/dmesg/oob logs are clean of errors       :

-- Physical Host --

* Root password documented                        :
* Root login works                                :
* OOB password documented                         :
* OOB login works                                 :
* OOB monitored                                   :
* Switch ports are labeled (+ documented)         :
* Hardware is labeled (+ documented in rack docu) :
* Firmware up to date                             :
* RAID level is >= 1 and all disks OK             :
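
If the filled-in checklist is kept as plain text, finding the items a reviewer skipped takes only a few lines. This is a minimal sketch, assuming the reviewer writes the result after the colon of each `* item :` line; the function name is made up for this example.

```python
def unchecked_items(checklist_text):
    """Return the checklist items that have nothing written after the colon."""
    missing = []
    for line in checklist_text.splitlines():
        if not line.startswith("* "):
            continue  # skip blank lines and section markers
        item, _, result = line[2:].partition(":")
        if not result.strip():
            missing.append(item.strip())
    return missing
```

A system is ready for production only when this returns an empty list and the reviewer has signed off.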

Devops at REA – Enhancing the Culture

Last night I discussed "Devops at REA" at the Devops Melbourne meetup. It was a great turnout, with some great conversations afterwards.
The next meetup is tentatively scheduled for January 2012 and I would really like to hear from smaller businesses and how they are practicing devops.

You can view more of my presentations on SlideShare

Devops Down Under 2011 Open Space – Kanban in Operations

Slides from my open space talk at Devops Down Under 2011: Kanban in operations
View more presentations from Matthew Jones

Kanban in Operations – Virtual Card Wall

(Cross posted on realestate.com.au Tech Blog)
Three months ago I joined the Site Operations team at realestate.com.au and I was pleased to see that the team were using a card wall for work.
Although the physical card wall proved to be a great place to have stand ups and manage work, it had its problems:
  • We have a distributed team. With operations teams in Italy (casa.it) and Luxembourg (athome.lu), people on devops rotations, and people occasionally working from home, it is hard for everyone to participate in stand up.
  • Data associated with cards, such as creation timestamps and creators, is dependent on users writing it on the cards.
  • Limited external visibility into the Site Ops workload. If anyone wanted to know what we were currently working on, they would have to head up to the Site Ops area and have a look.
After a discussion with the team, we decided to trial a virtual card wall.

Scope

The trial would run for two weeks, replicating the cards on our physical card wall, with a retrospective and decision to continue at the end.
The trial would not include capturing incidents or deployments and would be as light as possible.

Setup

To get the trial up and running as soon as possible, we utilised our existing Jira installation with Greenhopper. The project setup and configuration was kept to a bare minimum.
We created five new issue types, based on the cards on our physical wall – Service Requests, Deployment, Provisioning, Housekeeping and Faults.
A week before the trial commenced, we manually imported the cards into Jira and wrote the Jira issue number on the cards. During that week we also duplicated any new physical cards into Jira. This allowed us to start tracking behaviour before we started the trial.
Our virtual card wall is tactile. Stand ups would now be conducted in front of a Smart Board, which allowed us to interact with Greenhopper using our fingers as the mouse.

The Trial

The trial kicked off on Friday 8th July at 0900 with our regular stand up, the only difference being the new virtual card wall.
In addition to Greenhopper, we started trialling weekly iterations (versions) in Jira – Thursday to Thursday.
Although we weren’t planning the iterations, the option is there for participants to put cards into a few iterations later if the card won’t be actioned for a few weeks.

What works and what doesn’t?

The trial of Greenhopper has been great. The trial has identified a few things that work well, and some that don’t. So what works and what doesn’t?
  • It’s difficult to raise new cards at stand up. It’s a change to our regular process of raising cards at stand up, as we have to create and edit cards before or after stand up. However this has minimised interruptions during stand up, allowing the team to focus on stand up.
  • We are able to raise cards wherever we have access to a web browser and we are not constrained to being in the office.
  • For a few of the stand ups we didn’t have access to the Smart Board and used a projector instead. It felt awkward. Having physical interaction with the card wall definitely enhances the experience. It feels natural for the team to huddle around the card wall, rather than a computer.

What’s next?

So what’s next for the Site Operations Greenhopper integration?
  • First up is to extend the trial to the global operations teams, with a possible change of our stand up time to a more sensible hour for our European colleagues.
  • Next is to increase transparency into Site Operations’ current workload. To achieve this we will look into publishing a read-only card wall to the wider company.
  • Start planning work for iterations. We didn’t plan beyond one week during the trial, but we are collecting data on how long cards are taking to cycle through our system.
  • Estimating card size again. Based on the data collected we should be able to reliably estimate work and compare that to the actual durations.
  • Customise Jira to suit the workflow in Site Operations, including incident management and deployments. This will be an evolutionary process, with an aim to keep the workflow as light as possible.
  • The final goal is to investigate integration with other operations systems, such as ZenDesk and Nagios. This would minimise the amount of duplicated effort and streamline our workflow.
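
The cycle-time data mentioned above can be summarised with very little code once the cards are exported. The following is a sketch only: it assumes each card has been reduced to an issue type plus ISO-formatted creation and resolution timestamps, which is a made-up export format rather than anything Jira or Greenhopper provides as such.

```python
from datetime import datetime
from statistics import median


def median_cycle_days(cards):
    """Return the median cycle time in days for each issue type.

    `cards` is an iterable of (issue_type, created, resolved) tuples,
    where both timestamps are ISO 8601 strings.
    """
    durations = {}
    for issue_type, created, resolved in cards:
        delta = datetime.fromisoformat(resolved) - datetime.fromisoformat(created)
        durations.setdefault(issue_type, []).append(delta.total_seconds() / 86400)
    return {issue_type: median(values) for issue_type, values in durations.items()}
```

The median is used instead of the mean so that a single card which sat on the wall for months does not distort the estimate for its whole card type.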

DevOps Presentation Debrief

Last month, I presented DevOps to the folks at work. After I spoke with Damon and John from the DevOps Cafe about it, they convinced me to put it up. I have finally got around to posting the slide deck and my recap of the whole presentation. I also gave this talk impromptu at the Sydney DevOps meetup in October (thanks Mick).

The whole DevOps philosophy was well received, but CAMS really hit home with the audience.

Culture

I think this was the hardest one to push. Trying to break habits is hard, but we are getting there. Simply changing the attitude among a few developers has had great results and they can really see a positive change. I have developers interested in issues that were traditionally “my problem”. I think persistence is the key with this one and sadly I think the whole industry needs a culture change.

Automation

As the attendees were mostly developers, the ideas of automation really got them excited, so much so that I talked to a few of them afterwards about future ideas for automation in the development, test and production cycles.

Internally, I am developing a continuous deployment system that auto-deploys the latest code changes to our environments. This is new to them, though not so much to the DevOps world. The spin on this for us, however, is that we are also integrating our metrics and automated tests.

Metrics

Although I have raised this topic with the team before, there was still a high level of interest. The difference this time was that I demonstrated that metrics can provide invaluable insight into live environments by showing them two simple graphs of Tomcat hits and the server’s bandwidth.

Sharing

I really believe that sharing has a great position in the whole CAMS idea. It allows culture, automation and metrics to be tied together, and this was well received by the audience. I was trying to debunk the fear that sharing information, tools or processes results in being made redundant. Sharing ideas, dashboards, shell accounts etc. actually improves your work and employability.

DevOps Life Cycle

Tying all this together, I went through a “DevOps Life Cycle” (a spin on the SDLC), explaining how DevOps can really support the business and our customers. I had some good feedback on this, as I showed how the concepts of DevOps could be applied within projects.

Question Time

Wrapping up, I had a few questions. I had the typical “How can this be used in my project” questions, which were interesting because our large clients host our software with outsourced IT services companies, so it also raised the question of how DevOps applies to such cases.

Follow Up

One of the directors missed the presentation, reviewed it over the weekend and chased me up for a chat. He was specifically interested in the idea of metrics, and in the revival of non-functional requirements for our software projects, because NFRs are usually “out of scope”. He could really see the business benefit of DevOps, something I found quite refreshing.

Dev and Ops Cooperation

John Allspaw and Paul Hammond gave a great presentation at Velocity 2009 about the tools and culture at Flickr, which enable them to do 10+ deploys per day. My favorite quote is: “Ops’ job is NOT to keep the site stable and fast [but] Ops’ job is to enable the business (this is the dev’s job too)”. The [...]