
Zen and The Art Of Infrastructure Maintenance

There’s a famous passage I’ve quoted from time to time when I see someone “stuck” on a problem, flinging potential fixes and solutions at it wildly in the hope that one of them does the trick. “Assembly of Japanese bicycle require great peace of mind.” Just like that. Improper grammar and all, quoted verbatim from Robert Pirsig’s philosophical classic Zen and the Art of Motorcycle Maintenance. The quote refers to the first printed words on a set of technical instructions, which the narrator discusses with friends and colleagues on a long-distance motorcycle trip. And the response I get is the same every time, likely the same response you’re having right now… What in the world are you talking about? Good, now I have your attention. And I’d be willing to bet that attention is on 1) the ludicrous loss of translation from Japanese to English and 2) debating whether or not something as seemingly superficial as “peace of mind” is really required to assemble a bicycle. Language barrier aside, let’s...

Bootstrapping the Infrastructure Coders Meetup


Earlier this year, David Lutz and I were discussing the lack of an infrastructure as code meetup in Melbourne. We sat down and mapped out our vision for the meetup:

  • Regular meetup - Monthly, held in the second week of every month.
  • Technology agnostic - No preference on tools. We want all conversations, from concept to implementation.
  • Fresh, relevant content - Being technology agnostic keeps the content fresh, but we have to ensure it stays relevant to the meetup.
  • Interesting venues - A bad venue can break a meetup. Ensure the location is comfortable and central for the members.
  • Minimal sponsorship - Sponsorship is great, but it doesn't buy editorial control. We will accept sponsors; however, no sales or marketing talks.

Having established our vision, we needed to prove that Melburnians wanted such a meetup, so we created the Infrastructure Coders meetup and promoted it via Twitter. We had a great response, but it wasn't enough. We approached the organiser of DevOps Melbourne for a short speaker slot to promote Infrastructure Coders, and within a few days we had doubled our membership.

David and I soon realised we had reached critical mass, so we had a discussion about hosting the meetup. Since we were bootstrapping the meetup ourselves, we needed a hosting strategy that kept it free for members, so we decided to find a host for each meetup. The host would be a Melbourne organisation that recognised the relevance of infrastructure as code, the value exchange being: we organise the meetup and the speakers; they provide food, drinks and a space. Hosts are given a speaking slot, provided it is on topic, and it is an opportunity to promote their company. We needed to test this concept, so I approached my employer to host the inaugural meetup.

The date was set, drinks were purchased and food was ordered. The first meetup was small and informal, so we took the opportunity for everyone to introduce themselves and say what they wanted out of Infrastructure Coders. Afterwards we retired to the kitchen for dinner, where discussions on infrastructure as code continued.
We marked the meetup a success and David immediately organised the second host, 99designs.

So what did we learn from this experience?

  • Have a vision - What is the goal of the meetup? Where do you want to take it?
  • Know your audience - What does your audience want? What will they take away from your meetup?
  • Validate your meetup - Create an online space where people can register their interest. We used Meetup.com (pricing is available on their website), but other online event tools, such as Eventbrite, would work too.
  • Market your meetup - Twitter is a great way of getting the word out. Register an account for your meetup and decide on a hashtag. Go to other meetups and promote yours.
  • Gather feedback - Feedback allows the meetup to improve organically with your audience. We have had some great feedback; our members really enjoyed going into the workplaces of companies around Melbourne. This also allowed employees of the host organisation to stay back and listen to a few talks before heading home.

From the initial concept until now, we have hosted four meetups, have two more scheduled for the upcoming months, and I am in discussions with organisations that will book us out until early next year. In addition, I have talked with Scott Lowe about starting Infrastructure Coders Denver.

If you are interested in hosting Infrastructure Coders or starting a new meetup, please contact me.

A Systems Policy

Recently I talked to a couple of friends, who all complained quite a bit about their operations or internal IT departments.

Most of these teams had to fight with some very basic things. They lacked a decent monitoring system, or any monitoring at all. They didn’t deploy systems, they installed them by hand. Systems were not documented, and so on.

So here are some guidelines I aspire to with my team. This is by no means a complete list of what you need to run successful operations, but it should give you a fair hint of what it takes.

Also, please note that you will want to adapt your own policy to fit your needs. I’m coming from the web industry, but we still run our own hardware, so this in particular might not fit a typical cloud-based infrastructure.

Systems

A system is considered the lowest-level part of our infrastructure and services. All rules defined here should also be honoured by all other policies.

A system…

  • is documented at a central location.
  • is monitored and graphed.
  • is backed up.
  • is updated regularly.
  • has a defined production level (spare, pre-production, production).
  • has a defined owner and maintainer.
  • has a predefined maintenance level.
  • has a predefined availability.
  • has a physical location.
  • has a unique name, which is resolvable via DNS.
  • has only required software installed.
  • was installed with all currently available updates.
  • was inspected and approved by a second person before being released to production.
  • has all parts functional at all times; faults get documented immediately and repaired as soon as possible.
  • has at least two people informed about it.
  • has defined network access vectors.
  • has its configurations (including scripts) available somewhere other than just locally.
  • has its sensitive data protected.

Hardware

A piece of hardware can be anything from a big server to a small temperature sensor in your server room.

A piece of hardware…

  • has a maintenance contract or spare hardware available.
  • has an inventory number.
  • is labeled (hostname + inventory number).
  • is physically secured (environmental controls and mechanical access control).
  • has its invoice documented at a central location.
  • should have redundant power supplies.
  • should have some kind of out-of-band management solution (OOB).
  • has at least one power circuit connected to a circuit protected by an uninterruptible power supply (UPS).

All tools needed to open and repair any part of the system are available.

Servers

A server…

  • has at least two disks configured with RAID >= 1.
  • has at least two separate network interface cards (NICs).
  • has all RAID controllers backed with battery backed write caches (BBWC).
  • was dimensioned with adequate future-proof hardware.
  • has a lifetime of 2+ years.

Switches

A switch…

  • is manageable and configurable.
  • is supported by the configuration backup software in use (e.g. RANCID).
  • provides the following protocols: STP, SNMP, IPv6 support (mgmt + multicast), RADIUS for AAA.
  • does not forward the default VLAN (1) on its uplink/trunk ports.
  • has a description for every port in use (including hostname and interface, e.g. server01#eth0, server01#oob, switch03#24).
  • has no enabled but unused ports: disable them and remove any other configuration from those ports.
  • blocks or does not forward any discovery protocols on its user ports.
  • uses AAA for authenticating users.
  • logs to a central syslog server.

Operating Systems

An operating system (OS) is considered to be everything running on a server or instance to support a service or an application.

An Operating System…

  • uses OS-CHOICE-HERE/stable as the default distribution on servers.
  • uses OS-CHOICE-HERE as the default on clients.
  • reboots without any manual intervention.
  • provides access via SSH.
  • does not permit root login via SSH.
  • has a root password set.
  • has the current time, synchronized against a time server, and uses TIMEZONE-CHOICE-HERE as its time zone.
  • can resolve internal and internet names via DNS.
  • installs software via packages.
  • installs packages from a central internal repository and the official distribution repositories.
  • ensures software installed from packages conforms to the FHS.
  • ensures software not installed from packages is installed by a reproducible deployment process.
  • has sane defaults for user and process environments (locales, shells, screen, some handy tools, etc.).
  • should not provide typical compiler tools (gcc, build-essential).
  • provides a manageable AAA concept (e.g. automated provisioning and de-provisioning of staff users).
  • sends mail destined for root to a central location.
  • provides a local mailer.
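
To make a couple of these points concrete, here is a minimal sketch of how two of them could be enforced on a Debian-style system. The time server name, init script paths and the TIMEZONE-CHOICE-HERE stand-in are assumptions; adapt them to your environment.

    #!/bin/bash
    # Minimal sketch: enforce "no root login via SSH" and "time is
    # synchronized against a time server". Debian-style paths and
    # ntp01.example.com are assumptions.
    set -e

    # Disallow root logins over SSH.
    sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
    /etc/init.d/ssh reload

    # Synchronize against an internal time server and set the time zone.
    echo 'server ntp01.example.com iburst' >> /etc/ntp.conf
    /etc/init.d/ntp restart
    cp /usr/share/zoneinfo/UTC /etc/localtime   # your TIMEZONE-CHOICE-HERE

In practice you would of course let your configuration management system own these files rather than edit them in place.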

Hostnames

Hostnames exist to identify every part of your infrastructure uniquely. They are used to refer to systems in your configurations and in discussions. You should think about a naming convention, but here are some rough guidelines.

Hostnames …

  • have to be unique.
  • have to end with a number, which should never be reused and always be incremented.
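
As an illustration only (the prefix and zero-padding scheme are assumptions), a small helper could probe DNS for the next free number. Note the caveat in the comments: because numbers must never be reused, a central counter is the safer source of truth.

    #!/bin/bash
    # Hypothetical helper: print the next unused hostname for a prefix,
    # assuming names like web01, web02, ... that resolve via DNS.
    # Caveat: probing DNS can hand out a decommissioned number again;
    # since numbers must never be reused, prefer a central counter.
    prefix=${1:?usage: next-hostname PREFIX}
    i=1
    while host "$(printf '%s%02d' "$prefix" "$i")" >/dev/null 2>&1; do
        i=$((i + 1))
    done
    printf '%s%02d\n' "$prefix" "$i"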

Services

A service is considered to be everything running on a server’s operating system to provide continuous functionality (e.g. a script or an application).

A service…

  • logs only errors and auditing information; application services may log more (e.g. the Apache access log).
  • has defined log retention times.
  • logs to syslog wherever possible.
  • authenticates only over secure connections.
  • has an adequately dimensioned, future-proof datastore.
  • was deployed in a reproducible way.

Networks

A network is considered to be any part of the infrastructure used to interconnect servers or systems (layers 1, 2, 3, 4, …).

A Network…

  • has clear entry and routing points.
  • has a diagram describing the access vectors and the logical and physical setup.
  • is deployed in adequate and future-proof dimensions (VLANs, IP addresses, bandwidth).
  • uses structured cabling.
  • has no cross-cabling, except in very rare situations (e.g. HA cabling).
  • should not be used for multiple purposes; at the very least, it should not mix the following classifications:
    Class    Description
    net      Internet/upstream network
    mgmt     Management network (monitoring, remote access)
    traffic  Site-local traffic network
    backup   Traffic network for backups
    voip     VoIP telephony network
    clients  A network with client workstations
    devel    A network with development machines
    staging  A network with staging equipment


  • OOBs are easy to reach, even in case of an outage.
  • VLAN IDs are considered global; keep a list of them.
  • All VLAN IDs below 99 are switch-local.
  • VLANs have a name and a location.
  • All address space is considered global (VLANs, IP and MAC addresses, including RFC 1918).

To round off my article, here is an example checklist we use to peer-review new systems:

Example Review Checklist

Every newly deployed host or instance should undergo a peer-review process. The checklist below provides a couple of base acceptance criteria and ensures a certain level of quality. Give it to another sysadmin and ask him or her to check the system before it’s put into production.

* DNS works (including reverse dns)               :
* SSH login works                                 :
* Host+services monitored                         :
* Host+services graphed                           :
* All filesystems backed up                       :
* Database dumps                                  :
* All Updates installed                           :
* Host in HostDoc                                 : 
* Puppet works                                    :
* Time is accurate                                :
* Root mails are being delivered                  :
* Firewall is active                              :
* No unneeded services are reachable (nmap)       :
* Network configuration works (+ipv6)             :
* Syslog/dmesg/oob logs are clean of errors       :

-- Physical Host --

* Root password documented                        :
* Root login works                                :
* OOB password documented                         :
* OOB login works                                 :
* OOB monitored                                   :
* Switch ports are labeled (+ documented)         :
* Hardware is labeled (+ documented in rack docu) :
* Firmware up to date                             :
* RAID level is >= 1 and all disks OK             :
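
Several of these checks lend themselves to scripting. Here is a minimal sketch of what the reviewing sysadmin could run against the new host (assuming SSH access is already set up; the hostname handling and commands are examples, not a complete review):

    #!/bin/bash
    # Sketch: automate a few of the checklist items against a new host.
    host=${1:?usage: review-host HOSTNAME}

    # DNS works (including reverse DNS)?
    ip=$(dig +short "$host")
    rev=$(dig +short -x "$ip")
    echo "DNS forward/reverse : $host -> $ip -> $rev"

    # SSH login works?
    ssh -o BatchMode=yes "$host" true && echo "SSH login          : OK"

    # Time is accurate? (ntpd marks its sync peer with a leading '*')
    ssh "$host" 'ntpq -p | grep -q "^\*"' && echo "Time is accurate   : OK"

    # Firewall is active? (at least one rule loaded; needs a recent iptables)
    ssh "$host" 'iptables -S | grep -q "^-A"' && echo "Firewall is active : OK"

    # No unneeded services reachable: eyeball a quick port scan.
    nmap -F "$host"

The rest of the list (backups, documentation, labels) still needs a pair of human eyes, which is the point of the peer review.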

Appliance or Not Appliance

That's the question Xavier asks in his blog entry titled "Security: DIY or Plug'n'Play".

To me the answer is simple: most of the appliances I have run into so far have no way of being configured apart from the ugly web GUI they ship with. That means I can't integrate them with the configuration management framework I have in place for the rest of the infrastructure. There is no way to automatically modify, for example, firewall rules together with the relocation of a service (which does happen automatically); some kind of manual interaction is always required. Appliances tend to sit on an island: they either stay unmanaged (be honest, when's the last time you upgraded the firmware of that terminal server?) or take a lot of additional effort to manage manually. They require yet another set of tools beyond the set you are already using to manage your network. They don't integrate with your backup strategy, and don't tell me they all come with perfect MIBs.

There are other arguments one could bring up against appliances. Obviously, people can spread FUD about some organisation allegedly paying people to put backdoors in certain operating systems... so why wouldn't they pay people to put backdoors in appliances? They don't even need to hide them in there. But my main concern is manageability, and a box manageable only through a web GUI just tells me that the vendor hates me and doesn't want my business.

A good appliance (security or any other type) needs to provide an API that I can use to configure it. In all other cases I prefer a DIY platform, as I can keep it in line with all my other tools: config management, deployment, upgrade strategies and so on.

Maybe a last question for Xavier to finish my reply... I'm wondering how Xavier thinks he can achieve high availability by running virtual appliances that are not cluster aware in a virtual environment. A comfortable but fake feeling of higher availability, maybe... but high availability? That I'd like to see.

DevOps Presentation Debrief

Last month, I presented DevOps to the folks at work. After speaking with Damon and John from the DevOps Cafe about it, they convinced me to put it up. I have finally got around to posting the slide deck and my recap of the whole presentation. I also gave an impromptu version of this talk at the Sydney DevOps meetup in October (thanks Mick).

The whole DevOps philosophy was well received, but CAMS really hit home with the audience.

Culture

I think this was the hardest one to push... trying to break habits is hard, but we are getting there. Simply changing the attitude among a few developers has had great results, and they can really see a positive change. I have developers interested in issues that were traditionally “my problem”. I think persistence is the key here, and sadly I think the whole industry needs a culture change.

Automation

With mostly developers in attendance, the ideas around automation really got them excited, so much so that I talked to a few of them afterwards about future ideas for automation across the development, test and production cycles. Internally, I am developing a continuous deployment system that auto-deploys the latest code changes to our environments. New to them, not so much to the DevOps world. However, the spin on this for us is that we are also integrating our metrics and automated tests.

Metrics

Although I have raised this topic with the team before, there was still a high level of interest. The difference this time was that I demonstrated how metrics can provide invaluable insight into live environments, by showing them two simple graphs of Tomcat hits and the server’s bandwidth.

Sharing

I really believe that sharing has a great position in the whole CAMS idea... It allows the culture, automation and metrics to be tied together, and this was well received by the audience. I was trying to debunk the fear that sharing information, tools or processes results in being made redundant. Sharing ideas, dashboards, shell accounts etc. actually improves your work and your employability.

DevOps Life Cycle

Tying all this together, I went through a “DevOps Life Cycle” (a spin on the SDLC), explaining how DevOps can really support the business and our customers. I had some good feedback on this, as I showed how the concepts of DevOps could be applied within projects.

Question Time

Wrapping up, I had a few questions. I had the typical “How can this be used in my project?” questions, which were interesting because our large clients host our software with outsourced IT services companies, so it also raised the question of how DevOps applies in such cases.

Follow Up

One of the directors missed the presentation, reviewed it over the weekend and chased me up for a chat. He was specifically interested in the idea of metrics, and in the revival of non-functional requirements for our software projects, because NFRs are usually “out of scope”. He could really see the business benefit of DevOps, something I found quite refreshing.


Coding an Infrastructure Test First

Now that we have outlined the programming languages for automating shell scripting, virtual machine creation, network provisioning, OS installation and beyond, I bet that you, as a devops, are eager to start writing your infrastructure code.

After some time, chances are you will end up with lots and lots of scripts executing in sequence. Then, when you change something in one script, the whole sequence fails and you have a hard time finding what caused the problem. A better approach to writing your code is to practice Test-Driven Development.

Test Driven Development Automation

In short: before writing any code, you first write a test for the code you are about to write. Then you run your tests and see the new test fail (RED). Only then do you write new code or change your existing code until the tests pass (GREEN). Once the tests pass, you clean up and restructure the code while keeping them green (REFACTOR). You then keep using this cycle to grow your code. It is important that you change your code in small increments. For more info, see the Test first guidelines.

Benefits for the sysadmin

So how can this help you as a sysadmin? Isn't this more of a developer thing? The answer is a big NO:

  • Can you remember the last time you had to apply patches or config-file changes to a system? Did you have that fingers-crossed feeling? Wouldn't it be great if you could install a patch and run a series of tests to see whether everything behaves the way it should?
  • When you get audited: how can you show that the machines you're running comply with your installation guides?
  • Writing these tests also helps in sharing knowledge and repeating the validation process every time, even without your rockstar sysadmin being around.
  • The incremental approach also helps against systems over-design. As project complexity grows, you may notice that writing automated tests gets harder. This is your early-warning system for over-complicated design: simplify the design until tests become easy to write again, and maintain this simplicity over the course of the project.

Yes, but won't this slow me down? This is exactly the reaction most developers have to the process. But to me the benefits should be clear: would you rather go live with fingers crossed, or in a verified state? You will have to find the right balance between writing all the tests and writing none.

If you are working on a new project that tries to get something running as quickly as possible as a one-shot, you'll probably be under pressure to deliver, and it takes time to build up the skills to get the automation and the tests into your fingers. Still, a good way of convincing management is to explain that these tests are not only for the project phase but can be used during the whole maintenance period. This definitely increases the ROI of writing them.

Setup / Teardown

A test usually has a setup and a teardown part. The idea is to create the state from which you start performing your tests. For applications, for instance, this would mean putting the right data in the database and setting the right variables. So how does this translate to machines? Most virtualization solutions allow you to easily take snapshots of both your memory state (savestate) and your disks, so you can recreate a certain point in time to start your tests from, by cloning your systems and running some scripts to change the state (for instance, filling a disk or killing a process). Another approach is to snapshot your disks using volume managers or filesystems like LVM or ZFS that make snapshots cheap.

This also helps during the coding of your infrastructure:

  • you take a snapshot of your current state
  • you run your code
  • see if the test succeeds
  • if not OK, roll back; if OK, save the new state (sketched below)
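
Here is a minimal sketch of that cycle using LVM snapshots. The volume group, volume names and the test runner script are assumptions, and note that merging a snapshot back into a filesystem that is in use only completes on the next activation (for a root filesystem, after a reboot):

    #!/bin/bash
    # Sketch of a setup/teardown cycle with LVM snapshots.
    # vg0/root and run-infrastructure-tests.sh are assumptions.
    set -e

    # Setup: freeze the current state so we can roll back afterwards.
    lvcreate --snapshot --size 5G --name root-pretest /dev/vg0/root

    if ./run-infrastructure-tests.sh; then
        # Teardown, tests passed: drop the snapshot, keep the new state.
        lvremove -f /dev/vg0/root-pretest
    else
        # Teardown, tests failed: merge the snapshot back, discarding
        # everything the test run changed (needs a recent LVM2).
        lvconvert --merge /dev/vg0/root-pretest
    fi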

Example: a Webserver Test First

In the following example I will describe the setup of a webserver serving static pages. Note that there is no custom development involved here; it's all about pre-packaged software.

Step 1: Defining the virtual machine

In this step you define the hardware of your virtual machine: number of CPUs, memory, network interfaces, MAC addresses, disks, and so on. So how do we test this? In the days of physical hardware, the way to verify that a system had all its hardware in it was to boot up a CD and verify, using commands, that the system contained the correct number of disks.

To avoid writing your own boot CD, you could use something similar to sysrescueCD. It has a feature called autorun that allows you to boot the disk and execute scripts that live on a floppy, a disk, an NFS or Samba share, or an HTTP server. If you mount this as a virtual CD in your virtual machine, it can boot the virtual machine and execute tests to check that the definition matches what was specified.
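
As a sketch of what such an autorun script might check (the expected values here are assumptions you would generate from your machine definition):

    #!/bin/bash
    # Autorun sketch for a rescue CD: verify the virtual hardware
    # matches its definition. Expected values are assumptions.
    expected_cpus=2
    expected_disks=2
    expected_nics=2

    cpus=$(grep -c '^processor' /proc/cpuinfo)
    disks=$(ls /sys/block | grep -c '^[shv]d')
    nics=$(ls /sys/class/net | grep -vc '^lo$')

    fail=0
    [ "$cpus" -eq "$expected_cpus" ]   || { echo "FAIL: cpus=$cpus";   fail=1; }
    [ "$disks" -eq "$expected_disks" ] || { echo "FAIL: disks=$disks"; fail=1; }
    [ "$nics" -eq "$expected_nics" ]   || { echo "FAIL: nics=$nics";   fail=1; }
    [ "$fail" -eq 0 ] && echo "OK: hardware matches the definition"
    exit "$fail"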

Step 2: Prepare IP, DNS, DHCP, TFTP

The next step is provisioning the network information for your virtual machine. To test this, we can use the same boot-CD approach and verify things via scripts using dhclient, dig and a tftp client. Some might argue that this test is better done from within the OS. Of course, testing once the OS is installed is a more complete test, but doing this test separately from the installed OS lets you better distinguish where an error occurs: is it a problem with the OS driver that there is no IP address, or a problem with the DHCP/DNS definition?
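
A sketch of such a pre-OS network test (the hostname, addresses and TFTP file are assumptions, and a tftp-hpa style client is assumed):

    #!/bin/bash
    # Sketch: verify DHCP, DNS and TFTP from the rescue environment,
    # before any OS is installed. All names and addresses are assumptions.
    set -e

    # Does DHCP hand out a lease on the first interface?
    dhclient eth0

    # Do forward and reverse DNS agree with what was provisioned?
    ip=$(dig +short server01.example.com)
    name=$(dig +short -x "$ip")
    echo "forward: $ip   reverse: $name"

    # Does the TFTP server answer (e.g. for later PXE installs)?
    tftp 192.0.2.10 -c get pxelinux.0 && echo "tftp: OK"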

Step 3: Minimal Install of OS

This is usually done by defining a kickstart/jumpstart template. It would contain the disk partitioning, network configuration, the minimal packages and patches, a set of minimal services enabled (ssh, puppet), SELinux enabled, and so on. As you can see, there is a lot more that can be tested here (a scripted sketch follows the list):

  • Is the swap activated correctly?
  • Is all memory seen by the OS?
  • Check whether the disk partitioning is OK.
  • Check whether the disks are mounted correctly.
  • Is it 64- or 32-bit?
  • Are the permissions set right?
  • Is SELinux activated?
  • Is the NFS share exported? showmount -e
  • IP: are the interfaces up, and did they get the correct settings?
  • Do some DNS lookups to see if name resolution works.
  • Ping the router to see if the network is alive.
  • Verify the syntax of your sendmail.cf.
  • See if the processes are running (sshd, puppetd): ps -ef
  • Check the listeners: netstat -an | grep LISTEN
  • Do a test login with SSH.
  • Run nmap to see that no other services are activated.
  • Run Nessus to check for vulnerabilities.
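
A sketch of how a few of these checks could be scripted (the expected memory size, services and router address are assumptions; adapt per host class):

    #!/bin/bash
    # Sketch: post-install checks mirroring the list above.
    ok()   { echo "OK:   $1"; }
    fail() { echo "FAIL: $1"; rc=1; }
    rc=0

    [ "$(wc -l < /proc/swaps)" -gt 1 ]      && ok "swap activated"      || fail "swap activated"
    mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
    [ "$mem_kb" -ge 8000000 ]               && ok "8GB memory visible"  || fail "8GB memory visible"
    [ "$(getenforce)" = "Enforcing" ]       && ok "SELinux enforcing"   || fail "SELinux enforcing"
    pgrep -x sshd >/dev/null                && ok "sshd running"        || fail "sshd running"
    pgrep -x puppetd >/dev/null             && ok "puppetd running"     || fail "puppetd running"
    dig +short www.example.com | grep -q .  && ok "DNS lookups work"    || fail "DNS lookups work"
    ping -c1 -W2 192.0.2.1 >/dev/null       && ok "router answers ping" || fail "router answers ping"
    netstat -an | grep -q ':22 .*LISTEN'    && ok "sshd listening"      || fail "sshd listening"
    exit "$rc"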

Up until now these are simple checks, and most people probably do similar things in their monitoring. I've talked before about testing being more than monitoring: aside from these simple checks, you can complement your monitoring by running scenarios. Destructive tests like failing a disk or bringing down an interface are probably not the best thing to do in production monitoring ;-) Two of these scenarios are sketched after the list.

  • test IP bonding by executing a failover
  • verify that syslog works by sending a log message
  • test whether your RAID setup works by killing a disk
  • test whether your self-healing works by killing a process
  • test a reboot scenario
  • test whether your DNS failover works by using iptables to block access to the first DNS server
  • test your backup/restore scenario ;-)
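
Two of these scenarios in sketched script form (the log path and the daemon under test are assumptions, and this is obviously for a test system, not production):

    #!/bin/bash
    # Sketch: syslog and self-healing scenario tests.

    # Syslog: send a uniquely tagged message and check that it lands.
    tag="logtest-$$-$(date +%s)"
    logger -t "$tag" "syslog scenario test"
    sleep 2
    grep -q "$tag" /var/log/messages && echo "syslog: OK" || echo "syslog: FAIL"

    # Self-healing: kill the daemon, then check the supervisor restarted it.
    pkill -x httpd
    sleep 10
    pgrep -x httpd >/dev/null && echo "self-healing: OK" || echo "self-healing: FAIL"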

So aren't we re-testing the packages here? The problem is that packages are often tested in isolation from each other, and they can only test a limited number of setups. That means it still makes sense to repeat some tests to see whether YOUR combination of things actually works.

Step 4: Apply recipes for the webserver

We now have a tested minimal OS running with a configuration management system active. Next in line is applying recipes. While discussing this with a number of people, I often heard that you don't need tests because these tools work in a declarative mode: if something doesn't work, then it's either an error in your recipe or an error in the configuration management software.

I personally disagree. The argument is similar to why we test even when individual packages are tested: you can write a beautiful recipe to install a webserver, but maybe the firewall is blocking your access, maybe SELinux is blocking things, or maybe you installed a bad Apache config file so that it serves from the wrong directory.

You can also add scenario testing here (a few checks are sketched after the list):

  • run load against it and see if it actually spawns the number of processes you specified
  • check that the load balancer pages are available
  • kill the HTTP daemon and see if it recovers
  • check whether caching works by downloading the same file more than once
  • check whether HTTP compression works
  • check whether the Last-Modified headers are set correctly
  • check whether log rotation works
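
A few of these checks sketched with curl (the URL and the X-Cache header are assumptions that depend on your setup):

    #!/bin/bash
    # Sketch: webserver acceptance checks using curl.
    url=http://webserver01.example.com/index.html

    # Page served with a good status?
    curl -sf -o /dev/null "$url" && echo "status: OK"

    # Compression negotiated when the client asks for it?
    curl -sI -H 'Accept-Encoding: gzip' "$url" \
      | grep -qi '^Content-Encoding: gzip' && echo "compression: OK"

    # Caching headers set correctly?
    curl -sI "$url" | grep -qi '^Last-Modified:' && echo "last-modified: OK"

    # Cache warm after a first request? (assumes a cache that sets X-Cache)
    curl -sI "$url" >/dev/null
    curl -sI "$url" | grep -qi '^X-Cache: HIT' && echo "cache: OK"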

Testing Frameworks

Programming languages have grown a lot of frameworks over time to help with writing tests; when coding your infrastructure in one of these languages, check what's available. Still, there is no test library specific to systems testing; the closest are test frameworks for HTTP testing.

This is currently an emerging field, and there are already a lot of examples in the wild. As we adopt the idea of testable infrastructure more widely, more of these frameworks will emerge. The most notable so far is cucumber-nagios, written by Lindsay Holmwood. It brings HTTP testing closer to monitoring and into the sysadmin world: you can now reuse your tests, written in Cucumber, in your monitoring environment.

After a discussion at devopsdays 09, Lindsay has started on a similar project that allows SSH scripting to be integrated into this framework, or what he calls Behavior-driven infrastructure through Cucumber. And recently he announced on the agile system administration mailing list that he's joining forces with Adam Jacob of Opscode. So there's definitely more to come!