
Archive → March, 2010

gnome-do trick

I use gnome-do. It’s fairly useful. One quick trick: press <SHIFT> – <ENTER> instead of plain <ENTER> to launch without closing the launcher box. So, if you want three terminals, type: <SUPER> – <SPACE>, t(erminal), (<SHIFT> – <ENTER> × 2), <ENTER>.


11 days till Loadays

That's right .. only 11 more ...
The schedule looks promising: there will be some devops juice, some open spaces, some tutorials, some regular talks .. and the schedule is packed.

Apart from the talks, tutorials and open spaces there's also the Pizza party and the Beer event on Saturday ...

No need to register .. just show up ..


#Devops / Ruby Meetup , Antwerp, April 8, 2010

Joshua Timberman will be in town (Antwerp, that is) for Loadays. As he is arriving on Thursday, Botchagalupe suggested we should have a Devops / Ruby get-together.

So I'm dutifully announcing the Devops/Ruby meetup next Thursday, April 8th, in Antwerp.

The plan is to meet up for beers and chatter in our favourite Antwerp geek pub, Kulminator, Vleminckveld 32, Antwerp, around 20h00 ish..

Topics will be devops, ruby and much more :)

No need to register .. just show up ..

If for some reason the Kulminator is too crowdy, smokey, closedy you should be able to find us next door in the Zeppos :)


lecturing and git-bisect

I was recently asked to give a lecture for the PRELUDE series at McGill. Here was my abstract:

I don’t like computers, and neither should you.

We spend too much time figuring out how to talk to them, instead of having them figure out how to understand us.

There’s a big discontinuity between what software is providing, and the killer features we want!

We’re not completely lost though. There are a lot of good tools and methodologies available!

Until the feature gap closes, let me introduce you to some of these tools, and show you how I use the computer.

I spoke about a variety of topics with the intention of filling in everyone’s knowledge about the useful tools available to users and developers. I included a section about git-bisect and have posted the script in the examples section of the bash-tutor tarball. It is now available for you to download and share.

I hope everyone enjoyed the lecture, and I always appreciate feedback!


scary cool bash scripting inside a Makefile

Makefiles are both scary and wonderful. When both these adjectives are involved, it often makes for interesting hacking. This is likely the reason I use bash.

In any case, I digress, back to real work. I use Makefiles as a general purpose tool to launch any of a number of shell scripts which I use to maintain my code, and instead of actually having external shell scripts, I just build any necessary bash right into the Makefile.

One benefit of all this is that when you type “make <target>”, the <target> can actually autocomplete, which makes your shell experience that much more friendly.

In any case, let me show you the code in question. Please note the double $$ for shell execution and for variable referencing. The calls to rsync and sort make me pleased.

rsync -avz --include=*$(EXT) --exclude='*' --delete dist/ $(WWW)
# empty the file
echo -n '' > $(METADATA)
# the trailing backslashes keep the cd and the loop in one shell invocation
cd $(WWW); \
for i in *$(EXT); do \
	b=$$(basename $$i $(EXT)); \
	V=$$(echo -n $$(basename "`echo -n "$$b" | rev`" \
	"`echo -n "$(NAME)-" | rev`") | rev); \
	echo $(NAME) $$V $$i >> $(METADATA); \
done
sort -V -k 2 -o $(METADATA) $(METADATA) # sort by version key
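
For readers who find the rev/basename trick hard to follow, here is a minimal Ruby sketch of the same idea: strip the extension, strip the "NAME-" prefix, and sort the rows by version. The NAME and EXT values and the file names are hypothetical stand-ins for the Makefile variables, and Gem::Version is used here only as an approximation of what sort -V does.

```ruby
require 'rubygems' # provides Gem::Version (already loaded on modern Ruby)

# Hypothetical stand-ins for the Makefile's $(NAME) and $(EXT) variables.
NAME = 'bash-tutor'
EXT  = '.tar.bz2'

# Strip the extension, then the "NAME-" prefix, leaving only the version.
def version_of(filename, name = NAME, ext = EXT)
  File.basename(filename, ext).delete_prefix("#{name}-")
end

# Build "name version filename" rows ordered by version, roughly what
# `sort -V -k 2` does to the metadata file.
def metadata_rows(filenames, name = NAME, ext = EXT)
  filenames
    .sort_by { |f| Gem::Version.new(version_of(f, name, ext)) }
    .map { |f| "#{name} #{version_of(f, name, ext)} #{f}" }
end

# Hypothetical file names, deliberately out of order.
files = ['bash-tutor-0.10.tar.bz2', 'bash-tutor-0.2.tar.bz2', 'bash-tutor-0.1.tar.bz2']
puts metadata_rows(files)
```

The point of the version-aware sort is that 0.10 lands after 0.2, which a plain string sort would get wrong.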

The full Makefile can be found inside of the bash-tutor tarball.


Using Kanban For DevOps Projects

This is a guest post by Robert Dempsey, CEO & Founder of Atlantic Dominion Solutions. He helps clients with agile training and builds products like scrum’d. I wish I had known about Kanban when I was a network administrator. It would have helped me immensely in terms of prioritization of work and making everything we [...]

DevOps – Operations to Developers

This is part 2 in a general set of discussions on DevOps. Part 1 is here

Production
I have a general rule I've lived by that has served me well and it's NSFW. I learned it in these exact words from a manager many years ago:

"Don't f*** with production"

Production is sacrosanct. You can dick around with anything else but if it's critical to business operations, don't mess with it without good reason and without an audit trail. There is nothing more frustrating than trying to diagnose a two hour outage of a critical system that happened because someone did something they THOUGHT was irrelevant (like a DNS change - I speak from experience). It's even more frustrating when there's no audit trail of WHAT was done so that it can be undone. Meanwhile, you've got 20 different concerned parties calling you every five minutes asking "are we there yet?". How much development work would get done if you operated under the same interrupt driven environment?

Change Control
Yes, it's a hassle and boring and not very rockstar but it's not only critical but sometimes it's the law.

Side note: I pretty much hate meetings in general but they do serve a purpose. My main frustration is that meetings take away time where work could actually be getting done. They always devolve into a glorified gossip session. What should have taken 15 minutes to discuss ends up taking an hour as conversations that started while waiting for that last person to show up carry over into meeting proper. Sadly the person who is late is usually Red Leader and we can't seem to stay on target. Everyone has something they would rather be doing and usually it's something that will actually accomplish something rather than the stupid meeting.

The exception for me has always been change control meetings. I typically enjoy those because that's when things happen. We're finally going to release your cool new feature into production that you've spent a month developing and fine tuning. Of course, this is when we find out that you neglected to mention that you needed firewall rules to go along with it. This is when we find out exactly what that new table is going to be used for and that we MIGHT want to put it in its own bufferpool. All of the things you didn't think of? We bring them to the surface in these meetings because these are pain points we've seen in the past. We think of these things.

Auditing
As mentioned under Production, we typically don't have the benefit of looking over changes in source control. We can't check a physical object into SVN. Sure, there are amazing products like Puppet and Cfengine that make managing server configurations easier. We have applications that can track changes. We have applications that map our switch ports, but it's simply not that easy for us to track down what changed.

Your application is encapsulated in that way. You know what changed, who changed it and (with appropriate comments) WHY it was changed.

Meanwhile a DNS change may have happened, a VLAN change, a DAS change...you name it. Production isn't just your application. It's all the moving parts underneath that power it. That application that you developed that is tested on a single server doesn't always account for the database being on a different machine or the firewall rules associated with it.

Yes, we'd love to have a preproduction environment that mimics production but that's not always an option. We have to have an audit trail. Things have to be repeatable. So no, we can't just change a line in a jsp for you to fix a bug that didn't get caught in testing. It would take us longer to do that on 10 servers than if we just pushed a new build.

Outages
Outages are bad, mmkay? You probably won't lose your job over a bug but I've had to deal with someone being fired because he didn't follow the process and caused an outage. It sucks but we're the ones who get the phone call at 2AM when something is amiss.

And even AFTER the outage, we have to fill out Root Cause Analysis reports sometimes after being up for 24 hours straight fixing a serious issue. You can either write a unit test for a bit of code or you can keep fixing the same bug after every release. We'd prefer you write the unit test, personally.

I know all of these things make us look like a slow, unmoving beast. I know you hate sitting in meeting after meeting explaining that the bug will be fixed just as soon as ops pushes the code. I know that we get pissy and blame you for everything that goes wrong with an application. We're sorry. We're just running on 2 hours of sleep in three days getting the new hardware installed for your application that someone thinks has to go online yesterday. Meanwhile, we're dealing with a full disk on this server and a flaky network connection on another. Cut us some slack.

DevOps and NoSQL – bad naming leads to confusion

I've recently started following a few new topics (where recently means over the past year). Both of them have the potential to be paradigm shifts and, unfortunately, both have somewhat vague names that evoke responses on both sides of the issue.

The one I'm going to focus on right now is DevOps. I intend on doing another post on NoSQL but that all depends on how much free time I can finagle between setting up the nest for baby number 2 and work projects.

Background
I should clarify my background because that plays a large part in how I perceive both of these issues. I'm a systems engineer. No, I don't have a degree in engineering but I wouldn't call the work I've done over the years any less than that. I've been the intermediary between DBAs and Developers. I've spent 20+ hours on my feet in a frigid datacenter racking servers. I've done high-level architecture of disparate system integration. I've done low-level implementation of disparate system integration. I've been up at 4AM to do deploys of new code during the 30 minute maintenance window. I've been the guy getting the pages and been the guy calling people who were supposed to get the pages.

I've been in big shops and small shops. I've been responsible for systems that pass millions of dollars and systems that are critical to education.

I don't say all this to toot my own horn. It's just background that is relevant to the discussion.

DevOps
So what's this DevOps thing that people keep throwing around? Well there are tons of opinions and all of them are like certain sphincter muscles. Not one is entirely on the money but the background work has been done here:


So what is it? I think at the core it's about closely integrating the "SysOp" silo with the "Developer" silo as a methodology. But why is this important?

SysOps have always been apart from the rest of the IT department in a sense. While many groups have frequent overlapping areas, the operations team has the final responsibility. As I like to put it, they're the folks getting the phone call. Unless the organization is small, most developers aren't even in the loop unless a bug report is filed after an outage. As it was put elsewhere, many times software is thrown "over the wall" to be deployed. But why is this? I think that's key to the whole issue.

Roles, Responsibilities and Titles
I'm not a stickler for titles. I've held many over the years, from Administrator to Director. In one interesting case, I was given a title (and the subsequent responsibility) simply for the purpose of interacting with a client who had firm opinions about only interfacing with someone at the same level. This didn't take away any responsibilities; it only added to them. Titles, roles and responsibilities are all different things.

In this way, the organizational title for "IT Operations" denotes a clear differentiator from "Developer". There are certain expectations from your operations team. Production stays stable, for instance. Many times the goals of the Operations team are in direct opposition to those of the Development team. Make no mistake, however. The developers are part of revenue generation while those of operations are not. Operations exists as fire fighters. If Operations is doing its job properly, they aren't actually doing much of their primary responsibility. They have quite a bit of downtime.

So why is there a need for a DevOps movement?
I think on one hand, there is an increasing frustration from the end-user (in this case development) in its interaction with operations. Development methodologies are changing rapidly. Some changes are for the better (fewer bugs, more testing) while others create friction with how a production environment operates (frequent releases). Another aspect is people transitioning from one role to the other. You have people moving into development from an operations background and vice versa. People change. They discover that they enjoy X more than Y. With each of these transitions, a mindset and attitude is brought along. An Ego.

The developer who moves into production operations laments the slow sluggish pace at which things move. The operations guy who moves into development loves the fast and fluid nature of Agile development. Both feel the need to reconcile the two worlds thinking they can impart some sort of wisdom from one side of which the other was not aware.

Additionally, in times where the leanest team that is first to market often wins, many people are wearing multiple hats. See the rise of IaaS (Infrastructure as a Service), Amazon Web Services, NoSQL and other technologies where traditional roles are eliminated.

Both sides have a lot to learn from each other and both sides need to understand the constraints each team has. This is where I feel DevOps has the most to offer as an ideal: integrating operations into development and letting development be a part of operations. The specifics are still up in the air but I think there are some key areas that each side needs to understand about the other. I'll follow up on those in the next post for logical grouping purposes.

As always, comments are welcome!

Infrastructure testing with MCollective and Cucumber

Some time ago I showed some sample code I had for driving MCollective with Cucumber. Today I’ll show how I did that with SimpleRPC.

Cucumber is a testing framework. It might not be the perfect fit for systems scripting, but you can achieve a lot if you bend it a bit to your will. Ultimately I am building up to using it for testing, but we need to start with how to drive MCollective first.

The basic idea is that you write a SimpleRPC Agent for your needs like the one I showed here. The specific agent has a number of tasks it can perform:

  • Install, Uninstall and Update Packages
  • Query NRPE status for a specific NRPE command
  • Start, Stop and Restart Services

These features are all baked into a single agent, perfect for driving from a set of Cucumber features. The sample I will show here is only driving the IPTables agent since that code is public and visible.

First I’ll show the feature I want to build, we’re still concerned with driving the agent here not testing so much – though the steps are tested and idempotent:

Feature: Manage the iptables firewall
 
    Background:
    Given the load balancer has ip address 192.168.1.1
    And I want to update hosts with class /dev_server/
    And I want to update hosts with fact country=de
    And I want to pre-discover the nodes to manage
 
    Scenario: Manage the firewall
        When I block the load balancer
        Then traffic from the load balancer should be blocked
 
        # other tasks like package management, service restarts
        # and monitor tasks would go here
 
        When I unblock the load balancer
        Then traffic from the load balancer should be unblocked

To realize the above we’ll need some setup code that fires up our RPC client and manages options in a single place; we’ll put this in support/env.rb:

require 'mcollective'
 
World(MCollective::RPC)
 
Before do
    @options = {:disctimeout => 2,
                :timeout     => 5,
                :verbose     => false,
                :filter      => {"identity"=>[], "fact"=>[], "agent"=>[], "cf_class"=>[]},
                :config      => "etc/client.cfg"}
 
    @iptables = rpcclient("iptables", :options => @options)
    @iptables.progress = false
end

First we load up the MCollective code and install it into the Cucumber World, this achieves more or less what include MCollective::RPC would in a Cucumber friendly way.

We then set some sane default options and start our RPC client.
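
To picture what mixing a module into the Cucumber World means, here is a minimal plain-Ruby sketch. It assumes the World is an ordinary per-scenario object that gets extended with the given modules; FakeRPC and its rpcclient method are hypothetical stand-ins, not the real MCollective::RPC.

```ruby
# Hypothetical helper module standing in for MCollective::RPC.
module FakeRPC
  def rpcclient(agent)
    "client for #{agent}"
  end
end

# A per-scenario object extended with the module, so that blocks evaluated
# inside it can call the module's methods without any prefix.
world = Object.new
world.extend(FakeRPC)

# Step and hook blocks run in the context of that object, much like this.
result = world.instance_eval { rpcclient('iptables') }
puts result
```

This is why the Before block can call rpcclient directly.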

Now we can go onto writing some steps, we store these in step_definitions/mcollective_steps.rb, first we want to capture some data like the load balancer IP and filters:

Given /^the (.+) has ip address (\d+\.\d+\.\d+\.\d+)$/ do |device, ip|
    @ips = {} unless @ips
 
    @ips[device] = ip
end
 
Given /I want to update hosts with fact (.+)=(.+)$/ do |fact, value|
    @iptables.fact_filter fact, value
end
 
Given /I want to update hosts with class (.+)$/ do |klass|
    @iptables.class_filter klass
end
 
Given /I want to pre-discover the nodes to manage/ do
    @iptables.discover
 
    raise("Did not find any nodes to manage") if @iptables.discovered.size == 0
end

Here we’re just creating a table of device names to ips and we manipulate the MCollective Filters. Finally we do a discover and we check that we are actually matching any hosts. If your filters were not matching any nodes the cucumber run would bail out.
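
To see what the first step definition captures, here is a small standalone sketch that applies the same regular expression to the step text from the feature file (Cucumber matches the text after the Given/And keyword). The capture helper is a hypothetical name used only for this illustration.

```ruby
# The step pattern copied from the step definition above.
STEP = /^the (.+) has ip address (\d+\.\d+\.\d+\.\d+)$/

# Return [device, ip] for a matching step line, or nil when it does not match.
def capture(step_text)
  m = STEP.match(step_text)
  m && m.captures
end

# Build the same device-name-to-IP table the step builds in @ips.
ips = {}
device, ip = capture('the load balancer has ip address 192.168.1.1')
ips[device] = ip

p ips
```

The device name becomes the hash key, which is why the later steps can look up "load balancer" by the same wording.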

Now we want to first do the work to block and unblock the load balancers:

When /^I block the (.+)$/ do |device|
    raise("Unknown device #{device}") unless @ips.include?(device)
 
    @iptables.block(:ipaddr => @ips[device]) 
 
    raise("Not all nodes responded") unless @iptables.stats[:noresponsefrom].size == 0
end
 
When /^I unblock the (.+)$/ do |device|
    raise("Unknown device #{device}") unless @ips.include?(device)
 
    @iptables.unblock(:ipaddr => @ips[device])
 
    raise("Not all nodes responded") unless @iptables.stats[:noresponsefrom].size == 0
end

We do some very basic sanity checks here, simply catching nodes that did not respond and bailing out if there are any. Key is to note that to actually manipulate firewalls on any number of machines is roughly 1 line of code.

Now that we’re able to block and unblock IPs we also need a way to confirm those tasks were 100% done:

Then /^traffic from the (.+) should be blocked$/ do |device|
    raise("Unknown device #{device}") unless @ips.include?(device)
 
    # count the nodes that report the IP as not blocked
    unblockedon = @iptables.isblocked(:ipaddr => @ips[device]).inject(0) do |c, resp|
        resp[:data][:output] =~ /is not blocked/ ? c + 1 : c
    end
 
    raise("Not blocked on: #{unblockedon} / #{@iptables.discovered.size} hosts") if unblockedon > 0
    raise("Not all nodes responded") unless @iptables.stats[:noresponsefrom].size == 0
end
 
Then /^traffic from the (.+) should be unblocked$/ do |device|
    raise("Unknown device #{device}") unless @ips.include?(device)
 
    # count the nodes that still report the IP as blocked
    blockedon = @iptables.isblocked(:ipaddr => @ips[device]).inject(0) do |c, resp|
        resp[:data][:output] =~ /is blocked/ ? c + 1 : c
    end
 
    raise("Still blocked on: #{blockedon} / #{@iptables.discovered.size} hosts") if blockedon > 0
    raise("Not all nodes responded") unless @iptables.stats[:noresponsefrom].size == 0
end

This code does actual verification that the clients have the IP blocked or not. It also highlights that perhaps my iptables agent needs some refactoring: both steps check for the existence of a string pattern in the result, so I could make the agent return a Boolean in addition to human readable results. This would make the agent easier to use from a program like this.
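
As a rough sketch of what that refactoring could buy us, the snippet below works on plain hashes shaped like agent replies. The :blocked key does not exist in the agent today, and the node names, the reply data and the mismatched helper are all hypothetical.

```ruby
# Hypothetical replies: each hash mimics one node's response, with an assumed
# :blocked boolean next to the human readable :output string.
replies = [
  { :sender => 'web1', :data => { :output => '192.168.1.1 is blocked',     :blocked => true  } },
  { :sender => 'web2', :data => { :output => '192.168.1.1 is not blocked', :blocked => false } },
  { :sender => 'web3', :data => { :output => '192.168.1.1 is blocked',     :blocked => true  } }
]

# Names of the nodes whose :blocked flag differs from what we expect.
def mismatched(replies, expected)
  replies.reject { |r| r[:data][:blocked] == expected }.map { |r| r[:sender] }
end

not_blocked = mismatched(replies, true)
puts "Not blocked on: #{not_blocked.join(', ')}" unless not_blocked.empty?
```

With a Boolean in the reply, the "should be blocked" and "should be unblocked" steps could share one helper and report which nodes disagree, instead of pattern matching on text.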

That’s all there is to it really, MCollective RPC makes reusing code very easy and it makes addressing networks very easy.

 

Monitoring / Infrastructure Testing

The above code demonstrates how using MCollective+Cucumber you can address any number of machines, perform actions and get states within a testing framework. This seems an uncomfortable fit – since Cucumber is a testing framework – but it doesn’t need to be.

Above I am using cucumber to drive actions but it would be great to use this combination to do testing of infrastructure states using something like cucumber-nagios. The great thing that MCollective brings to the table here is that you can have sets of tests that change behavior with the environment while having the ability to break out of the single box barriers.

With this you can easily write a kind of infrastructure test that transcends machine boundaries. You could check the state of one set of variables on one set of machines, and based on the value of those go and check that other machines are in a state that makes those variables valid variables to have.

We’re able to answer those ‘this machine is doing x, did the admin remember to do y on another machine?’ style questions. Examples of this could be:

  • If the backups are running, did the cron job that takes a database out of the service pool get run? This would flag up at any time, even if someone is doing a manual run of backups.
  • How many Puppet Daemons are currently actively doing manifests on all our nodes? Alert if more than 10. Even this simple case is hard – you need a view of the status of an application in real time across many nodes, and it requires information from now rather than the usual 5 minute window of Nagios.
  • If there are 10 concurrent puppetd runs happening right now, is the puppet master coping? This test would stay green, and not care for the master until the time comes that there are many puppetd’s doing manifest runs. This way if your backups or sysadmin action pushes the load up on the master the check will stay green, it will only trigger if you’re seeing many Puppet clients running. This could be a useful indicator for capacity planning.

These simple cases are generally hard for systems like Nagios: it’s hard to track the state of many checks, apply logic and then go CRITICAL if a combination of factors gives a failure. We can build such test cases with MCollective and Cucumber fairly easily.

The code here does not really show you how to do that per se, but what it does show is how natural and easy it is to interact with your network of hosts via MCollective and Ruby. In future I might post some more code here to show how we can build on these ideas and create test suites as described. As an example, a test case for the above Puppet Master example might be:

Feature: Monitor the capacity of the Puppet Master
 
    Background:
    Given we know we can run 10 concurrent Puppet clients
    And the Puppet Master load average should be below 2
 
    Scenario: Monitor the Puppet Master capacity 
        When there are more than usual Puppet clients running
        Then the Puppet Master should have an acceptable load average

Running this under cucumber-nagios we’ll achieve our stated goals.
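
The decision behind such a check could look roughly like the sketch below. It works on plain values rather than live MCollective replies, and the master_check name, its parameters and the sample numbers are all hypothetical; the thresholds come from the Background section of the feature.

```ruby
# Decide the check result from two facts gathered across machines.
# running_clients: one entry per node, 1 if a Puppet run is active, else 0.
# master_load:     load average reported by the Puppet Master.
def master_check(running_clients, master_load, max_clients: 10, max_load: 2.0)
  busy = running_clients.sum
  # Few clients running: stay green no matter what else loads the master.
  return :ok if busy < max_clients
  # Many clients running: now the master's load average matters.
  master_load < max_load ? :ok : :critical
end

puts master_check([1, 1, 1, 0, 0], 9.0)  # master is busy, but not because of Puppet
puts master_check(Array.new(12, 1), 1.5) # many clients, master is coping
puts master_check(Array.new(12, 1), 3.2) # many clients, master is struggling
```

Only the last call should go CRITICAL: that is the case where the combination of factors, rather than either value on its own, signals a capacity problem.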

As a small post note, figuring out how many Puppet Daemons are currently running their manifests is trivial with the Puppet Agent:

require 'mcollective'

include MCollective::RPC

p = rpcclient("puppetd")
p.progress = false
 
# add up the :running value reported by each node
running = p.status.inject(0) {|c, status| c + status[:data][:running]}
puts("Currently running: #{running}")

$ ruby test.rb
Currently running: 3

The application is only running on the developer system

Hi, today I'm going to write about a situation every sysadmin has already encountered. The sysadmin gets a new version of some type of software and should install it on a server. After some hours of trying he calls the developer and tells him he's not getting the application to start. The first answer all of us get: "But it is running on my PC." Let the discussion start. ;) In my opinion it