↓ Archives ↓

Category → monitoringsucks

Our #monitoringsucks rpm is repository available

Not only our Rubygems Builds have changed, but also my internal #monitoringsucks repository.

You might have noticed a variety of vagrant- projects on my github acount

Being the #monitoringsucks part of them. All of those Vagrant projects are basically my test setups to play with those new tools.

They contain a bunch of puppet modules that install and configure these tools. (Note that they mostly consist of
of git submodules to other puppet module repositories.

Given the fact that I also like to have my software cleanly installed from a package, that means that some of these tools had to be packaged, or I had to create a personal / internal repository which had packages from upstream that were hiding on the internet available.

I've forked of this repository off the internal Inuits epository so you all can also benefit from these efforts.
(You gotta love pulp :))

That means you can now install all of the above mentionned #monitoringsucks tool from our public repo on

  1. yumrepo { 'monitoringsucks':
  2. baseurl => 'http://pulp.inuits.eu/pulp/repos/monitoring',
  3. descr => 'MonitoringSuck at Inuits',
  4. gpgcheck => '0',
  5. }

Patches to both the Vagrant projects and the puppet modules are welcome ...

Discovering Switches: It’s amazing what you can learn just by listening…

We recently added code to discover switches, switch ports and settings - all in the Steath DiscoveryTM way - without sending out any packets at all! So now you know which switches and which switch ports every monitored server is plugged into. As a bonus we pick up some interesting configuration information on your switch and your particular switch port - just by perking our ears up and listening... Now when you send someone to the closet to do something to your switch port, there is no doubt which port is yours - regardless of that little mistake in the cross-connects, or that tiny error in documentation. [Anyone want to write an iPad switch mapping app for this?]

Clients, Servers and Dependencies, Oh My!

One of the things that people have gotten most excited about in the Assimilation Monitoring Project in the area of discovery is the discovery of clients, servers and particularly dependencies. That code is now in the Assimilation code base. We discover client processes, server processes, and their interconnections. In this post, I'll explore how this works and what this looks like in the Neo4j graph database in this article. These dependencies are discovered without port scanning or packet sniffing - using Stealth DiscoveryTM methods.

What is Stealth Discovery?

One of the most interesting features in the Assimilation Monitoring Project is that it includes Continuous Integrated Stealth DiscoveryTM as part of the monitoring operation. So, what exactly is that, and why do I think is it so cool? Well, let's start with what it is, and hopefully it will become clear along the way why it's well worthwhile... Continuous - it runs all the time - discovering new and changed things in minutes to...

Managing Computer Systems with Dependency Information

From the perspective of managing a set of applications, a single tenant data center, or a multiple tenant data center, perhaps the most interesting, useful and fundamental information that one can come up with might be dependencies - and dependencies are basically graphs

Monitoring URLs by the thousands in Nagios

10K websites x 5 URL's to monitor

For our Atlassian Hosted Platform, we have about 10K websites we need to monitor. Those sites are monitored from a remote location to measure responsetime and availability. Each server would have about 5 sub URLs on average to check, resulting in 50K URL checks.

Currently we employ Nagios with check_http and require roughly about 14 Amazon Large Instances. While the nagios servers are not fully overloaded, we make sure that all checks would complete within a 5 minutes check cycle.

In a recent spike we investigated if we could do any optimizations to:

  • use less server resources, not only to reduce costs but also avoiding the management of multiple nagios servers which don't dynamically rebalance checks across multiple nagios hosts.
  • have all checks complete within a smaller window (say 1 minute), as this would increase our MTTD

While looking at this, we wanted the technology to be reusable with our future idea of a fully scalable and distributed monitoring in mind (think Flapjack or the new kid on the block Sensu). But for now, we wanted to focus on the checks only.

In the first blogpost of the series we look at the integration and options within Nagios. In a second blogpost we will provide proof of concept code for running an external process (ruby based) to execute and report back to nagios. Even though Nagios isn't the most fun to work with, a lot of solutions that try to replace it, focus on replacing the checks section. But Nagios gives you more the reporting, escalation, dependency management. I'm not saying there aren't solutions out there, but we consider that to be for another phase.

Check HTTP

The canonical way in Nagios to run a check is to execute Check_http.

F.i. to have it execute a check if confluence is working on https://somehost.atlassian.net/wiki , we would provide the options:

  • -H (virtual hostname), -I (ipaddress) , -p (port)
  • -u (path of url) , -S (ssl) , -f follow (follow redirects)
  • -t (timeout)

    $ /usr/lib64/nagios/plugins/check_http -H somehost.atlassian.net -p 443 -u /wiki -f follow -S -v -t 2 HTTP OK: HTTP/1.1 200 OK - 546 bytes in 0.734 second response time |time=0.734058s;;;0.000000 size=546B;;;0

Some observations:

  1. For each check configure Nagios will fork twice and exec check_http, avoiding this would improve performance as fork is considered expensive.
  2. If we were to have many URL's on the same host, we can't leverage connection reuse, making it less efficient
  3. For status checking, we can configure it to use the -J HEAD if our check doens't rely on the content of the page (saving on transfer time and reduce check time)
  4. Redirects: not an issue of Nagios, but we currently have quite a few redirects going from the login-page logic, reducing those would again improve check time.

We can reduce part of the forks by using the use_large_installation_tweaks=1 setting. The benefits and caveats are explained in the docs

Check scheduling

Nagios itself tries to be smart to schedule the checks. It tries to spread the number of service checks within the check interval you configure. More information can be found in older Nagios documentation .

Configuration options that influence the scheduling are:

  • normal_check_interval : how long between re-executing the same check
  • retry_check_interval : how fast to retry a check if it failed
  • check_period: total time for a complete check cycle
  • inter_check_delay_method: method to schedule checks (
  • service_interleave_factor: time between checks to the same remote host
  • max_concurrent_checks: obvious not ?
  • service_reaper_frequency : frequency to check for rogue checks

Default for the inter_check_delay_method is to use smart, if we want to execute the checks as fast as possible

  • n = Don't use any delay - schedule all service checks to run immediately (i.e. at the same time!)
  • d = Use a "dumb" delay of 1 second between service checks
  • s = Use a "smart" delay calculation to spread service checks out evenly (default)
  • x.xx = Use a user-supplied inter-check delay of x.xx seconds

Distributing checks

When one host can't cut it anymore, we have to scale eventually. Here are some solutions that live completely in the Nagios world:

Our future solution would have a similar approach to dispatching the checks command and gathering the results back over queue, but we'd like it to be less dependent on the Nagios solution and be possible to be integrated with other monitoring solutions (Think Unix Toolchain philosophy) A great example idea can be seen in the Velocityconf presentation Asynchronous Real-time Monitoring with Mcollective

Submitting check results back to Nagios

So with distribution we just split our problem again in smaller problems. So let's focus again on the single host running checks problem, after all, the more checks we can run on 1 host, the less we have to distribute.

Nagios Passive Checks easily allow you to uncouple the checks from your main nagios loop and submit the check results later. NSCA (Nagios Service Check Acceptor) is the most used solution for this.

NSCA does have a few limitations:

Opsview writes:

  • Only the first 511 bytes of plugin out was returned to the master, limiting the usefulness of the information you could display
  • Only the 1st line of data was returned, meaning you had to cramp output together
  • NSCA communication used fixed size packets which were inefficient
  • While results were sent, Nagios would wait for completion, introducing a bottleneck
  • If there was a communication problem with the master, results were dropped

This lead them to using NRD (Nagios Result Distributor)

Ryan Writes:

"What no one tells you when you are deploy NCSA is that it send service checks in series while nagios performs service checks in parallel"

This lead him to writing A highperformance NSCA replacement involving feeding the result direct into the livestatus pipe instead of over the NSCA protocol baked into nagios On a similar note Jelle Smet has created NSCAWEb Easily submit passive host and service checks to Nagios via external commands

We would leverage the Send NSCA Ruby Gem

Why is this relevant to our solution? Without employing some of these optimizations, our bottleneck would shift from running the checks to accepting the check results.

Another solution could be run an NRPE server , and we could probably leverage some ruby logic from Metis - a ruby NRPE server


Even after the following optimizations:

  • using head vs get
  • large installation tweaks
  • tuning the inter_check_delay_method
  • parallel NSCA submissions vs serial submissions

we can still optimize with:

  • avoid the fork process by running all checks from the same process
  • reusing the http connection across multiple requests for the same host (potentially even do http pipelining

In the next blogpost we will show the results of proof of concept code involving ruby/eventmachine/jruby and various httpclient libraries.

The ultimate 2012 open source and devops conference

Kent Skaar pinged me last week , asking for feedback on Lisa'11 and input for Lisa 2012.

Thought I should share my advise to him with the rest of the world

So If I were to host an event similar to Lisa I'd had either
Jordan Sissel or Mitchell Hashimoto give the keynote because over the past 24 months those people have written more relevant tools for me than anyone else :)

I'd have someone talk about Kanban for Operations, There's 2 names that pop up Dominica DeGrandis and Mattias Skarin

I'd have the Ubuntu folks talk about JuJu and I'd have RI Pienaar talk about MCollective .. while you have RI have him talk about Hiera too. Have Dean Wilson carry RI's bags and put him unknowingly on a panel. (Masquerade it as a Pub with hidden cameras)

Obviously as #monitoringsucks you want to hear about new monitoring tools initiatives and how people are dealing with them , so you want people talking about Graphite, Collectd, Statsd, Sensu , Icinga-MQ And how people are reviving Ganglia and using that in large scale environments.

You want someone to demistify Queues, I mean .. who still knows about the differences between Active, Rabbit , Zero, Hornet and many other Q's ?

You want people talking about how they deal with logs, so talks about Logstash and Graylog2.

You want to cover Test Driven Infrastructure How do you test your infrastructure , someone to demystify Cucumber and Webrat , and talk about testing Charms, Modules, and Cookbooks.

Oh and Filesystems , distributed ones the Ceph, FraunhoverFS, Moose, KosmosFS, Glusters, Swifts of this world ... you want people to talk about their experiences , good and bad with any of the above, someone who can actually compare those rather than heresay stuff. :) With recent updates on what's going on in these projects.

Now someone please organise this for me :) In a warm and sunny place ... preferably with 27 holes next door , and daycare for my kids :)

PS. Yes the absence of any openstack related topic is on purpose .. that's for 2013 :)

We didn’t fix it

MonitoringSucks and we didn't fix it.

Earlier this week Inuits hosted a 2 day hackfest titled #MonitoringSucks. A good number of people with a variety of backgrounds showed up on monday morning. I don't know why but people had high expectations for this event , did they really expect us to fix the #monitoringsucks problem in a mere 2 days ?

Next to myselve we had Patrick Debois , Grégory Karékinian, Stefan Jourdan, Colin Humphreys, Andrew Crump, Ohad Levy , Frank Marien, Toshaan Bharvani, Devdas Bhagat, Maciej Pasternacki Axel Beckert Jelle Smet, Noa Resare @blippie , John John Tedro @udoprog, Christian Trabold @ctrabold and obviously some people I missed

A good mixture of Fosdem visitors that stayed a litte longer in our cold country and locals with ideas. We had people from TomTom, RedHat , Spotify, Booking.com, Inuits, Atlassian, coming from Belgium, The Netherlands France, Israel, the UK, Sweden, Germany, Poland and Switzerland if I`m not mistaken.

The format was pretty open, much of the first day was spend around the drawingboard.

people around the drawingboard
(Ohad Levy, Jelle Smet, PatrickDebois and Frank Marien) discussing a variety of topics

This monitoring topic is complex, there are different areas that need to be covered. The drawing below documents how we splitted the problem into different areas , and listed the different tools people use for these areas.

  • Collection: Collectd, Nagios, Ganglia
  • Transport: XMPP, Smiple, Smtp, 0mq , APMQ, rsyslog, irc, stomp
  • Storage : rrd, graphite, opentsdb, hbase,
  • Filtering: logstash, esper,
  • Visualisation : Graphite,
  • Notifcation: PagerDuty
  • Reporting: Jasper

Obviously above list is far from complete.

The afternoon discussion continued where we left of before lunch, just after the powercut. Only now we started refocussing on filtering and aggregating values using Logstash
@patrickdebois had been talking about the idea to use Logstash as a way to collect data , transform it and throw it either to another tool, or onto a Queue before.
Looking at Logstash it makes kind of sense. Logstash already has a zillion of input types, filters and outputs. Including popular queues such as amqp and zeromq. Yes, the default behaviour for a lot of people is to get data from different inputs, filter it and then send it to ElasticSearch, but much more is possible with the available outputs.

It was only on tuesday that people really started writing code
So what did really come out of the #monitoringsucks hackfest. ?

A couple of people were working on packaging existing tools for their favourite distro. Others were working on integrating a number of other already existing tools (e.g Patrick working on more inputs for Logstash., me working on replacing logster with Logstash, setting up Kibana etc. New tools were learned, items were added to todolists (Kibana, (doesn't work on older Firefox instances) Tattle, statsd) and items were scratched from todolists (Graylog2 (Kibana replaces that as a good Frontend for Logstash) )

A lot of experiences with different tools were exchanged

Frank Marien showed us a demo of his freshly release ExtremeMon framework. A really promising project.

The sad part about a workshop like this one is that you enter with a bunch of ideas , and leave with even more ideas, hence more work. We haven't solved the problem yet, but a lot of more people are now thiking about the problem and how to solve it a more modulare (unix style) approach. With different litte tools, all being good at something and all being interconnectable.

#monitoringsucks hackathon 6&7 february Practical details:

As announced earlier next monday and tuesday we're opening up the Inuits offices for everybody working on monitoring problems.

There's already a good number of people that have confirmed their presence and some people have asked

As for practical details .. the plan is simple.
I`m going to be at the place somewhere between 8:30 and 9:00 on monday. ( Hey .. it's the day after Fosdem you know :))

The only thing I've planned is to do a get to know eachother round around 10:30 after that I`m expecting the hackathon to be self organising,

There will be water, coffee , etc , IP connectivity, and electricity.

The location is still Duboisstraat 50, Antwerp

Free parking is on the Hardenvoort or Kempenstraat ( 3minutes walk) , paid parking right in front of the door.

Monitoring Wonderland Survey – Visualization

A picture tells more than a ...

Now that you've collected all the metrics you wanted or even more , it's time to make them useful by visualizing them. Every respecting metrics tool provides a visualization of the data collected. Older tools tended to revolve around creating RRD graphics from the data. Newer application are leveraging javascript or flash frameworks to have the data updated in realtime and rendered by the browser. People are exploring new ways of visualizing large amounts of data efficiently. A good example is Visualizing Device Utilization by Brendan Gregg. or Multi User - Realtime heatmap using Nodejs

Several interesting books have been written about visualization:

Dashboard written for specific metric tools


Graphs are Graphite's killer feature, but there's always room for improvement:

Grockets - Realtime streaming graphite data via socket.io and node.js


Graphs in Opentsdb are based on Gnuplot




Nagios also has a way to visualize metrics in it's UI

Overall integration

With all these different systems creating graphs, the nice folks from Etsy have provided a way to navigate the different systems easily via their dashboard - https://github.com/etsy/dashboard

I also like the Idea of Embeddable Graphs as http://explainum.com implements it

Development frameworks for visualization

Generic data visualization

There are many javascript graphing libraries. Depending on your need on how to visualize things, they provide you with different options. The first list is more a generic graphic library list

Time related libraries

To plot things many people now use:

For timeseries/timelines these libraries are useful:

And why not have Javascript generate/read some RRD graphs :

Annotations of events in timeseries:

On your graphs you often want event annotated. This could range from plotting new puppet runs , tracking your releases to everything that you do in the proces of managing your servers. This is what John Allspaw calls Ops-Metametrics

These events are usually marked as vertical lines.

Dependencies graphs

One thing I was wondering is that with all the metrics we store in these tools, we store the relationships between them in our head. I researched for tools that would link metrics or describe a dependency graph between them for navigation.

We could use Depgraph - Ruby library to create dependencies - based n graphviz to draw a dependency tree, but we obviously first have to define it. Something similar to the Nagios dependency model (without the strict host/service relationship of course)


With all the libraries to get data in and out and the power of javascript graphing libraries we should be able to create awesome visualizations of our metrics. This inspired me and @lusis to start thinking about creating a book on Metrics/Monitoring graphing patterns. Who knows ...