10K websites x 5 URLs to monitor
For our Atlassian Hosted Platform, we have about 10K websites to monitor. Those sites are monitored from a remote location to measure response time and availability. Each server has about 5 sub-URLs on average to check, resulting in 50K URL checks.
Currently we employ Nagios with check_http, requiring roughly 14 Amazon Large Instances. While the Nagios servers are not fully loaded, we make sure that all checks complete within a five-minute check cycle.
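Back-of-the-envelope: 50K checks every 300 seconds works out to roughly 167 checks per second overall, or about 12 checks per second per Nagios host.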
In a recent spike we investigated whether we could do any optimizations to:
- use fewer server resources, not only to reduce costs but also to avoid managing multiple Nagios servers, which don't dynamically rebalance checks across hosts.
- have all checks complete within a smaller window (say 1 minute), as this would reduce our MTTD (mean time to detect).
While looking at this, we wanted the technology to be reusable with our future idea of fully scalable and distributed monitoring in mind (think Flapjack, or the new kid on the block, Sensu). But for now, we wanted to focus on the checks only.
In the first blogpost of the series we look at the integration and options within Nagios. In a second blogpost we will provide proof-of-concept code for running an external (Ruby-based) process to execute checks and report back to Nagios. Even though Nagios isn't the most fun to work with, a lot of solutions that try to replace it focus only on replacing the checks section. Nagios gives you more: reporting, escalation, dependency management. I'm not saying there aren't solutions out there, but we consider that to be for another phase.
The canonical way in Nagios to run a check is to execute check_http.
For instance, to check whether Confluence is working on https://somehost.atlassian.net/wiki, we would provide the options:
- -H (virtual hostname), -I (IP address), -p (port)
- -u (URL path), -S (SSL), -f follow (follow redirects)
$ /usr/lib64/nagios/plugins/check_http -H somehost.atlassian.net -p 443 -u /wiki -f follow -S -v -t 2
HTTP OK: HTTP/1.1 200 OK - 546 bytes in 0.734 second response time |time=0.734058s;;;0.000000 size=546B;;;0
- For each check configured, Nagios will fork twice and exec check_http; avoiding this would improve performance, as forking is considered expensive.
- With many URLs on the same host, we can't leverage connection reuse, making the checks less efficient.
- For status checking, we can configure check_http to use the HEAD method (-J HEAD) if our check doesn't rely on the content of the page, saving on transfer time and reducing check time.
- Redirects: not a Nagios issue, but we currently have quite a few redirects in the login-page logic; reducing those would again improve check time.
We can reduce part of the forks by using the use_large_installation_tweaks=1 setting. The benefits and caveats are explained in the docs.
Nagios itself tries to be smart about scheduling checks: it spreads the service checks within the check interval you configure. More information can be found in the older Nagios documentation.
Configuration options that influence the scheduling are:
- normal_check_interval: how long between re-executions of the same check
- retry_check_interval: how fast to retry a check if it failed
- check_period: total time for a complete check cycle
- inter_check_delay_method: how the start of checks is spread out (see below)
- service_interleave_factor: how checks against the same remote host are spread out
- max_concurrent_checks: the maximum number of checks run in parallel
- service_reaper_frequency: how often Nagios reaps the results of completed checks
The default for inter_check_delay_method is the "smart" calculation; if we want to execute the checks as fast as possible, we can change it. The possible values are:
- n = Don't use any delay - schedule all service checks to run immediately (i.e. at the same time!)
- d = Use a "dumb" delay of 1 second between service checks
- s = Use a "smart" delay calculation to spread service checks out evenly (default)
- x.xx = Use a user-supplied inter-check delay of x.xx seconds
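To make these knobs concrete, here's what a nagios.cfg fragment combining them might look like. The values are illustrative assumptions, not recommendations from our spike; tune them against your own check volume:

```
# nagios.cfg - example scheduling tweaks (illustrative values only)
use_large_installation_tweaks=1   # fewer forks per check
inter_check_delay_method=0.05     # user-supplied delay: start a check every 0.05s
service_interleave_factor=s       # let Nagios spread checks against the same host
max_concurrent_checks=500         # allow up to 500 checks in parallel
service_reaper_frequency=5        # process finished check results every 5 seconds
```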
When one host can't cut it anymore, we eventually have to scale out. Several solutions exist that live completely in the Nagios world.
Our future solution would take a similar approach, dispatching the check commands and gathering the results back over a queue, but we'd like it to be less dependent on Nagios and possible to integrate with other monitoring solutions (think Unix toolchain philosophy). A great example of the idea can be seen in the Velocityconf presentation Asynchronous Real-time Monitoring with MCollective.
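To make the dispatch-over-a-queue idea concrete, here is a minimal Ruby sketch using the bunny AMQP gem. The queue names, broker address and JSON payload are all invented for illustration; this is the shape of the idea, not our implementation:

```ruby
require 'bunny'
require 'json'

conn = Bunny.new('amqp://localhost')   # assumption: a local AMQP broker
conn.start
channel = conn.create_channel

checks  = channel.queue('checks')      # dispatcher -> workers
results = channel.queue('results')     # workers -> collector

# Dispatcher side: push a check command onto the queue
payload = { host: 'somehost.atlassian.net', path: '/wiki' }.to_json
channel.default_exchange.publish(payload, routing_key: checks.name)

# Worker side: pop a check, run it, publish the result
checks.subscribe do |_delivery_info, _properties, body|
  check = JSON.parse(body)
  # ... run the actual HTTP check here ...
  result = { host: check['host'], status: 0, output: 'HTTP OK' }.to_json
  channel.default_exchange.publish(result, routing_key: results.name)
end
```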
Submitting check results back to Nagios
With distribution we just split our problem into smaller problems. So let's focus again on the problem of a single host running checks; after all, the more checks we can run on one host, the fewer we have to distribute.
NSCA does have a few limitations:
- Only the first 511 bytes of plugin output were returned to the master, limiting the usefulness of the information you could display
- Only the first line of output was returned, meaning you had to cram everything together
- NSCA communication used fixed-size packets, which was inefficient
- While results were being sent, Nagios would wait for completion, introducing a bottleneck
- If there was a communication problem with the master, results were dropped
This led them to use NRD (Nagios Result Distributor).
"What no one tells you when you are deploy NCSA is that it send service checks in series while nagios performs service checks in parallel"
This led him to write a high-performance NSCA replacement that feeds the results directly into the Livestatus pipe instead of going over the NSCA protocol baked into Nagios. On a similar note, Jelle Smet has created NSCAweb to easily submit passive host and service checks to Nagios via external commands.
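To illustrate the external-command route, here's a minimal Ruby sketch that submits a passive service check result straight into Nagios using the documented PROCESS_SERVICE_CHECK_RESULT syntax. The host/service names and the command-file path are assumptions; check the command_file setting in your nagios.cfg:

```ruby
# Submit a passive service check result via Nagios' external command file.
CMD_FILE = '/var/lib/nagios/rw/nagios.cmd'  # assumption: depends on your install

def submit_result(host, service, return_code, output)
  # Documented format: [timestamp] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<return_code>;<plugin_output>
  line = format('[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s',
                Time.now.to_i, host, service, return_code, output)
  File.open(CMD_FILE, 'a') { |f| f.puts(line) }
end

submit_result('somehost.atlassian.net', 'confluence-wiki', 0, 'HTTP OK: 200 in 0.734s')
```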
For our solution, we would leverage the send_nsca Ruby gem.
Why is this relevant to our solution? Without employing some of these optimizations, our bottleneck would shift from running the checks to accepting the check results.
Another solution could be to run an NRPE server, and we could probably leverage some Ruby logic from Metis, a Ruby NRPE server.
Even after the following optimizations:
- using HEAD vs. GET
- large installation tweaks
- tuning the inter_check_delay_method
- parallel NSCA submissions vs. serial submissions
we can still optimize with:
- avoiding the fork by running all checks from the same process
- reusing the HTTP connection across multiple requests to the same host (potentially even doing HTTP pipelining); see the sketch below
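A minimal Ruby sketch of the connection-reuse idea, using only the stdlib net/http (the sub-URL paths are invented examples): open one TLS connection and issue HEAD requests for every path on the host.

```ruby
require 'net/http'

host  = 'somehost.atlassian.net'
paths = ['/wiki', '/jira', '/login']   # example sub-URLs

# One TCP+TLS connection, reused for all checks against this host
Net::HTTP.start(host, 443, use_ssl: true) do |http|
  http.open_timeout = 2
  http.read_timeout = 2
  paths.each do |path|
    started  = Time.now
    response = http.request_head(path)  # HEAD: status only, no body transfer
    puts format('%s%s %s in %.3fs', host, path, response.code, Time.now - started)
  end
end
```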
In the next blogpost we will show the results of proof-of-concept code involving Ruby/EventMachine/JRuby and various HTTP client libraries.
Kent Skaar pinged me last week, asking for feedback on Lisa '11 and input for Lisa 2012.
Thought I should share my advice to him with the rest of the world.
So if I were to host an event similar to Lisa, I'd have either Jordan Sissel or Mitchell Hashimoto give the keynote, because over the past 24 months those people have written more relevant tools for me than anyone else :)
I'd have someone talk about Kanban for Operations; there are two names that pop up: Dominica DeGrandis and Mattias Skarin.
I'd have the Ubuntu folks talk about Juju, and I'd have RI Pienaar talk about MCollective .. and while you have RI, have him talk about Hiera too. Have Dean Wilson carry RI's bags and put him unknowingly on a panel. (Masquerade it as a pub with hidden cameras.)
Obviously, as #monitoringsucks, you want to hear about new monitoring tools and initiatives and how people are dealing with them, so you want people talking about Graphite, Collectd, StatsD, Sensu, Icinga-MQ, and how people are reviving Ganglia and using it in large-scale environments.
You want someone to demystify queues. I mean .. who still knows the differences between Active, Rabbit, Zero, Hornet and many other Qs?
You want people talking about how they deal with logs, so talks about Logstash and Graylog2.
You want to cover Test-Driven Infrastructure: how do you test your infrastructure? Someone to demystify Cucumber and Webrat, and to talk about testing Charms, Modules and Cookbooks.
Oh, and filesystems, distributed ones: the Ceph, FraunhoferFS, Moose, KosmosFS, Glusters and Swifts of this world ... you want people to talk about their experiences, good and bad, with any of the above; someone who can actually compare them rather than relay hearsay. :) With recent updates on what's going on in these projects.
Now someone please organise this for me :) In a warm and sunny place ... preferably with 27 holes next door, and daycare for my kids :)
PS. Yes, the absence of any OpenStack-related topic is on purpose .. that's for 2013 :)
MonitoringSucks and we didn't fix it.
Earlier this week Inuits hosted a two-day hackfest titled #MonitoringSucks. A good number of people with a variety of backgrounds showed up on Monday morning. I don't know why, but people had high expectations for this event; did they really expect us to fix the #monitoringsucks problem in a mere two days?
Besides myself we had Patrick Debois, Grégory Karékinian, Stefan Jourdan, Colin Humphreys, Andrew Crump, Ohad Levy, Frank Marien, Toshaan Bharvani, Devdas Bhagat, Maciej Pasternacki, Axel Beckert, Jelle Smet, Noa Resare @blippie, John-John Tedro @udoprog, Christian Trabold @ctrabold, and obviously some people I missed.
A good mixture of Fosdem visitors who stayed a little longer in our cold country, and locals with ideas. We had people from TomTom, RedHat, Spotify, Booking.com, Inuits and Atlassian, coming from Belgium, The Netherlands, France, Israel, the UK, Sweden, Germany, Poland and Switzerland, if I'm not mistaken.
The format was pretty open; much of the first day was spent around the drawing board.
(Ohad Levy, Jelle Smet, Patrick Debois and Frank Marien discussing a variety of topics)
This monitoring topic is complex; there are different areas that need to be covered. The drawing below documents how we split the problem into different areas, and lists the different tools people use for each.
- Collection: Collectd, Nagios, Ganglia
- Transport: XMPP, SIMPLE, SMTP, 0mq, AMQP, rsyslog, IRC, STOMP
- Storage: RRD, Graphite, OpenTSDB, HBase
- Filtering: Logstash, Esper
- Visualisation: Graphite
- Notification: PagerDuty
- Reporting: Jasper
Obviously the above list is far from complete.
The afternoon discussion continued where we left off before lunch, just after the power cut. Only now we started refocusing on filtering and aggregating values using Logstash.
@patrickdebois had been talking before about the idea of using Logstash as a way to collect data, transform it, and throw it either to another tool or onto a queue.
Looking at Logstash, it kind of makes sense. Logstash already has a zillion input types, filters and outputs, including popular queues such as AMQP and ZeroMQ. Yes, the default behaviour for a lot of people is to get data from different inputs, filter it and then send it to ElasticSearch, but much more is possible with the available outputs.
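To sketch what that could look like, here is a hypothetical Logstash config: the log path and the broker address are placeholders, and option names vary between Logstash versions. It tails an Apache log, parses it, and fans the events out to both ElasticSearch and an AMQP queue:

```
input {
  file { type => "apache" path => "/var/log/httpd/access_log" }
}
filter {
  grok { type => "apache" pattern => "%{COMBINEDAPACHELOG}" }
}
output {
  elasticsearch { }                  # the classic search/store backend
  amqp {                             # ... and the same events onto a queue
    host          => "broker.example.com"
    exchange_type => "topic"
    name          => "monitoring"
  }
}
```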
It was only on Tuesday that people really started writing code.
So what did really come out of the #monitoringsucks hackfest?
A couple of people were working on packaging existing tools for their favourite distro. Others were working on integrating a number of already existing tools (e.g. Patrick working on more inputs for Logstash; me working on replacing Logster with Logstash, setting up Kibana, etc.). New tools were learned, items were added to todo lists (Kibana (which doesn't work on older Firefox versions), Tattle, StatsD) and items were scratched from todo lists (Graylog2, since Kibana replaces it as a good frontend for Logstash).
A lot of experiences with different tools were exchanged.
Frank Marien showed us a demo of his freshly released ExtremeMon framework. A really promising project.
The sad part about a workshop like this one is that you enter with a bunch of ideas and leave with even more ideas, hence more work. We haven't solved the problem yet, but a lot more people are now thinking about the problem and how to solve it in a more modular (Unix-style) approach: with different little tools, each good at something and all interconnectable.
As announced earlier, next Monday and Tuesday we're opening up the Inuits offices for everybody working on monitoring problems.
There's already a good number of people who have confirmed their presence, and some people have asked for practical details.
The plan is simple.
I'm going to be at the venue somewhere between 8:30 and 9:00 on Monday. (Hey .. it's the day after Fosdem, you know :))
The only thing I've planned is a get-to-know-each-other round at around 10:30; after that I'm expecting the hackathon to be self-organising.
There will be water, coffee, etc., IP connectivity, and electricity.
The location is still Duboisstraat 50, Antwerp
Free parking is on the Hardenvoort or Kempenstraat (a 3-minute walk); paid parking is right in front of the door.
A picture tells more than a ...
Several interesting books have been written about visualization:
- Designing with Data
- Visualize This
- Information Dashboard Design - Effective Communication
- Design by Nature
- Data Visualizations
- The chapter on visualization in the Big Data Glossary book
- The Visual Display of Quantitative Information
- Envisioning Information
- Visual and Statistical Thinking
Dashboards written for specific metric tools
Graphs are Graphite's killer feature, but there's always room for improvement:
- Graphiti - https://github.com/paperlesspost/graphiti - an alternative, well-designed UI. To see it in action, watch the presentation Metrics And You
- Pencil - https://github.com/fetep/pencil
- GDash by RI Pienaar - Graphite dashboards with version control, a graph DSL and easy bookmarks
- Charcoal - simple Graphite templates
Graphs in OpenTSDB are based on Gnuplot:
- OpenTSDB dashboard in Node.js - https://github.com/clover/opentsdb-dashboard
- Otus - https://github.com/otus/otus - a web dashboard built on top of Hadoop/OpenTSDB for monitoring Hadoop clusters
- The new Ganglia Web 2 is pretty slick!
- Visage - a web interface to Collectd RRD data
- A Collectd viewer by John Bergmans using WebSockets - AMQP - Collectd for a realtime view: http://bergmans.com/WebSocket/collectdViewer.html
Nagios also has a way to visualize metrics in its UI.
With all these different systems creating graphs, the nice folks from Etsy have provided a way to navigate the different systems easily via their dashboard - https://github.com/etsy/dashboard
I also like the idea of embeddable graphs as http://explainum.com implements it.
Development frameworks for visualization
Generic data visualization
- Protovis-js : http://code.google.com/p/protovis-js
- Processing-js: http://processingjs.org/
- Raphael-js: http://raphaeljs.com/
- Flare: http://flare.prefuse.org/
- Google Fusion Tables : http://www.google.com/fusiontables
- Polymaps: http://polymaps.org/ex/
- Yahoo UI elements: http://developer.yahoo.com/yui
- Gephi: http://gephi.org
- Graphviz: http://www.graphviz.org
Time-related libraries
For plotting time series and timelines, many people now use these libraries:
- Simile Timeline - http://www.simile-widgets.org/timeline/
- Simile Timeline in Google Charts - http://code.google.com/apis/chart/interactive/docs/gallery/annotatedtimeline.html
- Dygraphs - http://dygraphs.com/ - produces interactive, zoomable charts of time series
Annotations of events in timeseries:
On your graphs you often want events annotated. This could range from plotting new Puppet runs and tracking your releases to everything you do in the process of managing your servers. This is what John Allspaw calls Ops Meta-Metrics.
These events are usually marked as vertical lines.
- RRD Vertical - works for Cacti, Munin, Collectd ... - http://blog.vuksan.com/2010/06/28/overlay-deploy-timeline-on-your-ganglia-graphs
- Ganglia - Overlay Events: http://ganglia.info/?p=382
- Graphite - the drawAsInfinite function: http://readthedocs.org/docs/graphite/en/latest/functions.html
- Graphite - events to facilitate this: https://github.com/agoddard/graphite-events
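For example, a common trick is to push a deploy marker into Graphite as a regular metric over Carbon's plaintext protocol and then render it as a vertical line with drawAsInfinite(). A minimal Ruby sketch (the Graphite host and metric name are assumptions):

```ruby
require 'socket'

# Carbon's plaintext protocol: "metric value timestamp" over TCP port 2003
carbon = TCPSocket.new('graphite.example.com', 2003)
carbon.puts "events.deploys.webapp 1 #{Time.now.to_i}"
carbon.close

# Then overlay it on any graph with: drawAsInfinite(events.deploys.webapp)
```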
One thing I was wondering about: with all the metrics we store in these tools, we keep the relationships between them in our heads. I looked for tools that would link metrics or describe a dependency graph between them for navigation.
We could use Depgraph, a Ruby library for creating dependency graphs based on Graphviz, to draw a dependency tree, but we obviously first have to define it. Something similar to the Nagios dependency model (without the strict host/service relationship, of course).
While all the previously described metric systems have easy protocols, they tend to stay in sysadmin/operations land. But you should not stop there; there is a lot more to track than CPU, memory and disk metrics. This blogpost is about metrics up the stack: at the application middleware, the application, and user usage.
To the cloud
Maybe grumpy sysadmins have scared the developers and the business to the cloud. It seems that the space of application metrics, whether Ruby, Java or PHP, is ruled today by New Relic. In a blogpost, New Relic describes serving about 20 billion metrics a day.
- The New Relic Ruby gem https://github.com/newrelic/rpm is the official one
It allows for easy instrumentation of Ruby apps, but they also have support for PHP, Java, .NET, and Python.
Part of the secret of their success is the ease with which developers can get metrics from their application by adding a few files and a token.
Several other cloud monitoring vendors are stepping into the arena, and I really hope to see them grow the space and provide some competition:
- Scout: https://scoutapp.com comes from traditional server management and is slowly moving to application metrics
- Librato: https://metrics.librato.com can leverage existing agents such as StatsD, CollectD, and JMX
- Boundary: https://boundary.com focuses on a realtime view of metrics
- DataDog: http://www.datadoghq.com goes for a complete overview of all your metrics
Some other complementary services, popular amongst developers, are:
- Get Exceptional: http://www.getexceptional.com/ tracks errors in web apps, reports them in real time, and gathers the info you need to fix them fast.
- Airbrake: http://airbrake.io collects errors generated by other applications and aggregates the results for review.
- Alert Grid: http://alert-grid.com/ a workflow system, à la Yahoo Pipes, for notifications
- Proby: http://probyapp.com/ cron monitoring made simple
- Pingdom: http://www.pingdom.com/ Uptime and performance monitoring made easy
- Pagerduty: http://www.pagerduty.com/ Alerting that can be easily hooked into your existing monitoring solution
Check this blogpost on monitoring and reporting with Signal, Pingdom, Proby, Graphite, Monit, Pagerduty and Airbrake to see how they make a powerful team.
User tracking Metrics - Cloud
Clicks, page views, etc ...
Besides the application metrics, there is one other major player in web metrics: Google Analytics.
I found several tools to get data out of it using the Google Analytics API:
- Garb - A ruby wrapper for the google analytics API: https://github.com/vigetlabs/garb
- Gem for talking to Google Analytics API: https://github.com/rumble/gattica
- Google Analytics with Ruby and Garb: http://www.viget.com/extend/google-analytics-api-with-ruby-and-garb-making-it-even-easier
- More recent: Google Analytics Data Export API with Ruby + Gattica: http://www.seerinteractive.com/blog/google-analytics-data-export-api-with-rubygattica/2011/02/22
- Gattica - More recent fork by Deviantech, with goals and segment support: https://github.com/chrisle/gattica
With Google Analytics there is always a delay in getting your data.
If you want realtime statistics/metrics, check out Gaug.es http://get.gaug.es:
- Use the Gauges gem: https://github.com/orderedlistinc/gauges-gem to import/export data
I haven't really gotten into this yet, but getting metrics out of A/B testing is well worth exploring:
- Optimizely: http://www.optimizely.com
- Visual Website Optimizer: http://visualwebsiteoptimizer.com
- Google Website Optimizer: http://www.google.com/websiteoptimizer
Page render time
Another important thing to track is page render time. This is well explained in the Real User Monitoring chapter (Chapter 10) of Complete Web Monitoring (O'Reilly Media).
Again, New Relic provides RUM: Real User Monitoring. See How we provide real user monitoring: a quick technical review for more technical info.
- Keynote: monitoring like a real user experiences your website - http://www.keynote.com
- Real User Monitoring New Relic - http://newrelic.com/rum
- Tracking metrics - Velvet Metrics: http://www.velvetmetrics.com
Who needs a cloud anyway
Putting your metrics into the cloud can be very convenient, but it has downsides:
- most tools don't have a way to redirect/replicate the metrics they collect internally
- that makes it hard to correlate them with your internal metrics
- it's easy to get metrics in, but hard to get the full/raw data out again
- it depends on the internet, duh, and sometimes that fails :)
- privacy concerns or the sheer volume of metrics can make it impossible to put them in the cloud
Application Metrics - Non-Cloud
In his epic Metrics, Metrics Everywhere, Coda Hale explains the importance of instrumenting your code with metrics. This looks very promising, as it is really driven from the developers' world:
- Coda Hale's Metrics: https://github.com/codahale/metrics allows capturing JVM- and application-level metrics
- Simon - Simple Monitoring API for Java: http://code.google.com/p/javasimon
- Stajistics - a free monitoring and runtime performance statistics collection API for Java: https://code.google.com/p/stajistics
- Parfait - a Java performance framework: https://code.google.com/p/parfait
Or you can always use JMX to get monitoring metrics from your application:
- JMX - JRuby JMX: https://github.com/enebo/jmxjr allows you to access MBeans as Ruby classes
- An example using JMX and Jruby: https://github.com/nicksieger/advent-jruby
And with jmxtrans http://code.google.com/p/jmxtrans you can feed JMX information into Graphite, Ganglia, or Cacti/RRDtool.
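A hedged jmxtrans config sketch (the JSON layout follows jmxtrans' documented examples as I remember them; the JMX port and Graphite address are placeholders) that samples JVM heap usage and writes it to Graphite:

```
{
  "servers": [{
    "host": "localhost",
    "port": 1099,
    "queries": [{
      "obj": "java.lang:type=Memory",
      "attr": ["HeapMemoryUsage"],
      "outputWriters": [{
        "@class": "com.googlecode.jmxtrans.model.output.GraphiteWriter",
        "settings": { "host": "graphite.example.com", "port": 2003 }
      }]
    }]
  }]
}
```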
- A Ruby equivalent of the Metrics library, by John Ewart: https://github.com/johnewart/ruby-metrics
- Ruby - FnordMetric https://github.com/paulasmuth/fnordmetric is a highly configurable (and pretty fast) realtime app/event tracking thing based on ruby eventmachine and redis. You define your own plotting and counting functions as ruby blocks!
- Pinba - Monitoring Php Processing using Timers - http://pinba.org/wiki/Main_Page
- A good overview post about collecting application metrics in Java
- An open source New Relic clone: https://github.com/devmen/FreeRelic
Etsy style: StatsD
To collect various metrics, Etsy has created StatsD https://github.com/etsy/statsd, a network daemon for aggregating statistics (counters and timers), rolling them up, and sending them to Graphite.
Clients have been written in many languages: PHP, Java, Ruby, etc.
- Ruby gems for registering metrics with Statsd - http://rubydoc.info/gems/fozzie
- A Statsd Server in Ruby - https://github.com/fetep/ruby-statsd
- A Statsd Client in Ruby - https://github.com/github/statsd-ruby
- Another Statsd Client in Ruby - http://github.com/bvandenbos/statsd-client
- A Statsd client that isn't a direct port- https://github.com/reinh/statsd
- StatsD instrumentation via metaprogramming methods in Ruby - https://github.com/shopify/statsd-instrument
It's incredible to see the power and simplicity of this; I've created a simple proof of concept to extract the StatsD metrics over ZeroMQ in this experimental fork.
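The wire protocol really is that simple: a metric name, a value, and a type, over UDP. A minimal Ruby client sketch (the StatsD host and metric names are examples):

```ruby
require 'socket'

# StatsD speaks "name:value|type" over UDP: c = counter, ms = timer
statsd = UDPSocket.new

statsd.send('web.requests:1|c', 0, 'statsd.example.com', 8125)        # count a request
statsd.send('web.render_time:320|ms', 0, 'statsd.example.com', 8125)  # timing in ms
```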
MetricsD https://github.com/tritonrc/metricsd tries to marry Etsy's StatsD with the Coda Hale / Yammer Metrics library for the JVM, and puts the data into Graphite. It should be drop-in compatible with Etsy's StatsD, although it adds explicit support for meters (the m type) and gauges (the g type) and introduces the h (histogram) type as an alias for timers (ms).
User tracking - Non Cloud
Clicks, page views, etc ...
Here are some open source web analytics libraries. These are merely links; I haven't investigated them enough yet, work in progress:
- Open Web Analytics
- Grape Web Statistics
- Ruwa - Ruby on Rails Web Analytics
- Riopro/piwik - ruby gem
- JKraemer piwik-tracker
- Autometal - Piwik ruby gem
- Awstats Reader - in Python
Another tool worth mentioning for tracking end users is Hummingbird - http://hummingbirdstats.com/. It is NodeJS-based and allows for realtime web traffic visualization. To send metrics it has a very simple UDP protocol.
The author of Split (see below) pointed out several A/B testing frameworks:
- ABingo : Rails A/B Testing - http://www.bingocardcreator.com/abingo
- Seven Minute Abs: Rails A/B Testing - https://github.com/paulmars/seven_minute_abs
- Vanity: Experiment Driven Development - http://vanity.labnotes.org/
And he presented his own A/B testing framework: Split - http://github.com/andrew/split
It would be interesting to integrate this further with traditional monitoring/metrics tools: viewing metrics per new version, per enabled flag, etc. In a nutshell: food for thought.
Page render time
For checking page render time, I could not really find open source alternatives.
It's exciting to see the crossover between development, operations and business. Up until now only New Relic has a well-integrated suite for all these metrics; I hope the internal solutions catch up.
Now that we have all that data, it's time to talk about dashboards and visualization. On to the next blogpost.
If you are using other tools, have ideas, feel free to add them in the comments.
Given that @patrickdebois is working on improving data collection, I thought it would be a good idea to describe the setup I currently have hacked together. (Something which can be used as a starting point to improve stuff, and I have to write documentation anyhow.)
I currently have three sources and one target, which will eventually expand to at least one more target and most probably more sources too.
The three sources are basically typical system data, which I collect using collectd. However, I'm using collectd-carbon from https://github.com/indygreg/collectd-carbon.git to send the data to Graphite.
I'm parsing the Apache and Tomcat logfiles with Logster, currently sending the results only to Graphite, but Logster has an option to send them to Ganglia too.
And I'm using jmxtrans to collect JMX data from Java apps that expose it, and send it to Graphite. (jmxtrans also comes with a Ganglia target option.)
Rather than going in depth over the config, it's probably easier to point to a Vagrant box I built, https://github.com/KrisBuytaert/vagrant-graphite, which brings up a machine that does pretty much all of this on localhost.
Obviously it's still a work in progress, and lots of classes will need to be parametrized and cleaned up. But it's a working setup, and not just on my machine ..
If you are hacking on monitoring solutions and want to talk to your peers solving the same problems, block the Monday and Tuesday after Fosdem in your calendar!
That's right: on February 6 and 7 a bunch of people interested in fixing the problem will be meeting, discussing and hacking stuff together in Antwerp.
In short: a #monitoringsucks hackathon.
Inuits is opening up their offices for everybody who wants to join the effort. Please let us (@KrisBuytaert and @patrickdebois) know if you want to join us in Antwerp.
Obviously if you can't make it to Antwerp you can join the effort on ##monitoringsucks on Freenode or on Twitter.
The location will be Duboisstraat 50, Antwerp.
It is about a 10-minute walk from Antwerp Central train station.
Depending on traffic, Antwerp is about half an hour north of Brussels, and there are hotels within walking distance of the venue.
Plenty of parking space is available on the other side of the park.