Crowbar is quietly getting more interesting (video)
Crowbar is an interesting project that I've covered before. Born out of Dell's cloud group, much of the initial buzz described it as an installer for the cloud era... "kickstart on steroids", if you will.
Crowbar's close association with the OpenStack project has further cemented its reputation as an installer to watch. But it's Crowbar's quiet potential as a stack management tool that is the most interesting. Through the use of barclamps (Crowbar's modules) you can tell Crowbar to build a full stack from the BIOS config all the way up to your middleware and applications. John Willis on an episode of DevOps Cafe called it "Data Center as Code".
Crowbar barclamps are also an interesting way for independent projects or vendors to ensure that their projects/products can be easily integrated into a custom platform (today this type of focus is usually in the context of making things work on OpenStack). Want to add a new component to your platform? Grab the barclamp and Crowbar will know how to do the rest. Or at least that is the promise. The project is still young and the community is still forming.
Leading open source software projects is new territory for Dell as a company, but the Crowbar team does seem committed and community focused. I've heard some grumbles from developers that barclamp development and testing cycles can be a bit tedious due to the nature of what you are building. But there's no reason to believe that those types of issues won't get sorted out over time.
A couple of Crowbar related videos are below:
The first video was made by my DTO Solutions colleague, Keith Hudgins, after he wrote a barclamp for Zenoss. It's a short demo and tour that can give you a feel for Crowbar and Barclamps.
The next video is Barton George (Dell) interviewing Rob Hirschfeld (Dell). They start off talking about the Hadoop barclamp but quickly get into a broader discussion about Crowbar.
Devops a Wicked problem
One of the strong pillars of devops (if not the strongest) is collaboration/communication. For my talk on Devops Metrics at Velocity 2011 I researched how to prove that collaboration is a good thing: when discussing devops with people, it sometimes comes down to belief, either that it makes sense to collaborate more or that all this collaboration is overkill. Around that time I came across Design Thinking and read how it evolved from one person doing the design, to listening to user requirements, to participatory design. In the book Design Thinking - Understanding How Designers Think and Work, Nigel Cross writes that design used to be a collaborative thing (like guilds trying to push their craft forward).
Symmetry of Ignorance
One of the concepts introduced was the symmetry of ignorance (PDF):
Complex design problems require more knowledge than any one single person can possess, and the knowledge relevant to a problem is often distributed and controversial. Rather than being a limiting factor, “symmetry of ignorance” can provide the foundation for social creativity. Bringing different points of view together and trying to create a shared understanding among all stakeholders can lead to new insights, new ideas, and new artifacts. Social creativity can be supported by new media that allow owners of problems to contribute to framing and solving these problems. These new media need to be designed from a meta-design perspective by creating environments in which stakeholders can act as designers and be more than consumers.
Sounds like systems thinking, and it reminded me of the knowledge divide within the devops problem space. When you spend time with each group/silo individually, they often think themselves superior to the other group: "ha those devs they don't know anything about the systems, ha those ops don't know anything about coding". So it seems to be more about a symmetry of arrogance. That arrogance symmetry reminded me of the saying "We judge others by their behavior, we judge ourselves by our intentions". We might think we know more or can do better, but that is often not visible in our actions.
This kind of got me intrigued and I wanted to explore the subject more for the next Cutter Summit 2012.
Wicked Problem
Design thinking and this symmetry of ignorance are both related to the concept of wicked problems.
Rittel and Webber's (1973) formulation of wicked problems specifies ten characteristics:
- There is no definitive formulation of a wicked problem (defining wicked problems is itself a wicked problem).
- Wicked problems have no stopping rule.
- Solutions to wicked problems are not true-or-false, but better or worse.
- There is no immediate and no ultimate test of a solution to a wicked problem.
- Every solution to a wicked problem is a "one-shot operation"; because there is no opportunity to learn by trial and error, every attempt counts significantly.
- Wicked problems do not have an enumerable (or an exhaustively describable) set of potential solutions, nor is there a well-described set of permissible operations that may be incorporated into the plan.
- Every wicked problem is essentially unique.
- Every wicked problem can be considered to be a symptom of another problem.
- The existence of a discrepancy representing a wicked problem can be explained in numerous ways. The choice of explanation determines the nature of the problem's resolution.
- The planner has no right to be wrong (planners are liable for the consequences of the actions they generate).
I'll let you judge if you think devops (or even monitoring sucks :) is a wicked problem
More readings to explore:
- Evaluating the Semantic Approach through Horst Rittel's Second-Generation System Analysis
- Why Horst W.J. Rittel Matters
- Lean Essays - Wicked Problems
- Development is Inherently wicked
- Power and Interest in Developing Knowledge Societies
- Please list out 3 most important Tactics for solving Wicked Problems?
- Exploring ‘design thinking’ and organizational change: A Conversation
- The Lost Operational Art: Invigorating Campaigning into the Australian Defence Force
- Complexity, the “New Normal” 2: Leading to the Essence
- Complexity the New Norm 3: Listen to your guts – Are they really on the same page?
- Complexity the New Norm 4: Improving Sales Performance – Are you ready for the Challenge?
- Transcending the Individual Human Mind
- Bounding Wicked Problems: The C2 of Military Planning Topic 3: Information Sharing and Collaboration Processes and Behaviors
- Dimensions of a Spiral
Cynefin
The whole discussion on what is a wicked problem or not reminded me of a talk by Dave Snowden. He helped create the Cynefin model.
The Cynefin framework has five domains. The first four domains are:
- Simple, in which the relationship between cause and effect is obvious to all, the approach is to Sense - Categorise - Respond and we can apply best practice.
- Complicated, in which the relationship between cause and effect requires analysis or some other form of investigation and/or the application of expert knowledge, the approach is to Sense - Analyze - Respond and we can apply good practice.
- Complex, in which the relationship between cause and effect can only be perceived in retrospect, but not in advance, the approach is to Probe - Sense - Respond and we can sense emergent practice.
- Chaotic, in which there is no relationship between cause and effect at systems level, the approach is to Act - Sense - Respond and we can discover novel practice.
- Disorder
Note that this is a sense-making framework, not an ordering framework: you can't always place your problems exactly in one of the spaces, but it gets you thinking about which solutions to apply to which problems. And it fits in nicely with other frameworks, as explained in A Tour of Adoption and Transformation models.
So devops in my opinion, falls into the complex problem space.
A great video explaining it was recorded at ALE 2011:
He explains many things, but here are a few things that resonated with me:
- why in some problem spaces there is no best practice but only good practice
- we have to create fail-safe environments
- solutions to problems in the complex space can be found by probing
- the human factor makes the difference / we are not machines (automation)
- the solution is often easy once you have solved it, but you need to go through the process of discovery.
That last point reminded me of the Debt Metaphor - Ward Cunningham. @littleidea explained that Ward was using a different concept of Technical Debt than most people use: he explains technical debt as the difference between the implementation and the ideal implementation in hindsight. Not because of bad implementation or deliberate shortcuts, but because of new insights gathered during the discovery/problem-solving process.
More research can be found at:
- More on Chaos and Cynefin- Tom Graves / Tetradian
- Simple vs Complicated vs Complex vs Chaotic
- Scan Agile 2009 - Snowden
- Finding the Simplicity Embedded in Complexity
- Cognitive Kanban
The fact that problems don't always stay in, or match, one of the locations on the diagram is nicely visualized by adding dimensions to the diagram (something that got lost in the initial publication).
To tackle complex problems he suggests using three principles of complexity based management:
- Use fine grained objects: avoid "chunking"
- Distributed Cognition: the wisdom but not the foolishness of crowds
- Disintermediation: connecting decision makers with raw data
This could result in the Resilient Organisation
Resilience engineering
Because in complex systems it's hard to predict the exact behavior, Dave Snowden also talks about going From Robustness to Resilience. It almost sounded like the difference between MTBF and MTTR, as John Allspaw explains in Outages Post-Mortems and Human Error 101.
I had come across those articles before but never looked at them from the Snowden perspective. So there is more to explore:
- Why complex systems fail
- How Complex systems fail - a webops perspective
- How resilience engineering applies to the web world
- Resilience Engineering: Part 1
- Why resilience is a term worth preserving
- Beyond Resilience: Visionary Adaptation
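The MTBF versus MTTR contrast mentioned above can be made concrete with a little arithmetic. The sketch below uses the common steady-state approximation availability = MTBF / (MTBF + MTTR); the numbers are purely illustrative assumptions, not measurements from any of the linked articles.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only (assumed, not taken from any real system).
baseline = availability(mtbf_hours=1000.0, mttr_hours=1.0)

# "Robustness": fail half as often.
fewer_failures = availability(mtbf_hours=2000.0, mttr_hours=1.0)

# "Resilience": recover twice as fast.
faster_recovery = availability(mtbf_hours=1000.0, mttr_hours=0.5)

print("baseline:        %.6f" % baseline)
print("fewer failures:  %.6f" % fewer_failures)
print("faster recovery: %.6f" % faster_recovery)
```

Under this approximation, halving the time to recover buys the same availability as doubling the time between failures, which is one way to read the robustness versus resilience argument.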
Silos and Resilience
The final document I'd like to highlight is about Reducing the impact of Organisational Silos on Resilience.
Stone quotes five questions suggested by Angela Drummond (a practitioner in the area of silo breaking and organisational change) to help executives identify and overcome silos.
- “does your organisation structure promote collaboration, or do silos exist?”
- “do you have collaboration in your culture and as part of your value system?”
- “do you have the IT infrastructure for effective collaboration?”
- “do you believe in collaboration? Do you model that belief?”
- “do you have a reward system for collaboration?”
Quoting from the article:
Resilience cannot be achieved in isolation of other units and organisations. In summary, there is a need to recognise:
- the characteristics of silo formation, particularly in the creation of new organisational structures or as part of change management processes
- a convergence of interests, taking account of the fact that “we are all in this together”. Efforts are needed to achieve seamless internal relationships at the intraorganisational level and a commitment to work with others to advance community resilience (perhaps with a judicious contribution from government) at the broader societal level
- the case for collaboration. Gains are often possible by pooling ideas and resources (the total is greater than the sum of the parts)
- the value of harnessing grass-root capability including through continuous knowledge-building and sharing learnings in a trusted environment
- that cost-effectiveness calculations don’t easily take account of broad organisational or social needs and that the analysis may need supplementation if wide objectives are to be met
Leadership is the key to bringing these elements together. Leadership is needed to reduce and mitigate risks before crises occur.
It was fascinating to read that collaboration and resilience go hand in hand. Breaking the silos is really a must there, and it requires collaboration. The inter-company silos also fit in nicely with The Agile Executive - A new Context for Agile presentation on how we come to rely on external services in a SaaS model, which will be another silo to tackle.
Final note
This is all research in progress, but it's exciting to see a lot of different concepts fit in nicely. I apologize that this isn't yet a complete polished train of thought, but it might be useful to explore more on the subject.
Convincing management that cooperation and collaboration was worth it
While searching around for something else, I came across this note I sent in late 2009 to the executive leadership of Yahoo’s Engineering organization. This was when I was leaving Flickr to work at Etsy. My intent in sending it was to be open to the rest of Yahoo about how things worked at Flickr, and why. I did this in the hope that other Yahoo properties could learn from that team’s process and culture, which we worked really hard at building and keeping.
The idea that Development and Operations could:
- Share responsibility/accountability for availability and performance
- Have an equal seat at the table when it came to application and infrastructure design, architecture, and emergency response
- Build and maintain a deferential culture to each other when it came to domain expertise
- Cultivate equanimity when it came to emergency response and post-mortem meetings
…wasn’t evenly distributed across other Yahoo properties, from my limited perspective.
But I knew (still know) lots of incredible engineers at Yahoo that weren’t being supported as they could be by their upper management. So sending this letter was driven by wanting to help their situation. Don’t get me wrong, not everything was rainbows and flowers at Flickr, but we certainly had a lot more of them than other Yahoo groups.
When I re-read this, I’m reminded that when I came to Etsy, I wasn’t entirely sure that any of these approaches would work in the Etsy Engineering environment. The engineering staff at Etsy was a lot larger than Flickr’s and continuous deployment was in its infancy when I got there. I can now happily report that 2 years later, these concepts not only solidified at Etsy, they evolved to accommodate a lot more than what challenged us at Flickr. I couldn’t be happier about how it’s turned out.
I’ll note that there’s nothing groundbreaking in this note I sent, and nothing that I hadn’t said publicly in a presentation or two around the same time.
This is the note I sent to the three layers of management above me in my org at Yahoo:
Subject: Why Flickr went from 73rd most popular Y! property in 2005 to the 6th, 5 years later.
Below are my thoughts about some of the reasons why Flickr has had success, from an Operations Engineering manager’s point of view.
When I say everyone below, I mean all of the groups and sub-groups within the Flickr property: Product, Customer Care, Development, Service Engineering, Abuse and Advocacy, Design, and Community Management.
Here are at least some of the reasons we had success:
- Product included and respected everyone’s thoughts, in almost every feature and choice.
- Everyone owned availability of the site, not just Ops.
- Community management and customer service were involved early and often. In everything. If they weren’t, it was an oversight taken seriously, and would be fixed.
- Development and Operations had zero divide when it came to availability and performance. No, really. They worked in concert, involving each other in their own affairs when it mattered, and trusting each other every step of the way. This culture was taught, not born.
- I have never viewed Flickr Operations as firefighters, and have never considered Flickr Dev Engineering to be arsonists. (I have heard this analogy elsewhere in Yahoo.) The two teams are 100% equal partners, with absolute transparency. If anything, we had a problem with too much deference given between the two teams.
- The site was able to evolve, change, and grow as fast as it needed to, as long as it was made safe to do so. To be specific: code and config deploys. When it wasn’t safe, we slowed, and everyone was fine with that happening, knowing that the goal was to return to fast-as-we-need-to-be. See above about everyone owning availability.
- Developers were able to see their work almost instantly in production. Institutionalized fear of degradation and outage ensured that changes were as safe as they needed to be. Developers and Ops engineers knew intuitively that the safety net you have is the one that you have built for yourself. When changes are small and frequent, the causes of degradation or outage due to code deploys are exceptionally transparent to all involved. (Re-read above about everyone owning availability.)
- We never deployed “early and often” because:
  - it was a trend,
  - we wanted to brag,
  - or we thought we were better than anyone. (We did it because it was right for Flickr to do so.)
- Everyone was made aware of any launches that had risks associated with it, and we worked on lists of things that could possibly go wrong, and what we would do in the event they did go wrong. Sometimes we missed things, and we had to think quickly, but those times were rare with new feature launches.
- Flickr Ops had always had the “go or no-go” decision, as did other groups who could vote with respect to their preparedness. A significant part of my job was working towards saying “go”, not “no-go”. In fact, almost all of it.
Examples: the most boring (anti-climactic, from an operational perspective) launches ever
- Flickr Video: I actually held the launch back by some hours until we could rectify a networking issue that I thought posed a risk to post-launch traffic. Other than that, it was a switch in the application that was turned from off to on. The feature’s code had been on prod servers for months in beta. See ‘dark launch’
- Homepage redesign: Unprecedented amount of activity data being pulled onto the logged-in homepage, order of magnitude increase in the number of calls to backend databases. Why was it boring? Because it was dark launched 10 days earlier. The actual launch was a flip of the ‘on’ switch
- People In Photos (aka, ‘people tagging’): Because the feature required data that we didn’t actually have yet, we couldn’t exactly dark launch it. It was a feature that had to be turned on, or off. Because of this, Flickr’s Architect wrote out a list of all of the parts of the feature that could cause load-related issues, what the likelihood of each was, how to turn those parts of the feature off, what customer care effect it might have, and what contingencies would probably require some community management involvement.
Dark Launches
When we already have the data on the backend needed to display for a new feature, we would ‘dark launch’, meaning that the code would make all of the back-end calls (i.e. the calls that bring load-related risk to the deploy) and simply throw the data away, not showing it to the user. We could then increase or decrease the percentage of traffic who made those calls in safety, since we never risked the user experience by showing them a new feature and then having to take it away because of load issues.
This increases everyone’s confidence almost to the point of apathy, as far as fear of load-related issues is concerned. I have no idea how many code deploys were made to production on any given day in the past 5 years (although I could find it on a graph easily), because for the most part I don’t care: those changes made in production have such a low chance of causing issues. When they have caused issues, everyone on the Flickr staff can find on a webpage when the change was made, who made the change, and exactly (line-by-line) what the change was.
In the case where we had confidence in the resource consumption of a feature, but not 100% confidence in functionality, the feature was turned on for staff only. I’d say that about 95% of the features we launched in those 5 years were turned on for staff long before they were turned on for the entire Flickr population. When we still didn’t feel 100% confident, we ramped up the percentage of Flickr members who could see and use the new feature slowly.
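The dark launch, staff-only, and percentage ramp-up ideas described above can be sketched roughly as follows. This is a minimal illustration, not Flickr's implementation: the function names, the bucketing by CRC32 of the user id, and the `calls` list used to make the load visible are all assumptions made for the example.

```python
import zlib

calls = []  # records back-end calls so the generated load is observable in this sketch


def in_rollout(user_id, percent):
    """Place a user deterministically in one of 100 buckets; the first `percent` are in."""
    bucket = zlib.crc32(str(user_id).encode("utf-8")) % 100
    return bucket < percent


def fetch_activity(user_id):
    """Hypothetical stand-in for the expensive back-end calls the new feature needs."""
    calls.append(user_id)
    return {"user_id": user_id, "items": []}


def render_homepage(user_id, dark_percent, visible_percent=0, staff_ids=frozenset()):
    page = {"user_id": user_id}
    visible = user_id in staff_ids or in_rollout(user_id, visible_percent)
    if visible or in_rollout(user_id, dark_percent):
        data = fetch_activity(user_id)  # the real load is generated either way
        if visible:
            page["activity"] = data     # staff and ramped-up users see the feature
        # otherwise the data is thrown away: this is the dark launch
    return page


# Dark launch for all traffic, feature visible to (hypothetical) staff member 7 only.
print(render_homepage(42, dark_percent=100, staff_ids={7}))
print(render_homepage(7, dark_percent=100, staff_ids={7}))
```

Raising or lowering `dark_percent` changes how much traffic exercises the back end without changing what any user sees, and `visible_percent` can then be ramped up separately once the load looks safe.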
Config Flags
We have many pieces of Flickr that are encapsulated as ‘feature’ flags, which look as simple as: $cfg[disable_feature_video] = 0; This allows the site to be much more resilient to specific failures. If we have any degradation within a certain feature, we can simply turn that feature off in many cases, instead of taking the entire site down. These ‘flags’ have, in the past, been prioritized in conversations with Product, so there is an easy choice to make if something goes wrong and site uptime becomes opposed to feature uptime.
This is an extremely important point: Dark Launches and Config Flags were concepts and tools created by Flickr Development, not Flickr Operations, even though the end-result of each points toward a typical Operations goal: stability and availability. This is a key distinction. These are initiatives made by Engineering leadership because devs feel protective of the availability of the site and respectful of Operations responsibilities, and because it is just plain good engineering.
If Flickr Operations had built these tools and approaches to keeping the site stable, I do not believe we would have the same amount of success.
There is more on this topic here: http://code.flickr.com/blog/2009/12/02/flipping-out/
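As a rough illustration of the config flag idea, here is a minimal sketch in Python rather than the PHP shown above. The flag names other than `disable_feature_video`, the helper functions, and the choice to treat unknown features as disabled are assumptions for the example, not Flickr's actual code.

```python
# Hypothetical flag table mirroring the $cfg[...] style: 0 means the feature is on.
cfg = {
    "disable_feature_video": 0,
    "disable_feature_people_tagging": 0,
}


def feature_enabled(name, flags=cfg):
    """A feature is on when its disable flag exists and is 0; unknown features are off."""
    return flags.get("disable_feature_" + name, 1) == 0


def render_photo_page(photo_id, flags=cfg):
    """Build the page from sections, skipping any feature that has been switched off."""
    sections = ["photo"]
    if feature_enabled("video", flags):
        sections.append("video_player")
    if feature_enabled("people_tagging", flags):
        sections.append("people_in_photo")
    return sections


print(render_photo_page(1))

# Video is degrading: flip one flag and the rest of the page keeps working.
degraded = dict(cfg, disable_feature_video=1)
print(render_photo_page(1, degraded))
```

The point of the sketch is that the page degrades feature by feature: one flag change removes the troubled section while everything else stays up.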
Summary
Flickr Operations is in an enviable position in that they don’t have to convince anyone in the Flickr property that:
- Operations has ‘go or no-go’ decision-making power, along with every other subgroup.
- Spending time, effort, and money to ensure stable feature launches before they launch is the rule, not the exception.
- Continuous Deployment is better for the availability of the site
- Flickr Operations should be involved as early as possible in the development phase of any project
These things are taken for granted. Any other way would simply feel weird.
I have no idea if posting this letter helps anyone other than myself, but there you go.
Monitoring Wonderland Survey – Visualization
A picture tells more than a ...
Now that you've collected all the metrics you wanted, or even more, it's time to make them useful by visualizing them. Every self-respecting metrics tool provides a visualization of the data collected. Older tools tended to revolve around creating RRD graphics from the data. Newer applications are leveraging javascript or flash frameworks to have the data updated in realtime and rendered by the browser. People are exploring new ways of visualizing large amounts of data efficiently. Good examples are Visualizing Device Utilization by Brendan Gregg and Multi User - Realtime heatmap using Nodejs.
Several interesting books have been written about visualization:
- Designing with Data
- Visualize this
- Information Dashboard Design - Effective Communication
- Design by Nature
- Data Visualizations
- Chapter on visualization in Big Data Glossary Book
- The Visual Display of Quantitative Information
- Envisioning Information
- Visual and Statistical Thinking
Dashboards written for specific metric tools
Graphite
Graphs are Graphite's killer feature, but there's always room for improvement:
- Graphiti - https://github.com/paperlesspost/graphiti an alternative, well-designed UI. To see it in action watch this presentation: Metrics And you
- Pencil - https://github.com/fetep/pencil
- RI Pienaar has created Gdash - Graphite: version control, add graphs dsl, easy bookmarks
- Charcoal - Charcoal: Simple Graphite Templates
- Graphite - Jquery - https://github.com/prestontimmons/graphitejs - if you want to do it all in Javascript
- Grockets - Realtime streaming graphite data via socket.io and node.js
Opentsdb
Graphs in Opentsdb are based on Gnuplot
- Opentsdb- Dashboard in Nodejs - https://github.com/clover/opentsdb-dashboard
- Otus - https://github.com/otus/otus - Web Dashboard built on top of Hadoop/Opentsdb for monitoring a Hadoop cluster
Ganglia
- The New Ganglia Web - 2 is pretty slick!
Collectd
- Visage - Web Interface to collectd - RRD
- a CollectD viewer by John Bergmans using Websockets - AMQP - Collectd - realtime view: http://bergmans.com/WebSocket/collectdViewer.html
Nagios
Nagios also has a way to visualize metrics in its UI
Overall integration
With all these different systems creating graphs, the nice folks from Etsy have provided a way to navigate the different systems easily via their dashboard - https://github.com/etsy/dashboard
I also like the idea of Embeddable Graphs as http://explainum.com implements it
Development frameworks for visualization
Generic data visualization
There are many javascript graphing libraries. Depending on how you need to visualize things, they provide you with different options. This first list covers the more generic graphics libraries:
- Protovis-js : http://code.google.com/p/protovis-js
- Processing-js: http://processingjs.org/
- Raphael-js: http://raphaeljs.com/
- Flare: http://flare.prefuse.org/
- Google Fusion Tables : http://www.google.com/fusiontables
- Polymaps: http://polymaps.org/ex/
- Yahoo UI elements: http://developer.yahoo.com/yui
- Gephi: http://gephi.org
- Graphviz: http://www.graphviz.org
Time related libraries
To plot things many people now use:
- Flot: http://code.google.com/p/flot/
- Ruby interface to Flot: https://github.com/pbosetti/flotr
For timeseries/timelines these libraries are useful:
- Simile Timeline - http://www.simile-widgets.org/timeline/
- Simile Timeline in Google Charts - http://code.google.com/apis/chart/interactive/docs/gallery/annotatedtimeline.html
- Dygraphs - http://dygraphs.com/ - produces interactive, zoomable charts of time series
- Rickshaw - https://github.com/shutterstock/rickshaw : A JavaScript toolkit for creating interactive real-time graphs
- D3 - Data Driven Documents - http://mbostock.github.com/d3 . To see it in action check out Cube - https://github.com/square/cube/wiki, a tool that uses D3 and MongoDB for realtime visualizations.
And why not have Javascript generate/read some RRD graphs :
- Javascript RRD Graph - https://github.com/manuelluis/jsrrdgraph
- Javascript for reading/interpreting RRD files - http://javascriptrrd.sourceforge.net
- Pure javascript RRD file manipulation implementation - https://github.com/tjfontaine/javascript-rrd
Annotations of events in timeseries:
On your graphs you often want events annotated. This could range from plotting new puppet runs and tracking your releases to everything that you do in the process of managing your servers. This is what John Allspaw calls Ops-Metametrics.
These events are usually marked as vertical lines.
- RRD Vertical - works for Cacti, Munin, Collectd ... - http://blog.vuksan.com/2010/06/28/overlay-deploy-timeline-on-your-ganglia-graphs
- Ganglia - Overlay Events: http://ganglia.info/?p=382
- Graphite - drawAsInfinite: http://readthedocs.org/docs/graphite/en/latest/functions.html
- Graphite - Events to facilitate this: https://github.com/agoddard/graphite-events
- Opentsdb - has a feature request for annotations, but it is not yet implemented
Dependencies graphs
One thing I was wondering about is that, with all the metrics we store in these tools, we store the relationships between them in our heads. I looked for tools that would link metrics or describe a dependency graph between them for navigation.
We could use Depgraph - Ruby library to create dependencies - based on Graphviz to draw a dependency tree, but we obviously first have to define it. Something similar to the Nagios dependency model (without the strict host/service relationship, of course).
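To give a feel for what "defining it first" could look like, here is a minimal sketch that describes metric dependencies as plain data, emits Graphviz DOT text for drawing, and walks the graph for navigation. It does not use Depgraph; the metric names and helper functions are hypothetical.

```python
# Hypothetical metric names; the mapping reads "metric -> metrics it depends on".
dependencies = {
    "web.response_time": ["app.render_time", "db.query_time"],
    "app.render_time": ["db.query_time"],
    "db.query_time": [],
}


def to_dot(deps):
    """Render the dependency mapping as Graphviz DOT text."""
    lines = ["digraph metrics {"]
    for metric in sorted(deps):
        for upstream in sorted(deps[metric]):
            lines.append('  "%s" -> "%s";' % (metric, upstream))
    lines.append("}")
    return "\n".join(lines)


def upstream_of(metric, deps):
    """All metrics reachable from `metric`: where to look next when it degrades."""
    seen = []
    stack = list(deps.get(metric, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(deps.get(current, []))
    return sorted(seen)


print(to_dot(dependencies))
print(upstream_of("web.response_time", dependencies))
```

The DOT text can be fed to Graphviz's `dot` tool to draw the tree, while `upstream_of` is the kind of lookup a dashboard could use to jump from one graph to the graphs it depends on.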
Conclusion
With all the libraries to get data in and out and the power of javascript graphing libraries we should be able to create awesome visualizations of our metrics. This inspired me and @lusis to start thinking about creating a book on Metrics/Monitoring graphing patterns. Who knows ...
Monitoring Wonderland Survey – Moving up the stack Application and User metrics
While all the previously described metric systems have easy protocols, they tend to stay in Sysadmin/Operations land. But you should not stop there. There is a lot more to track than CPU, memory and disk metrics. This blogpost is about metrics further up the stack: application middleware, the application itself, and user usage.
To the cloud
Application Metrics
Maybe grumpy sysadmins have scared the developers and business to the cloud. It seems that the space of application metrics, whether it's Ruby, Java or PHP, is ruled today by New Relic. In a blogpost New Relic describes serving about 20 Billion Metrics A day.
- The New Relic - Ruby gem https://github.com/newrelic/rpm is the official one
It allows for easy instrumentation of ruby apps, but they also have support for PHP, Java, .NET, and Python
Part of the secret of their success is the ease with which developers can get metrics from their application by adding a few files and a token.
Several other cloud monitoring vendors are stepping into the arena, and I really hope to see them grow the space and give some competition:
- Scout : https://scoutapp.com comes from traditional server management and is slowly moving to application metrics
- Librato : https://metrics.librato.com can leverage existing agents such as StatsD, CollectD, and JMX
- Boundary : https://boundary.com has a focus on realtime view of metrics
- DataDog: http://www.datadoghq.com goes for a complete overview of all your metrics
Some other complementary services, popular amongst developers are:
- Get Exceptional: http://www.getexceptional.com/ It tracks errors in web apps. It reports them in real-time and gathers the info you need to fix them fast.
- Airbrake: http://airbrake.io it collects errors generated by other applications, and aggregates the results for review.
- Alert grid: http://alert-grid.com/ a workflow system, à la Yahoo Pipes, for notifications
- Proby: http://probyapp.com/ Cron monitoring made simple
- Pingdom: http://www.pingdom.com/ Uptime and performance monitoring made easy
- Pagerduty: http://www.pagerduty.com/ Alerting that can be easily hooked into your existing monitoring solution
Check this blogpost on Monitoring Reporting Signal, Pingdom, Proby, Graphite, Monit , Pagerduty, Airbrake to see how they make a powerful team.
User tracking Metrics - Cloud
Clicks, Page view etc ...
Besides the application metrics, there is one other major player in web metrics: Google Analytics.
I found several tools to get data out of it using the Google Analytics API
- Garb - A ruby wrapper for the google analytics API: https://github.com/vigetlabs/garb
- Gem for talking to Google Analytics API: https://github.com/rumble/gattica
- Google Analytics with Ruby and Garb: http://www.viget.com/extend/google-analytics-api-with-ruby-and-garb-making-it-even-easier
- More recent: Google Analytics Data Export API with Ruby + Gattica: http://www.seerinteractive.com/blog/google-analytics-data-export-api-with-rubygattica/2011/02/22
- Gattica - More recent fork by Deviantech, with goals and segment support: https://github.com/chrisle/gattica
With Google Analytics there is always a delay in getting your data.
If you want realtime statistics/metrics, check out Gaug.es http://get.gaug.es :
- Use the Gauges gem: https://github.com/orderedlistinc/gauges-gem to import/export data
A/B Testing
Haven't really gotten into this, but well worth exploring getting metrics out of A/B testing
- Optimizely: http://www.optimizely.com
- Visual Website Optimizer: http://visualwebsiteoptimizer.com
- Google Website Optimizer: http://www.google.com/websiteoptimizer
Page render time
Another important thing to track is the page render time. This is well explained in the Real User Monitoring - Chapter 10 - Complete Web Monitoring - O'Reilly Media
Again New Relic provides RUM: Real User Monitoring. See How we provide real user monitoring: A quick technical review for more technical info
- Keynote: monitoring like a real user experiences your website - http://www.keynote.com
- Real User Monitoring New Relic - http://newrelic.com/rum
- Tracking metrics - Velvet Metrics: http://www.velvetmetrics.com
Who needs a cloud anyway
Putting your metrics into the cloud can be very convenient, but it has downsides:
- most tools don't have a way to redirect/replicate the metrics they collect internally
- that makes it hard to correlate with your internal metrics
- it's easy to get metrics in, but hard to get the full/raw data out again
- it depends on the internet, duh, and sometimes this fails :)
- or privacy concerns or the sheer volume of metrics make it impossible to put them out in the cloud
Application Metrics - Non - Cloud
In his epic Metrics Anywhere, Coda Hale explains the importance of instrumenting your code with metrics. This looks very promising as it is really driven from the developers' world:
Java
- CodaHale Metrics: https://github.com/codahale/metrics allows Capturing JVM- and application-level metrics.
- Simon Java - Simple Monitoring API: http://code.google.com/p/javasimon
- Stajistics a free monitoring and runtime performance statistics collection API for Java https://code.google.com/p/stajistics
- Parfait- Java performance framework: https://code.google.com/p/parfait
Or you can always use JMX to expose monitoring data/metrics from your application:
- JMX - JR Ruby Jmx: https://github.com/enebo/jmxjr allows you to access the Mbeans as a ruby class
- An example using JMX and Jruby: https://github.com/nicksieger/advent-jruby
And with JMX-trans http://code.google.com/p/jmxtrans you can feed JMX information into Graphite, Ganglia, Cacti/Rrdtool, etc.
Other
- Ruby Metrics Equivalent by JohnEwart: https://github.com/johnewart/ruby-metrics
- Ruby - FnordMetric https://github.com/paulasmuth/fnordmetric is a highly configurable (and pretty fast) realtime app/event tracking thing based on ruby eventmachine and redis. You define your own plotting and counting functions as ruby blocks!
- Pinba - Monitoring Php Processing using Timers - http://pinba.org/wiki/Main_Page
- A good overview post about collecting application metrics in java
- An opensource New Relic clone : https://github.com/devmen/FreeRelic
Etsy style: StatsD
To collect various metrics, Etsy has created StatsD https://github.com/etsy/statsd a network daemon for aggregating statistics (counters and timers), rolling them up, then sending them to Graphite.
Clients have been written in many languages: PHP, Java, Ruby, etc.
Other companies have been raving about the benefits of StatsD and for example Shopify has completely integrated it in their environment
- Ruby gems for registering metrics with Statsd - http://rubydoc.info/gems/fozzie
- A Statsd Server in Ruby - https://github.com/fetep/ruby-statsd
- A Statsd Client in Ruby - https://github.com/github/statsd-ruby
- Another Statsd Client in Ruby - http://github.com/bvandenbos/statsd-client
- A Statsd client that isn't a direct port- https://github.com/reinh/statsd
- Statsd instrumentation via metaprogramming methods in Ruby - https://github.com/shopify/statsd-instrument
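To give a feel for how little is needed on the client side, here is a minimal sketch of what such a client sends. It assumes the plain-text StatsD datagram format (name:value|type) and the default UDP port 8125. The module and method names are made up for illustration and are not the API of any of the gems listed above.

```ruby
require 'socket'

# Hypothetical helper, for illustration only.
module TinyStatsd
  # Counter datagram, e.g. "signups:1|c".
  def self.counter(name, delta = 1)
    "#{name}:#{delta}|c"
  end

  # Timer datagram in milliseconds, e.g. "render:320|ms".
  def self.timing(name, millis)
    "#{name}:#{millis}|ms"
  end

  # Fire and forget: UDP gives no delivery guarantee, so a slow or
  # missing StatsD daemon never blocks the application.
  def self.deliver(payload, host = '127.0.0.1', port = 8125)
    socket = UDPSocket.new
    socket.send(payload, 0, host, port)
  ensure
    socket.close if socket
  end
end
```

Calling TinyStatsd.deliver(TinyStatsd.counter('signups')) from application code is then all it takes to bump a counter.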
It's incredible to see the power and simplicity of this; I've created a simple Proof of Concept to extract the statsd metrics on ZeroMQ in this experimental fork
MetricsD https://github.com/tritonrc/metricsd tries to marry Etsy's StatsD and Coda Hale / Yammer's Metrics library for the JVM, and puts the data into Graphite. It should be drop-in compatible with Etsy's StatsD, with added explicit support for meters (the m type) and gauges (the g type), and it introduces the h (histogram) type as an alias for timers (ms).
User tracking - Non Cloud
Clicks, Page view etc ...
Here are some open source web analytics libraries. These are merely links; I haven't investigated them enough yet, so consider this a work in progress:
- Open Web Analytics
- Grape Web Statistics
- Ruwa - Ruby on Rails Web Analytics
- Riopro/piwik - ruby gem
- JKraemer piwik-tracker
- Autometal - Piwik ruby gem
- Awstats Reader - in Python
Another tool worth mentioning for tracking end users is HummingBird - http://hummingbirdstats.com/ . It is NodeJS based and allows for realtime web traffic visualization. To send metrics it has a very simple UDP protocol.
A/B Testing
At Arrrrcamp I saw a great presentation on A/B Testing by Andrew Nesbitt (@teabass). Do watch the video to get inspired!
He pointed out several A/B testing frameworks:
- ABingo : Rails A/B Testing - http://www.bingocardcreator.com/abingo
- Seven Minute Abs: Rails A/B Testing - https://github.com/paulmars/seven_minute_abs
- Vanity: Experiment Driven Development - http://vanity.labnotes.org/
And presented his own A/B Testing framework: Split - http://github.com/andrew/split
It would be interesting to integrate this further into traditional monitoring/metrics tools: view metrics per new version, per enabled flag, etc. In a nutshell: food for thought.
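One way to picture this: if the variant ends up in the metric name, any of the metrics tools can graph the variants side by side. A rough sketch follows. The naming scheme and the hash-based assignment are assumptions for illustration, not how Split, Vanity or ABingo work internally.

```ruby
require 'digest/md5'

# Hypothetical helper, for illustration only.
module ExperimentMetrics
  # Deterministically pick a variant, so the same user always
  # lands in the same bucket for a given experiment.
  def self.variant_for(experiment, user_id, variants)
    digest = Digest::MD5.hexdigest("#{experiment}:#{user_id}")
    variants[digest.to_i(16) % variants.size]
  end

  # Put the variant in the metric name so graphs can be split per variant.
  def self.metric_name(experiment, variant, event)
    "experiments.#{experiment}.#{variant}.#{event}"
  end
end
```

The resulting name can be fed to whatever counter mechanism is already in place, so experiment results show up next to the regular application metrics.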
Page render time
For checking the page render time, I could not really find open source alternatives.
There is a page by Steve Souders about Episodes http://stevesouders.com/episodes/paper.php . Or you can track your Apache logs with Mod Log I/O.
Conclusion
It's exciting to see the crossover between development, operations and business. Up until now only New Relic has a very well integrated suite for all metrics. I hope the internal solutions catch up.
Now that we have all that data, it's time to talk about dashboards and visualization. On to the next blogpost.
If you are using other tools, have ideas, feel free to add them in the comments.
Graphite, JMXTrans, Ganglia, Logster, Collectd, say what ?
Given that @patrickdebois is working on improving data collection I thought it would be a good idea to describe the setup I currently have hacked together.
(Something which can be used as a starting point to improve stuff, and I have to write documentation anyhow)
I currently have 3 sources and one target, which will eventually expand to at least one more target and most probably more sources too.

The first source is typical system data, which I collect using collectd; I'm using collectd-carbon from https://github.com/indygreg/collectd-carbon.git to send the data to Graphite.
I'm parsing the Apache and Tomcat logfiles with logster, currently sending them only to Graphite, but logster has an option to send them to Ganglia too.
And I'm using JMXTrans to collect JMX data from Java apps that expose this data and send it to Graphite. (JMXTrans also comes with a Ganglia target option.)
Rather than going in depth over the config, it's probably easier to point to a Vagrant box I built, https://github.com/KrisBuytaert/vagrant-graphite , which brings up a machine that does pretty much all of this on localhost.
Obviously it's still a work in progress and lots of classes will need to be parametrized and cleaned up. But it's a working setup, and not just on my machine ..
#monitoringsucks and we’ll fix it !
If you are hacking on monitoring solutions and want to talk to your peers who are solving the same problem,
block the Monday and Tuesday after FOSDEM in your calendar!
That's right, on February 6 and 7 a bunch of people interested in fixing the problem will be meeting, discussing and hacking stuff together in Antwerp.
In short a #monitoringsucks hackathon
Inuits is opening up their offices for everybody who wants to join the effort. Please let us (@KrisBuytaert and @patrickdebois) know if you want to join us in Antwerp.
Obviously if you can't make it to Antwerp you can join the effort on ##monitoringsucks on Freenode or on Twitter.
The location will be Duboistraat 50, Antwerp.
It is about a 10-minute walk from the Antwerp Central train station.
Depending on traffic, Antwerp is about half an hour north of Brussels, and there are hotels within walking distance of the venue.
Plenty of parking space is available on the other side of the park.
Monitoring Wonderland Survey – Nagios the Mighty Beast
Controlling the tool everybody hates, but still uses
This blog post mainly contains my findings on getting data in and out of Nagios. That data can be status information, performance information and notifications. At the end there are some pointers on ruby integration with Pingdom and Jira.
The idea is similar to my previous blogposting Monitoring Wonderland Survey - Metrics - API - Gateways: I want to share/open up this data for others to consume, preferably on a bus like system and using events instead of polling.
Nagios - IN
Writing Checks in Ruby
If you want to get data into Nagios, you have to write a check. These are some options for doing this in ruby:
- Nagios Plugins in Ruby: https://github.com/dusty/ruby_nagios
- Ruby to create nagios probes : https://github.com/hobodave/nagios-probe
- A Ruby Gem to easily create Nagios Plugins : https://github.com/jhstatewide/ruby-nagios-plugin
- A Proxy to collect values for Nagios, JMX: http://jrds.fr/nagios
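Whatever library you pick, a check boils down to the same contract: the exit code carries the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the first output line carries the message, optionally followed by performance data after a pipe character. Below is a bare-bones sketch of that contract without any gem. The LoadCheck module, the load label and the thresholds are made up for illustration.

```ruby
# Hypothetical check logic, for illustration only.
module LoadCheck
  STATES = { 0 => 'OK', 1 => 'WARNING', 2 => 'CRITICAL', 3 => 'UNKNOWN' }.freeze

  # Returns [exit_code, output_line]. Everything after the "|" is
  # performance data in the form label=value;warn;crit.
  def self.evaluate(value, warn, crit)
    return [3, 'LOAD UNKNOWN - no value'] if value.nil?

    code = if value >= crit
             2
           elsif value >= warn
             1
           else
             0
           end
    [code, "LOAD #{STATES[code]} - load is #{value} | load=#{value};#{warn};#{crit}"]
  end
end
```

A real plugin script would print the returned line and call exit with the returned code; that is all Nagios looks at.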
Projects that link testing and monitoring:
- Nagios Test Framework: https://github.com/marineam/nagcat
- Pager Unit, Nagios Alternative to look like unit tests: https://github.com/rcrowley/pagerunit
- Cucumber - Nagios : http://auxesis.github.com/cucumber-nagios/
Transporting check results
Nagios has many ways to collect the results of these checks:
- Using Send NSCA - Ruby gem: https://github.com/kevinzen/send_nsca
- Or using NRPE (PDF), if enabled
You can test NRPE with the standalone NRPE runner
And maybe schedule the Nagios NRPE checks with Rundeck
If you don't like spawning a separate ruby process for each check, you can leverage Metis: https://github.com/krobertson/metis
Transport over a bus system
Instead of using the traditional provided interfaces, people are starting to send the check information over a bus for further handling:
- Krolyk is a daemon which consumes Nagios check results from RabbitMQ and writes these to the Nagios command pipe. http://www.smetj.net/wiki/Krolyk
- Moncli is a generic MONitoring CLIent which executes and processes requests on an external system in order to interact with the host's local information sources which are normally not available over the network. http://www.smetj.net/wiki/Moncli
- Sensu - Uses RabbitMQ and Pub/Sub Redis to scale the checks collection https://github.com/sonian/sensu
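Whatever transport sits in front of it, the last hop into Nagios is usually a passive check result written to the command pipe as an external command. A minimal sketch of that last hop follows. The PassiveResult module is hypothetical, and the pipe path is an assumption that differs per installation.

```ruby
# Hypothetical helper, for illustration only.
module PassiveResult
  # Format a passive service check result as a Nagios external command:
  # [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
  def self.build(host, service, code, output, now = Time.now)
    "[#{now.to_i}] PROCESS_SERVICE_CHECK_RESULT;#{host};#{service};#{code};#{output}"
  end

  # Write the command to the Nagios command pipe.
  # The default path is an assumption; check command_file in your nagios.cfg.
  def self.submit(line, pipe = '/usr/local/nagios/var/rw/nagios.cmd')
    File.open(pipe, 'a') { |f| f.puts(line) }
  end
end
```

A bus consumer would call PassiveResult.build for every message it receives and hand the result to PassiveResult.submit.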
Look ma, no Nagios Server needed
Some people have taken an alternative approach, reusing the check libraries but within their own framework.
- Sensu : Framework that uses Nagios checks and Rabbitmq and Pub/Sub Redis: https://github.com/sonian/sensu and its dashboard https://github.com/sonian/sensu-dashboard
- Sentry : a Nagios clone in Ruby https://github.com/alexch/sentry
- Eyes : a project to enable quick, simple, and API enabled monitoring and data collection. It wraps Nagios plugins inside Django: http://packages.python.org/eyes/ . A mailing list is available at http://groups.google.com/group/eyes-monitoring and the tracker is at https://www.pivotaltracker.com/projects/274785
Nagios - OUT
Reading Status
As there is no official API to extract status information from Nagios, people have been implementing various ways of getting to the data:
Scraping the UI
Well if we really have to ...
Parsing status.dat file
All status information from Nagios is stored in the status.dat file, so several people have started writing parsers for it and exposing it as an API:
- A REST frontend in Sinatra - https://github.com/ohookins/sinagios
- Another REST API - https://github.com/kerphi/RESTnag
- Nagios API: incomplete but mostly state, downtime, results: https://github.com/xb95/nagios-api
- A CLI tool and Ruby library that parses your status log file and lets you query it for information or create external commands: https://github.com/ripienaar/ruby-nagios
- Old Ruby Interface for Status-file data http://rubyforge.org/projects/nag-ruby
- Nagira - Nagios RESTful API - https://github.com/dmytro/nagira/tree/master/lib
- NagiosR - exposes nagios status via csv and json using sinatrarb: https://github.com/discordianfish/nagiosr/blob/master/nagiosr.rb
Nagios-Dashboard parses the nagios status.dat file & sends the current status to clients via an HTML5 WebSocket. The dashboard monitors the status.dat file for changes, any modifications trigger client updates (push). Nagios-Dashboard queries a Chef server or Opscode platform organization for additional host information.
- Nagios Dashboard - https://github.com/portertech/nagios-dashboard
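The file itself is a sequence of blocks such as hoststatus { ... } and servicestatus { ... }, each holding key=value lines, which is why so many small parsers exist. Here is a rough sketch of such a parser. It is not taken from any of the projects above, and the StatusDat name is made up.

```ruby
# Hypothetical parser, for illustration only.
module StatusDat
  # Parse status.dat text into an array of [block_type, attributes] pairs.
  def self.parse(text)
    blocks = []
    current = nil
    text.each_line do |raw|
      line = raw.strip
      next if line.empty? || line.start_with?('#')

      if (m = line.match(/\A(\w+)\s*\{\z/))
        current = [m[1], {}]            # start of a block, e.g. "hoststatus {"
      elsif line == '}'
        blocks << current if current    # end of the block
        current = nil
      elsif current && line.include?('=')
        key, value = line.split('=', 2) # values may contain "=" themselves
        current[1][key] = value
      end
    end
    blocks
  end
end
```

Selecting the servicestatus blocks with a non-zero current_state from the result is then enough to build a simple "what is broken" endpoint.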
Parsing the log files
- A Ruby Cli tool - to parse log file Nagios: https://github.com/ripienaar/ruby-nagios
Using Checkmklivestatus
A better option to get ad hoc status is to query Nagios via CheckMK_Livestatus http://mathias-kettner.de/checkmk_livestatus.html . It is a Nagios Event Broker (NEB) that hooks directly into the Nagios Core, allowing it direct access to all structures and commands. NEBs are very powerful; for more information look at the event broker section of the Nagios book.
Tools that use this API :
- https://github.com/RECIA/nagios_mklivestatus
- https://github.com/sni/Monitoring-Livestatus
- https://github.com/zenops/livestatus
- Nagios CLI, Livestatus, Python: https://github.com/ning/ngsh
- Nagios SSH Command Pipe: https://github.com/scy/nscp
- Nagios Light Web Interface with a JSON API, written in Ruby: https://github.com/rhaamo/naglight
- Nagios GUI Framework for Livestatus: https://github.com/Bastian-Kuhn/NagUI
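Under the hood these tools all speak the Livestatus query language: a GET line naming a table, a few header lines, and a blank line to finish, sent over a Unix socket. Here is a small sketch of that exchange without any of the libraries above. The Livestatus module name is made up, and the socket path is an assumption that depends on how the broker module was configured.

```ruby
require 'socket'
require 'json'

# Hypothetical helper, for illustration only.
module Livestatus
  # Build a query: GET line, header lines, terminating blank line.
  def self.query(table, columns, filters = [])
    lines = ["GET #{table}", "Columns: #{columns.join(' ')}"]
    filters.each { |f| lines << "Filter: #{f}" }
    lines << 'OutputFormat: json'
    lines.join("\n") + "\n\n"
  end

  # Send the query over the Unix socket and parse the JSON answer.
  # The default path is an assumption; use the one from your broker_module line.
  def self.fetch(lql, socket_path = '/var/lib/nagios/rw/live')
    UNIXSocket.open(socket_path) do |sock|
      sock.write(lql)
      sock.close_write
      JSON.parse(sock.read)
    end
  end
end
```

For example, Livestatus.query('services', %w[host_name description state], ['state = 2']) asks for all services that are currently critical.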
Querying the database/NDO
An alternative NEB handler is NDO Utils (NDO2DB). It stores all the information in a database. Or you can use NDO2FS, which stores the NDO data as JSON on a filesystem.
Hooking into performancehandler
RI Pienaar shows us how to hook into a process-service-perfdata handler and log that information to a file.
The advantage is that we get the information evented instead of having to poll for status information. In other words, it is ready to be put on a message bus for others to read.
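What such a handler receives is the plugin's performance data string, a space separated list of label=value[unit];warn;crit;min;max items. Before putting it on a bus it helps to split it into fields. The sketch below is not RI Pienaar's code; the Perfdata module is made up, and quoted labels containing spaces are deliberately not handled.

```ruby
# Hypothetical parser, for illustration only.
module Perfdata
  ITEM = /\A'?([^'=]+)'?=(-?[\d.]+)([^;\s]*)((?:;[^;\s]*)*)\z/

  # Parse "label=value[unit];warn;crit;min;max" items separated by spaces.
  # Items that do not match the pattern are skipped.
  def self.parse(perfdata)
    items = perfdata.split.map do |item|
      m = ITEM.match(item)
      next unless m

      # m[4] looks like ";1;2", so the first split element is always empty.
      warn, crit, min, max = m[4].split(';', -1).drop(1)
      { label: m[1], value: m[2].to_f, unit: m[3],
        warn: warn, crit: crit, min: min, max: max }
    end
    items.compact
  end
end
```

Each resulting hash maps nicely onto one metric message: the label becomes part of the metric name and the value is the datapoint.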
Listening in to events with NEB/Message queue
In order to get the events as fast as possible, I looked into using a NEB to put information on a message queue directly.
I found the following sample code:
Marius Sturm had Nagios-ZMQ https://github.com/mariussturm/nagios-zmq which puts the events directly on the queue. I extended it to not only read the check results and performance data, but also the notifications.
It seems Icinga is taking a similar approach with Icinga - ZMQ - icingamq, in order to enable high performance, large scale monitoring.
An interesting difference is that it will also expose the CheckMK Livestatus API directly over ZeroMQ.
Adding Hosts dynamically
A bit of a sidetrack, but one of the things a lot of people struggle with is dynamically adding hosts/servers to Nagios without restarting it. The following links kind of try to solve this problem, but none solves it completely. It seems most people solve this by some interaction with a configuration management system and a system inventory.
- Adding hosts & Services to Nagios Daemon
- Inventory Check , via Check_mk
- Creating Check_mk nagios configurations with Puppet
- Check_MK via OMD
- Nagios Automation - Sinatra- and Resque-based API endpoint for Chef Nagios automation. The recipe that generates the API calls can be found in the Cloudspace Ops repo.
To read and write the configs, people have written various parsers:
- Parsing Nagios Objects
- Nagios config file parser
- Nagios Config file generator
- Nagios parser and generator - old
The reload problem doesn't look like an easy one to solve: one could create a NEB that manipulates the in-memory host/service structures, but it would also need to persist that on disk. If anyone has a good solution, please let us know!
Notification handling
There are a lot more problems with Nagios, but people still use its notification and acknowledgement system. Some interesting things I found:
- Angelia - Tool to facilitate the development of nagios notification methods using many different protocols and delivery system : https://github.com/ripienaar/angelia
- Nagios Aggregate Notification System , to solve mass duplication of alerts: https://github.com/bobpattersonjr/nans
Pingdom
If Pingdom is your game, here are some APIs to get information into Pingdom and to read the status:
- How to inject your own checks into pingdom - http://jonsview.com/how-i-use-pingdoms-http-custom-feature
- Pingdom Restful API Client: https://github.com/mtodd/pingdom-client
I could not find a way to make this evented; we'll have to create one.
Jira Notification
I found 4 libraries to interact with Jira - from ruby:
- Jira4r - Jira for ruby: http://jira4r.rubyhaus.org/
- Nagios Jira Ticket Generator https://github.com/MrPink/Nagios-Jira-ticket-generator
- Jira Ruby: https://github.com/trineo/jira-ruby
- Jira Sample Client - JIRA soap interface/Ruby: http://svn.atlassian.com/svn/public/contrib/jira/jira-rpc-samples/src/ruby/jira4rsample.rb
Conclusion:
- We can get a long way to automate getting data in and out of Nagios
- Exposing the API through the Livestatus works really well
- Using the NEB Nagios-ZMQ will allow us to get the information in an evented way
- Adding hosts dynamically still seems to be an issue
By listening in on the events over a queue, we could create a self-service system for Nagios events similar to Tattle, which does the same for Graphite:
- Tattle - Self servicing Alerts - Escape Conference - Draco2002: http://www.slideshare.net/Draco2002/graphite-tattle
Next blogpost we'll move up the stack a bit and start investigating options for application and enduser usage metrics.
Monitoring Wonderland Survey – Metrics – API – Gateways
Update 4/01/2012: added ways to add metrics via logs, java pickle graphite feeder
One tool to rule them all? Not.
If you are working within an enterprise, chances are that you have different metric systems in place: you might have some Cacti, Ganglia, Collectd, etc. due to historical reasons, different departments, and so on.
This reminded me of the situation while I was working in Identity Management: you might have an LDAP, Active Directory, local HR database etc. There would be plans and discussions of using one over the other, and gateways would need to be written. I learned a few lessons there:
- have as few sources/stores of information as possible
- DON'T try to chase the one tool to rule them all, aka don't use a tool for something it's not made for
- make it self-service for users and automate processes
1 to 1 gateways
Take the new metrics hotness Graphite as an example: it has some nice graphing advantages over other tools. So people wonder, should I migrate my Ganglia, Collectd to Graphite? Graphite doesn't come with elaborate collection scripts for memory/disk/etc., so we have to rely on other tools like Cacti, Munin, Collectd, Ganglia to first collect the data.
So we start writing gateways to get data into Graphite:
- Collectd -> Graphite plugin (using perl)
- Collectd -> Graphite (Loggly - Nodejs) - proxy httpd
- Diamond-gmond - python - ganglia -> graphite
But what happens if we also use Opentsdb for storing long term data? We have to re-implement those gateways.
Issue 1 : Effort duplication
This just seems like a waste of energy, implementing the protocol in every tool. This sure isn't the first time this has happened: the same thing happened for the Collectd -> Ganglia plugin.
If you look at the data that is transmitted it is actually pretty much the same:
a metric name, value, timestamp, optionally hostname, some metadata tags
So we could easily envision a 'universal' format that would be used to translate from and to.
Ganglia <-> Intermediate format <-> Graphite
Collectd <-> Intermediate format <-> Opentsdb
With this intermediate format, we would only have to write each end of the equation once.
I started thinking of this like an ffmpeg for monitoring
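To make the idea a bit more tangible, here is a sketch of what such an intermediate metric could look like, with one writer per target system. The IntermediateMetric name and its methods are hypothetical. The output formats assume Graphite's plaintext line (path value timestamp) and OpenTSDB's put line (put metric timestamp value tag=value).

```ruby
# Hypothetical intermediate format; the fields follow the list above.
IntermediateMetric = Struct.new(:name, :value, :timestamp, :host, :tags) do
  # Graphite plaintext line: "path value timestamp".
  # Graphite has no tags, so the host is folded into the path.
  def to_graphite
    path = [host, name].compact.join('.')
    "#{path} #{value} #{timestamp}"
  end

  # OpenTSDB line: "put metric timestamp value tagk=tagv ...".
  # Here the host travels as a tag instead.
  def to_opentsdb
    all_tags = (tags || {}).merge(host ? { 'host' => host } : {})
    pairs = all_tags.map { |k, v| "#{k}=#{v}" }.join(' ')
    "put #{name} #{timestamp} #{value} #{pairs}".strip
  end
end
```

With N sources and M targets this means N readers plus M writers, instead of a separate gateway for every pair.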
Issue 2: Difficult to hook in additional listeners
Let's add another system that wants to listen in on the metrics, something like Esper, Nagios alerting, some data warehouse tools, etc. We could reuse the libraries from one end to the other, but we'd have to add more gateways and put these in place every time.
A better approach would be to use a message bus: every tool publishes to and listens on a bus and gets the data it needs. RI Pienaar has written about this approach extensively in his Series on Common Messaging Patterns. Also John Bergmans has a great post on using AMQP and Websockets to get realtime graphics.
Some of the tools already have message queue integrations, but there seems to be a common intermediate format missing:
- Graphite - Rabbit Mq integration: http://www.somic.org/2009/05/21/graphite-rabbitmq-integration/
- Graphite - AMQP integration: https://code.launchpad.net/~lucio.torre/graphite/graphite-add-rabbitmq/+merge/16816
- Graphilia - Graphite AMQP: https://github.com/fetep/graphlia/blob/master/graphlia.py
- Collectd - Plugin:AMQP - Transmit or receive value by collectd: http://collectd.org/wiki/index.php/Plugin:AMQP
- Collectd- ZeroMQ: https://github.com/deactivated/collectd-write-zmq/
As a proof of concept I've created :
- Ganglia-Zeromq gateway in Ruby: https://github.com/jedi4ever/gmond-zmq
- Collectd-Zeromq gateway in Ruby: https://github.com/jedi4ever/collectd-zmq
Building blocks
In this section I'll look for API's (ruby oriented) to get data in and out of the different metrics systems:
Graphite - IN
Sending metrics from ruby to Graphite:
- Graphite Gem: https://github.com/otherinbox/graphite
- Simple graphite Gem: https://github.com/imeyer/simple-graphite
These both implement the Simple Protocol, but for high performance we'd like to use the batching facility through the Pickle Format. I could not find a Pickle gem for Ruby, but this could work through the Ruby-Python gateway http://rubypython.rubyforge.org/.
Faster - a Java Netty based graphite relay takes the same approach https://github.com/markchadwick/graphite-relay
Another way to get your data into graphite is using Etsy's Logster https://github.com/etsy/logster
Mike Brittain explains its use very well in Take my logs... Please!, a Velocity Online Conference session (Video, PDF).
Graphite - OUT
Getting all the data out of Graphite is impossible through the standard API. You can get a graph out as raw data, but that hardly counts.
The best option seems to be to listen in to the graphite - udp receiver and duplicate the information onto a message bus.
An alternative might be to directly read from the Whisper storage, inspiration for that can be found in:
- Whisper - Ruby gem: https://github.com/eric/whisper-rb
- Ruby interface to Graphite's Whisper file format : https://github.com/mleinart/graphite_storage
- Merging Whisper files : https://github.com/damaex17/whisper-merge
- Hoard - A Whisper alike for Nodejs: https://github.com/cgbystrom/hoard
Opentsdb - IN
I could not find any ruby gem that implements the Opentsdb protocol for sending data, but creating one should be trivial. Opentsdb just uses a plain TCP socket to get the data in.
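To back up the "trivial" claim, here is a sketch of what such a client could look like. It assumes the line-based put command (put metric timestamp value tag=value ...) and the default port 4242. TinyTsdbClient is a made-up name, not an existing gem.

```ruby
require 'socket'

# Hypothetical client, for illustration only.
class TinyTsdbClient
  # io is anything that responds to #write; normally a TCP connection,
  # but a StringIO works just as well for trying things out.
  def initialize(io)
    @io = io
  end

  # Open a TCP connection to a TSD (host and port are assumptions).
  def self.connect(host = 'localhost', port = 4242)
    new(TCPSocket.new(host, port))
  end

  # Send one datapoint; OpenTSDB expects at least one tag per datapoint.
  def put(metric, value, tags, time = Time.now)
    tag_str = tags.map { |k, v| "#{k}=#{v}" }.join(' ')
    @io.write("put #{metric} #{time.to_i} #{value} #{tag_str}\n")
  end
end
```

Usage would be along the lines of TinyTsdbClient.connect.put('sys.cpu.user', 42.5, 'host' => 'web1'). Error handling and reconnects are left out on purpose.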
Opentsdb - OUT
Getting data out of Opentsdb suffers from the same problem as Graphite: you can do queries on specific graph data
- Ruby gem - opentsdb API: https://github.com/j05h/continuum
But you can't get all of it out, except maybe by interfacing directly with the HBase/Java API. So again the best bet is to create a listener/proxy for the simple TCP protocol.
Ganglia - IN
Sending metrics to Ganglia is easy using the gmetric shell command. Early days code describing this can still be found at http://code.google.com/p/embeddedgmetric/
Igrigorik has written up nicely on how to use the Gmetric Ruby gem to send metrics
- Gmetric Ruby Gem - https://github.com/igrigorik/gmetric
- An HTTP Wrapper to send metrics: https://github.com/garethr/gmetric-web
If you want to feed log files into Ganglia, Logtailer might be your thing: https://bitbucket.org/maplebed/ganglia-logtailer
Ganglia - OUT
Vladimir describes the options while explaining how to get Ganglia data into Graphite.
Option 1 is to poll Gmond over TCP and get the XML of its current data.
Option 2 is to listen in on the UDP protocol as an additional receiver.
I implemented both approaches in https://github.com/jedi4ever/gmond-zmq
Note: As a side effect I found that the metrics sent over UDP are actually more accurate than the values you get when you query the XML.
Collectd - IN
To send metrics to Collectd, you can use the ruby gem from Astro that implements most of the UDP protocol:
- Collectd Ruby gem by Astro - https://github.com/astro/ruby-collectd/
Collectd - OUT
Collectd gets my prize for best output.
It currently implements different writers:
- Network plugin
- UnixSock plugin
- Carbon plugin
- CSV
- RRDCacheD
- RRDtool
- Write HTTP plugin
And the ZeroMQ writer by deactivated - https://github.com/deactivated/collectd-write-zmq
The Binary Protocol http://collectd.org/wiki/index.php/Binary_protocol is pretty simple to listen into.
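As far as I understand the protocol page, a packet is simply a sequence of parts, each with a 16-bit type, a 16-bit length (which includes the 4 header bytes) and a payload. Here is a sketch of a listener's first step, splitting a packet and decoding only the string parts. The CollectdParts module is made up, and numeric parts such as time and values are left undecoded.

```ruby
# Hypothetical decoder, for illustration only.
module CollectdParts
  # Type ids of the null-terminated string parts.
  STRING_PARTS = { 0x0000 => :host, 0x0002 => :plugin, 0x0003 => :plugin_instance,
                   0x0004 => :type, 0x0005 => :type_instance }.freeze

  # Split a datagram into [type_id, payload] pairs.
  def self.split(packet)
    parts = []
    offset = 0
    while offset + 4 <= packet.bytesize
      type, length = packet.byteslice(offset, 4).unpack('nn') # big-endian uint16s
      break if length < 4 # malformed part; stop rather than loop forever

      parts << [type, packet.byteslice(offset + 4, length - 4)]
      offset += length
    end
    parts
  end

  # Decode only the string parts into a hash, dropping the trailing null byte.
  def self.strings(packet)
    split(packet).each_with_object({}) do |(type, payload), out|
      key = STRING_PARTS[type]
      out[key] = payload.unpack1('Z*') if key
    end
  end
end
```

Feeding it the datagrams read from a UDPSocket bound to the collectd network port gives you host, plugin and type names to build a metric name from.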
Munin
If you happen to use Munin, here's some inspiration, but I haven't researched it much
- API Client - https://github.com/sosedoff/munin-ruby
- Munin Network Protocol - http://munin-monitoring.org/wiki/network-protocol
- Rails Plugin that implements the munin-node protocol to allow Rails Internals to be graphed by Munin - https://github.com/jamesotron/Muninator
- Munin Node - Ruby: http://www.devco.net/archives/2011/10/02/interact-with-munin-node-from-ruby.php
Circonus
If you happen to use Circonus, here's some inspiration, but I haven't researched it much
RRD interaction from ruby
For those who want to read and write directly from RRD's in ruby, please have fun:
- FFI -RRD - https://github.com/morellon/rrd-ffi
- RRD-RB Ruby - RRD - http://code.google.com/p/rrd-rb/
- RRD-graph-ruby - https://github.com/ion1/rrd-graph-ruby
- RRDtool - rrdruby - http://oss.oetiker.ch/rrdtool/prog/rrdruby.en.html
Alert on metrics:
With all the tools in and out, and a unified intermediate format, it will be trivial to rewrite the traditional alert check tools to listen on the bus for values. This means your Nagios, your ticket system, your pager system, etc. can all listen to the same source.
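As a thought experiment, this is roughly what "several consumers on one feed" looks like. The TinyBus class is only an in-process stand-in for a real bus such as ZeroMQ or AMQP, and the Alerting module with its thresholds is made up for illustration.

```ruby
# In-process stand-in for a message bus, for illustration only.
class TinyBus
  def initialize
    @listeners = []
  end

  # Register a consumer; every consumer sees every metric.
  def subscribe(&block)
    @listeners << block
  end

  def publish(metric)
    @listeners.each { |listener| listener.call(metric) }
  end
end

# Hypothetical threshold logic shared by all alerting consumers.
module Alerting
  def self.state(value, warn, crit)
    return :critical if value >= crit
    return :warning if value >= warn

    :ok
  end
end
```

One subscriber could forward everything to Graphite while another calls Alerting.state and only pages someone when the result is not :ok; neither needs to know about the other.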
Graphite
Opentsdb
- Check_TSD - Opentsdb : http://opentsdb.net/nagios.html
Ganglia
- http://blog.vuksan.com/2011/04/19/use-your-trending-data-for-alerting/
- https://github.com/mconigliaro/check_ganglia_metric
- https://github.com/daniyalzade/nagios-ganglia-plugin
- https://github.com/larsks/check_ganglia
New Relic
https://github.com/kogent/check_newrelic
Conclusion
It should be feasible to create an intermediate format and reuse some of these libraries to implement both IN and OUT functionality. Why not create a Fog for monitoring information? Something that implements metric receive, send, ...
Next stop Nagios, because it deserves a blogpost of its own ...
Monitoring Wonderland Survey – Introduction
Introduction
While Automation is great to get you going and doing things faster and reproducible, Monitoring/Metrics are probably more valuable for learning and getting feedback from what's really going on. Matthias Meyer describes it as the virtues of monitoring. Nothing new, if you have been listening to John Allspaw on Metrics Driven Engineering (pdf), essentially putting the science back in IT as Adam Fletcher noted at the Boston devopsdays openspace session on What does a sysadmin look like in 10 years
Eager to help
Over the years I've done my fair share of monitoring setups, but in the last few years I was more focused on automation. I would automate the hell out of any monitoring system the customer had. But after a while, this felt too much like standing on the sideline. This feeling got amplified by the Monitoring Sucks initiative of John Vincent: an initiative to improve the field where we can. The initiative has already spawned some very good blogposts, and one of the first, monitoring sucks watch your language, where they try to create a common vocabulary, reminded me a lot of the early 'what is devops' postings. So after Jason Dixon said Monitoring Sucks, Do Something About It, I decided to widen my focus again from automation to monitoring. And I found a great partner in Atlassian.
I'm certainly not the first person to do this, but I'm eager to help in the space. People like RI Pienaar have done some amazing ground work thinking about Monitoring Frameworks and making them Composable Architectures. One of the exciting areas I'd like to focus on is trying to make monitoring/metrics as easy as 'monitoring up' for developers, and bringing the traditionally operational tools into development land to better understand their application. We learned from configuration management that having common tools and a common language greatly helps overcome the devops divide.
Before jumping into the space, we decided to research the existing space extensively, with its problems and solutions. This blogpost series is a summary of these findings and will therefore contain a lot of links.
Non technical reading
This series of blogposts is tools focused, not monitoring approach oriented, more on that in later posts, but for now I'll refer you to :
- Web Operations book (where I have a chapter on Monitoring)
- The Art of Capacity Planning
- Complete Web Monitoring: Watching your visitors, performance, communities, and competitors
Note:
- You will find that some tools were more predominantly researched, that's because the research was done from the perspective of Atlassian's current and future metrics/monitoring environment.
- Also you will notice a slant towards ruby libraries, that's mainly because I feel most productive in it and I'm thinking integration with chef/puppet/fog/vagrant etc.
- The main focus will be on open source solutions where available, and on commercial ones wherever there is a gap.
Meet the players
For people new in the field, I'd like to give a quick overview of the current players in the field, together with their official links and, where possible, links to available books.
A good up-to-date overview can be found in Jason Dixon's presentation Trending with Purpose and in Joshua Barratt's Getting more signal from your noise (PDF). I especially liked his approach of looking at these tools from the Collect - Transport - Process - Store - Present perspectives.
Metrics
In the 'old' days, people first focused on the collect and transport problem. The standard for timeseries Storage was RRD Round Robin Database, and people would choose their metrics tools based on the collection scripts that were available. (Similar to how people choose cloud or config management it seems)
- Cacti: http://www.cacti.net/ - [Cacti 0.8 Network Monitoring book]
- Munin: http://munin-monitoring.org/
- Collectd: http://collectd.org/
As the number of servers started to grow, people wanted to have a scalable way of collecting, aggregating and transporting the data.
- Ganglia: http://ganglia.sourceforge.net/
Even with the help of RRD cache, the storage of all these metrics was becoming the new bottleneck, so alternatives had to be found. So Graphite introduced Whisper and Opentsdb decided to build on top of Hadoop. And as the volume of data was increasing, it was begging for a self-service way of visualizing the data.
- Graphite: http://graphite.wikidot.com/ - [Chapter in The Architecture of Open Source Applications book]
- Opentsdb: https://github.com/stumbleupon/opentsdb - [OpenTSDB chapter in Professional Nosql book]
Alerting, notification, availability
All these metric tools kind of ignore alerting, notification and acknowledgement, and rely on the real monitoring systems for that. So you need to complement them with some warning system like the following:
- Nagios: http://www.nagios.org/ - [Nagios 3 - Enterprise Network Monitoring book]
- Icinga: https://www.icinga.org
- Zabbix: http://www.zabbix.com/ - [Zabbix 1.8 Network Monitoring book]
- Zenoss: http://www.zenoss.com/ - [Zenoss Core 3.x Network and System Monitoring book]
- Reconnoiter : http://labs.omniti.com/labs/reconnoiter
Note that most of them fall short when it comes to scaling, flexibility and graphical overview.
Beyond servers , to applications , to business
Now that we have gotten better at monitoring and metrics of servers, we are seeing better integration with application and business metrics:
- New Relic: http://newrelic.com/
- Statsd/Etsy: https://github.com/etsy/statsd
- Jmxtrans: http://code.google.com/p/jmxtrans/
The next blogposts will contain more meat of tools surrounding, enhancing, bypassing these 'traditional players'. Stay tuned...