While working on the Devops Cookbook with my fellow authors Gene Kim, John Willis and Mike Orzen, we are gathering a lot of "devops" practices. For some time we struggled with structuring them in the book. I figured we were missing a mental model to relate the practices/stories to.
This blogpost is a first stab at providing a structure to codify devops practices. The wording and descriptions are very much a work in progress, but I found them important enough to share in order to get your feedback.
Devops in the right perspective
As you probably know by now, there are many definitions of devops. One thing that occasionally pops up is that people want to change the name to extend it to other groups within the IT area: star-ops, dev-qa-ops, sec-ops, ... From the beginning, I think the people involved in the first devops thinking had the idea of expanding the thought process beyond just dev and ops (but a name like bus-qa-sec-net-ops wouldn't be that catchy :).
I've started referring to:
- devops : collaboration and optimization across the whole organisation, even beyond IT (HR, Finance, ...) and company borders (suppliers)
- devops 'lite' : when people zoom in on 'just' dev and ops collaboration.
As rightly pointed out by Damon Edwards, devops is not about a technology; devops is about a business problem. The Theory of Constraints tells us to optimize the whole and not the individual 'silos'. For me that whole is the business-to-customer problem, or in lean speak, the whole value chain. Bottlenecks and improvements can happen anywhere and have a local impact on the dev and ops parts of the company.
So even if your problem exists in dev or ops, or somewhere in between, the optimization might need to be done in another part of the company. As a result, describing prescriptive steps to solve the 'devops' problem (if there is such a problem) is impossible. The problems you're facing within your company could be vastly different, and the solutions to your problem might have different effects/needs.
While we can't be prescriptive, we can gather practices people have been using to overcome similar situations. I've always encouraged people to share their stories so other people can learn from them (one of the core reasons devopsdays exists). This helps in capturing practices; whether they are good or best practices I'll leave open for debate.
Currently a lot of the stories/practices zoom in on areas like deployment, dev and ops collaboration, metrics, etc. (Devops Lite). This is a natural evolution of having dev and ops in the term's name, and given the background of the people currently discussing the approaches. I hope that in the future this discussion expands to other company silos too: f.i. synergizing HR and devops (Spike Morelli) or relating our metrics to financial reporting.
Another thing to be aware of is that a system/company is continuously in flux: whenever something in the system changes, it can have an impact. So you can't take for granted that problems and bottlenecks will not re-emerge after some time. It needs continuous attention. That will be easier if you get closer to a steady state, but still, devops, like security, is a journey, not an end state.
Beyond just dev and ops
Let's zoom in on some of the practices that are commonly discussed: the direct field between 'dev' and 'ops'.
In most cases, 'dev' actually means 'project' and 'ops' represents 'production'. Within projects we have methodologies like Scrum, Kanban, ... and within operations ITIL, Visible Ops, ... Both parts have been extending their methodology over the years: from the dev perspective this has led to Continuous Delivery, and on the ops side ITIL was extended with Application Lifecycle Management (ALM). They both worked hard on optimizing their individual part of the company and less on integrating with the other parts. Those methodologies had a hard time solving a bottleneck that lies outside their 'authority'. I think this is where devops kicks in: it seeks active collaboration between the different silos so we can start seeing the complete system and optimize where needed, not just in the individual silos.
In my mental model of devops there are four 'key' areas:
- Area 1 : Extend delivery to production (Think Jez Humble) : this is where dev and ops collaborate to improve anything on delivering the project to production
- Area 2 : Extend Operation to project (Think John Allspaw) : all information from production is radiated back to the project
- Area 3 : Embed Project(Dev) into Operations : when the project takes co-ownership of everything that happens in production
- Area 4 : Embed Production(Ops) into Project : when operations are involved from the beginning of the project
In each of these areas there will be a bi-directional interaction between dev and ops, resulting in knowledge exchange and feedback.
Depending on where your most pressing 'current' bottleneck manifests itself, you may want to address things in different areas. There is no need to first address things in area 1 before area 2. Think of them as pressure points that you can stress, but that require balanced pressure.
Areas 1 and 2 tend to be heavier on the tools side, but are not strictly tools-focused. Areas 3 and 4 will be more related to people and cultural changes, as their 'reach' is further down the chain.
When visualized in a table this gives you:
As you can see:
- the DEV and OPS part keep having their own internal processes specific to their job
- the two processes are becoming aligned and the areas extend both DEV and OPS to production and projects
- it's almost like a double loop with area1 and area2 as the first loop and area3 and area4 as the second loop
Note 1: these areas definitely need 'catchier' names to make them easier to remember. Note 2: Ben Rockwood's post on "The Three Aspects of Devops" already lists three aspects, but I think the areas make it more specific.
In each of these areas, we can interact at the traditional 'layers': tools, process, people.
So whenever I hear a story, I try to relate its practice to one of the areas described above and the layer it's addressing. Practices can have an impact at different layers, so I see them as 'tags' to quickly label stories. Another benefit is that whenever you look at an area, you can ask yourself what practices could improve each of its layers. To have maximum impact, it's clear that the approach needs to address all three layers.
The ultimate devops tool would support the whole people and process side in all of these areas, not just in Area1 (deployment) or Area2 (monitoring/metrics). Therefore a devops toolchain, with different tools interacting in each of the areas, makes more sense. Also, a tool by itself isn't a devops tool: configuration management systems like Chef and Puppet are great, but when applied in ops only they don't help our problem much. Of course ops gets infrastructure agility, but it isn't until the tool is applied to the delivery process (f.i. to create test and development environments) that it becomes 'devops'. This shows that it's the mindset of the person applying the tool that makes it a devops tool, not the tool by itself.
Area Maturity Levels
Now that we have the areas and layers identified, we want to track progress as we start solving our problems and improving things.
CMMI levels allow you to quantify the 'maturity' of your process. That addresses only one layer (although an equally important one). In a nutshell, CMMI describes the different levels as:
- Initial : unpredictable, poorly controlled process, reactive in nature
- Managed : focused on the project, still reactive in nature
- Defined : focused on the organization, proactive
- Quantitatively Managed : measured and controlled approach
- Optimizing : focus on improvement
All these levels could be applied to dev, ops, or devops combined. It gives you an idea of what level your process is at while you are optimizing in an area.
An alternative way of expressing maturity levels is used by the Continuous Integration Maturity Model. It puts a set of practices into levels of maturity (industry consensus):
- Intro : using source control ...
- Novice : builds triggered by commit ...
- Intermediate : Automated deployment to testing ...
- Advanced : Automated Functional testing ...
- Insane : Continuous Deployment to Production ...
Instead of focusing on the process only, this could be applied to a set of tools, process, or people practices. What people consider the most advanced would get the highest maturity level.
Practices, Patterns and principles
A practice could be anything from an anecdotal item to a systemic approach. Similar practices can be grouped into patterns to elevate them to another level. Similar to Software Design Patterns, we can start grouping devops practices into devops patterns.
Practices and patterns rely on principles, and it's these underlying principles that will guide you on when and how to apply a pattern or practice. These principles can be 'borrowed' from other fields like Lean, Systems Theory, human psychology, etc. The principles are what the agile manifesto is about, for example.
Slowly we will distill practices -> patterns -> principles.
Note: I'm wondering whether new principles will emerge from devops itself, or whether it will be applying existing principles from a new perspective.
A few practical examples:
Below are a few example 'practices' codified in a standard template. The practices/patterns/principles are not yet very well described; the point is more that this can serve as a template to codify practices.
The idea is to list metrics/indicators that can be tracked. The numbers as such might not be too relevant, but the rate of change would be. This is similar to tracking the velocity of storypoints or tracking the mean time to recovery.
Note: I'm scared of presenting these as metrics to track; therefore I call them indicators to soften that.
Examples would be :
- Tools Layer : Deploys/Day
- Process Layer : Number of Change Requests/Day
- People Layer : People Involved per deploy
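To make the 'rate of change' idea concrete, here is a minimal Ruby sketch; the deploys/day samples are made up for illustration. It computes the relative change between consecutive samples of an indicator, which matters more than the absolute numbers.

```ruby
# Relative change between consecutive samples of an indicator.
# The absolute values matter less than the trend they show.
def rate_of_change(samples)
  samples.each_cons(2).map { |prev, cur| (cur - prev).to_f / prev }
end

deploys_per_day = [2, 3, 6, 9] # hypothetical, one sample per week
changes = rate_of_change(deploys_per_day)

changes.each_with_index do |c, i|
  puts format("week %d -> week %d: %+.0f%%", i + 1, i + 2, c * 100)
end
```

The same function works for any of the layer indicators above, e.g. change requests/day or people involved per deploy.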
This is not yet fleshed out enough; I'm guessing it will be based on the research done for my Velocity 2011 presentation (Devops Metrics).
To present progress during your 'devops' journey, you can put all these things in a nice matrix to get an overview of where you are at optimizing the different layers and areas.
Obviously this only makes sense if you don't lie to yourself, your boss, or your customers.
Project Teams, Product Teams and NOOPS
Jez Humble often talks about project teams evolving into product teams: larger silos split up, not by skill, but by the product functionality they are delivering. Splitting teams like that has the potential danger of creating new silos. It's obvious these product teams need to collaborate again. You should treat other product teams as external dependencies, just like other silos. The areas of interaction will be very similar.
You can also see the term NOOPS as working with product teams outside your company, as when you rely on SaaS for certain functions. It's important to integrate in each of the areas not only on the tools layer, but also on the people and process layers - something that is often forgotten. Automation and abstraction allow you to go faster, but when things fail, or even when changes occur, synchronisation needs to happen.
CAMS and areas
The CAMS acronym (Culture, Automation, Measurement, Sharing) could be loosely mapped onto the areas structure:
- Automation seems to map to Area1: the delivery process
- Measurement seems to map to Area2: the feedback process
- Culture to Area3 : embedded devs in Production
- Sharing to Area4: embedded ops in Projects
Of course automation, measurement, culture and sharing can happen in any of the areas, but some of the areas seem to have a stronger focus on each of these parts.
Devops areas, layers and maturity levels give us a framework to capture new practice stories, and it can be used to identify areas of improvement related to the devops field. I'd love feedback on this. If anyone wants to help, I'd like to set up a website where people can enter their stories in this structure and make it easily available for anyone to learn from. I don't have too many CPU cycles left currently, but I'm happy to get this going :)
P.S. @littleidea: I do want to avoid the FSOP Cycle
A picture tells more than a ...
Several interesting books have been written about visualization:
- Designing with Data
- Visualize this
- Information Dashboard Design - Effective Communication
- Design by Nature
- Data Visualizations
- Chapter on visualization in Big Data Glossary Book
- The Visual Display of Quantitative Information
- Envisioning Information
- Visual and Statistical Thinking
Dashboards written for specific metric tools
Graphs are Graphite's killer feature, but there's always room for improvement:
- Graphiti - https://github.com/paperlesspost/graphiti - an alternative, well-designed UI. To see it in action, watch the presentation Metrics And You
- Pencil - https://github.com/fetep/pencil
- R.I. Pienaar has created Gdash for Graphite: version control, a DSL to add graphs, easy bookmarks
- Charcoal - Simple Graphite Templates
Graphs in Opentsdb are based on Gnuplot:
- Opentsdb - Dashboard in Nodejs - https://github.com/clover/opentsdb-dashboard
- Otus - https://github.com/otus/otus - a web dashboard built on top of Hadoop/Opentsdb for monitoring a hadoop cluster
- The New Ganglia Web 2 is pretty slick!
- Visage - a web interface to collectd - RRD
- a CollectD viewer by John Bergmans using Websockets - AMQP - Collectd - realtime view: http://bergmans.com/WebSocket/collectdViewer.html
Nagios also has a way to visualize metrics in its UI.
With all these different systems creating graphs, the nice folks from Etsy have provided a way to navigate the different systems easily via their dashboard - https://github.com/etsy/dashboard
I also like the idea of embeddable graphs, as http://explainum.com implements it.
Development frameworks for visualization
Generic data visualization
- Protovis-js : http://code.google.com/p/protovis-js
- Processing-js: http://processingjs.org/
- Raphael-js: http://raphaeljs.com/
- Flare: http://flare.prefuse.org/
- Google Fusion Tables : http://www.google.com/fusiontables
- Polymaps: http://polymaps.org/ex/
- Yahoo UI elements: http://developer.yahoo.com/yui
- Gephi: http://gephi.org
- Graphviz: http://www.graphviz.org
Time related libraries
To plot things many people now use:
For timeseries/timelines these libraries are useful:
- Simile Timeline - http://www.simile-widgets.org/timeline/
- Simile Timeline in Google Charts - http://code.google.com/apis/chart/interactive/docs/gallery/annotatedtimeline.html
- Dygraphs - http://dygraphs.com/ - produces interactive, zoomable charts of time series
Annotations of events in timeseries:
On your graphs you often want events annotated. This could range from plotting new puppet runs and tracking your releases, to everything you do in the process of managing your servers. This is what John Allspaw calls Ops-Metametrics.
These events are usually marked as vertical lines.
- RRD Vertical - works for Cacti, Munin, Collectd ... - http://blog.vuksan.com/2010/06/28/overlay-deploy-timeline-on-your-ganglia-graphs
- Ganglia - Overlay Events: http://ganglia.info/?p=382
- Graphite - drawAsInfinite: http://readthedocs.org/docs/graphite/en/latest/functions.html
- Graphite - Events to facilitate this: https://github.com/agoddard/graphite-events
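To illustrate how such an annotation can be fed to Graphite, here is a minimal Ruby sketch of its plaintext protocol ("path value timestamp" over TCP port 2003); the metric path and hostname are assumptions for your setup. Sending a 1 at deploy time, combined with drawAsInfinite() at render time, gives you the vertical line.

```ruby
require "socket"

# Graphite's plaintext protocol: one "<path> <value> <timestamp>\n"
# line per datapoint, sent to the carbon daemon (default port 2003).
def graphite_line(path, value, timestamp = Time.now.to_i)
  "#{path} #{value} #{timestamp}\n"
end

# Mark a deploy; render it with drawAsInfinite(events.deploys.webapp).
line = graphite_line("events.deploys.webapp", 1)

# To actually ship it (requires a running carbon daemon):
# TCPSocket.open("graphite.example.com", 2003) { |s| s.write(line) }
```

The same one-liner protocol is what makes it trivial to annotate anything - puppet runs, releases, config changes - from a shell hook or a deploy script.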
One thing I was wondering about: with all the metrics we store in these tools, we still store the relationships between them in our heads. I searched for tools that would link metrics or describe a dependency graph between them for navigation.
We could use Depgraph - a Ruby library to create dependencies, based on graphviz - to draw a dependency tree, but we obviously first have to define it. Something similar to the Nagios dependency model (without the strict host/service relationship, of course).
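To show what defining such a dependency graph could look like, here is a minimal Ruby sketch that emits a Graphviz dot description (renderable with `dot -Tpng`); the metric names and relationships are made up for illustration.

```ruby
# Emit a Graphviz dot digraph from a list of [from, to] metric dependencies.
# The metric names below are hypothetical.
def metric_graph(deps)
  edges = deps.map { |from, to| %(  "#{from}" -> "#{to}";) }
  "digraph metrics {\n#{edges.join("\n")}\n}"
end

deps = [
  ["frontend.response_time", "app.render_time"],
  ["app.render_time", "db.query_time"],
  ["app.render_time", "cache.hit_ratio"]
]

puts metric_graph(deps)
```

The hard part, of course, is not drawing the graph but agreeing on and maintaining the dependency definitions themselves.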
While all the previously described metric systems have easy protocols, they tend to stay in sysadmin/operations land. But you should not stop there. There is a lot more to track than CPU, memory and disk metrics. This blogpost is about metrics up the stack: at the application middleware, application and user usage levels.
To the cloud
Maybe grumpy sysadmins have scared the developers and the business to the cloud. It seems that the space of application metrics, whether it's Ruby, Java or PHP, is ruled today by New Relic. In a blogpost, New Relic describes serving about 20 billion metrics a day.
- The New Relic Ruby gem https://github.com/newrelic/rpm is the official one
It allows for easy instrumentation of Ruby apps, but they also have support for PHP, Java, .NET, and Python.
Part of their secret of success is the ease with which developers can get metrics from their application by adding a few files and a token.
Several other cloud monitoring vendors are stepping into the arena, and I really hope to see them grow the space and provide some competition:
- Scout : https://scoutapp.com comes from traditional server management and is slowly moving to application metrics
- Librato : https://metrics.librato.com can leverage existing agents such as StatsD, CollectD, and JMX
- Boundary : https://boundary.com has a focus on a realtime view of metrics
- DataDog : http://www.datadoghq.com goes for a complete overview of all your metrics
Some other complementary services, popular amongst developers, are:
- Get Exceptional: http://www.getexceptional.com/ tracks errors in web apps. It reports them in real-time and gathers the info you need to fix them fast.
- Airbrake: http://airbrake.io collects errors generated by other applications and aggregates the results for review.
- Alert Grid: http://alert-grid.com/ a workflow system, ala Yahoo Pipes, for notifications
- Proby: http://probyapp.com/ Cron monitoring made simple
- Pingdom: http://www.pingdom.com/ Uptime and performance monitoring made easy
- Pagerduty: http://www.pagerduty.com/ Alerting that can be easily hooked into your existing monitoring solution
Check this blogpost on monitoring and reporting with Signal, Pingdom, Proby, Graphite, Monit, Pagerduty and Airbrake to see how they make a powerful team.
User tracking Metrics - Cloud
Clicks, page views, etc ...
Besides the application metrics, there is one other major player in web metrics: Google Analytics.
I found several tools to get data out of it using the Google Analytics API:
- Garb - A ruby wrapper for the google analytics API: https://github.com/vigetlabs/garb
- Gem for talking to Google Analytics API: https://github.com/rumble/gattica
- Google Analytics with Ruby and Garb: http://www.viget.com/extend/google-analytics-api-with-ruby-and-garb-making-it-even-easier
- More recent: Google Analytics Data Export API with Ruby + Gattica: http://www.seerinteractive.com/blog/google-analytics-data-export-api-with-rubygattica/2011/02/22
- Gattica - More recent fork by Deviantech, with goals and segment support: https://github.com/chrisle/gattica
With Google Analytics there is always a delay in getting your data.
If you want realtime statistics/metrics, check out Gaug.es http://get.gaug.es :
- Use the Gauges gem: https://github.com/orderedlistinc/gauges-gem to import/export data
I haven't really gotten into this yet, but getting metrics out of A/B testing is well worth exploring:
- Optimizely: http://www.optimizely.com
- Visual Website Optimizer: http://visualwebsiteoptimizer.com
- Google Website Optimizer: http://www.google.com/websiteoptimizer
Page render time
Another important thing to track is the page render time. This is well explained in the Real User Monitoring chapter (Chapter 10) of Complete Web Monitoring - O'Reilly Media.
Again, New Relic provides RUM: Real User Monitoring. See How we provide real user monitoring: A quick technical review for more technical info.
- Keynote: monitoring like a real user experiences your website - http://www.keynote.com
- Real User Monitoring New Relic - http://newrelic.com/rum
- Tracking metrics - Velvet Metrics: http://www.velvetmetrics.com
Who needs a cloud anyway
Putting your metrics into the cloud can be very convenient, but it has downsides:
- most tools don't have a way to redirect/replicate the metrics they collect internally
- that makes it hard to correlate them with your internal metrics
- it's easy to get metrics in, but hard to get the full/raw data out again
- it depends on the internet, duh, and sometimes this fails :)
- privacy concerns or the sheer volume of metrics can make it impossible to put them in the cloud
Application Metrics - Non - Cloud
In his epic talk Metrics, Metrics Everywhere, Coda Hale explains the importance of instrumenting your code with metrics. This looks very promising, as it is really driven from the developers' world:
- Coda Hale Metrics: https://github.com/codahale/metrics allows capturing JVM- and application-level metrics.
- Simon Java - Simple Monitoring API: http://code.google.com/p/javasimon
- Stajistics - a free monitoring and runtime performance statistics collection API for Java: https://code.google.com/p/stajistics
- Parfait - Java performance framework: https://code.google.com/p/parfait
Or you can always use JMX to get monitoring/metrics out of your application:
- JMX - JRuby JMX: https://github.com/enebo/jmxjr allows you to access the MBeans as a ruby class
- An example using JMX and JRuby: https://github.com/nicksieger/advent-jruby
And with jmxtrans http://code.google.com/p/jmxtrans you can feed JMX information into Graphite, Ganglia or Cacti/Rrdtool.
- Ruby Metrics equivalent by John Ewart: https://github.com/johnewart/ruby-metrics
- Ruby - FnordMetric https://github.com/paulasmuth/fnordmetric is a highly configurable (and pretty fast) realtime app/event tracking thing based on ruby eventmachine and redis. You define your own plotting and counting functions as ruby blocks!
- Pinba - monitoring PHP processing using timers - http://pinba.org/wiki/Main_Page
- A good overview post about collecting application metrics in Java
- An open source New Relic clone: https://github.com/devmen/FreeRelic
Etsy style: StatsD
To collect various metrics, Etsy has created StatsD https://github.com/etsy/statsd - a network daemon for aggregating statistics (counters and timers), rolling them up, and then sending them to Graphite.
Clients have been written in many languages: PHP, Java, Ruby, etc.
- Ruby gems for registering metrics with Statsd - http://rubydoc.info/gems/fozzie
- A Statsd Server in Ruby - https://github.com/fetep/ruby-statsd
- A Statsd Client in Ruby - https://github.com/github/statsd-ruby
- Another Statsd Client in Ruby - http://github.com/bvandenbos/statsd-client
- A Statsd client that isn't a direct port- https://github.com/reinh/statsd
- Statsd instrumentation via metaprogramming methods in Ruby - https://github.com/shopify/statsd-instrument
It's incredible to see the power and simplicity of this. I've created a simple proof of concept to extract the statsd metrics over ZeroMQ in this experimental fork.
MetricsD https://github.com/tritonrc/metricsd tries to marry Etsy's statsd and the Coda Hale / Yammer Metrics library for the JVM, and puts the data into Graphite. It should be drop-in compatible with Etsy's statsd, although it adds explicit support for meters (with the m type) and gauges (with the g type) and introduces the h (histogram) type as an alias for timers (ms).
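The statsd wire format itself illustrates that simplicity: a packet is just "bucket:value|type" sent over UDP, with "c" for counters and "ms" for timers. Here is a minimal Ruby sketch of a client; the host, port default (8125) and bucket names are assumptions for your setup.

```ruby
require "socket"

# A tiny statsd client: builds "bucket:value|type" packets and
# fires them over UDP. UDP is fire-and-forget, so instrumented
# code pays almost nothing even when no statsd daemon is listening.
class TinyStatsd
  def initialize(host = "localhost", port = 8125)
    @host, @port = host, port
    @socket = UDPSocket.new
  end

  def packet(bucket, value, type)
    "#{bucket}:#{value}|#{type}"
  end

  def increment(bucket)
    send_packet(packet(bucket, 1, "c"))    # counter
  end

  def timing(bucket, ms)
    send_packet(packet(bucket, ms, "ms"))  # timer in milliseconds
  end

  private

  def send_packet(data)
    @socket.send(data, 0, @host, @port)
  end
end

stats = TinyStatsd.new
stats.increment("deploys.webapp")
stats.timing("deploy.duration", 250)
```

Statsd then aggregates these raw packets per flush interval before forwarding them to Graphite, which is why instrumenting hot code paths this way stays cheap.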
User tracking - Non-Cloud
Clicks, page views, etc ...
Here are some open source web analytics libraries. These are merely links; I haven't investigated them enough yet - work in progress:
- Open Web Analytics
- Grape Web Statistics
- Ruwa - Ruby on Rails Web Analytics
- Riopro/piwik - ruby gem
- JKraemer piwik-tracker
- Autometal - Piwik ruby gem
- Awstats Reader - in Python
Another tool worth mentioning for tracking end users is HummingBird - http://hummingbirdstats.com/ . It is NodeJS-based and allows for realtime web traffic visualization. To send metrics it has a very simple UDP protocol.
He pointed out several A/B testing frameworks:
- ABingo : Rails A/B Testing - http://www.bingocardcreator.com/abingo
- Seven Minute Abs: Rails A/B Testing - https://github.com/paulmars/seven_minute_abs
- Vanity: Experiment Driven Development - http://vanity.labnotes.org/
And presented his own A/B Testing framework: Split - http://github.com/andrew/split
It would be interesting to integrate this further into traditional monitoring/metrics tools: view metrics per new version, per enabled flag, etc. In a nutshell: food for thought.
Page render time
For checking the page render time, I could not really find open source alternatives.
It's exciting to see the crossover between development, operations and business. Up until now, only New Relic has a well-integrated suite for all metrics. I hope the internal solutions catch up.
Now that we have all that data, it's time to talk about dashboards and visualization. On to the next blogpost.
If you are using other tools, or have ideas, feel free to add them in the comments.
Whenever people hear a new theory or idea (like devops), they ask for proof before they engage. This is only natural, I guess. This was the reason why I wanted to explore ways to measure devops success, or rephrased: "measuring the devops gap". The results of my findings were presented at VelocityConf in June. While I initially planned to present with Israel Gat, due to circumstances he switched at the last responsible moment with Andrew Shafer.
In the talk we used the metaphor of monitoring to explore the metrics. This probably put a lot of people on the wrong track, expecting a lot of real/ganglia metrics. The main point is that the higher-level your monitoring, the more interesting it gets: measuring the end-user perspective, in both technical and human monitoring, gives you the most value. Monitoring a server is like monitoring an individual person: good to know, but it doesn't tell you anything about the end result. I know this is meta-stuff, so you need a clear head to understand it.
One of the nicest findings during the research was that whenever people say collaboration (in general) will improve things, there is a high demand for proof. Hence the phrase: "Collaboration is like a black hole: you can only measure it by looking at its effects".
Also, it makes no sense to increase the number of interactions between different groups (dev, ops, qa, mgt) just for the sake of increasing it: "More interaction doesn't mean a better party". You should work on the quality of the interactions.
A third take-away from the research was that the design world, too, is exploring ideas outside the usual user-centered design and going the way of participatory design. This fits directly with the devops idea: you automate and do all the stuff you need to, to free up more time for design. And there you can collaborate with your colleagues, your peers from other groups and even with your end users to test out new ideas. In the past, design used to be a collective and shared ability; only in recent years has this craft become an individual thing.
Enjoy the presentation :