
Archive: January, 2012

Puppet vs. Chef – The Devops Deathmatch

LAAAAAAADIESS ANNNNDDD GENNNELMEN…… LET’S GET READYYYYYYY TO RUMMMMMBBBBBLLLLLEEEEEEEE!!!!!!!!! Ok, enough silliness. For the next few sentences anyway… I’ve been using Puppet to manage systems for the last four years (at least!), however a new contract has meant I’ve needed to learn Chef. A few months ago I was looking for a blog post on the […]

Best and Worst Hacker Movies Ever

Whilst at the Puppet Triage-a-thon today we put on a few films that featured hacking and hackers. This prompted a discussion of the worst hacker movies of all time and their appalling representations of technology. I present our brief list. Note SOME of these movies were good despite the appalling use of technology … others not so much.

Hackers

What a cast of off-beat and improbably handsome computer hackers: Crash Override AKA Zero Cool, Acid Burn, Lord Nikon (token black hacker!), The Phantom Phreak (token Puerto Rican hacker!), Cereal Killer and of course my personal favourite: Mr The Plague. In the classic “Damn you meddling kids” scenario famous since Enid Blyton and Scooby Doo, our youthful hackers foil the evil plans of evil hacker The Plague to steal money and sink oil tankers?!? Add, and stir rapidly, a bumbling Secret Service agent, a lame romantic sub-plot, hip young folk on roller blades and a club scene soundtrack. Instant hot geek.

The Good: Social engineering at the start of the movie as Crash Override hacks the TV Network.

The Bad: The typically ridiculous depiction of the actual hacking. The appalling clothes. Matthew Lillard.

Swordfish

Oh my … where to start with Swordfish? We have the classic hand-wavey construction of a virus to hack 1024-bit DNA Binary SSL HiTek Security TM. We have “Axl” Torvalds - famous and bloodily assassinated Finnish hacker. And of course we have Hugh Jackman’s improbable hacker penetrating a government system in 60 seconds with a gun held to his head whilst getting blown. Oh please.

The Good: Ummm… Halle Berry?

The Bad: Pretty much everything technical AND non-technical in the film.

P.S. None of which compares to John Travolta’s massive piece of over-acting in a performance that would be truly hideous had Travolta not made Battlefield Earth.

Firewall

Another “blackmailed hero forced to do wrong” story has Harrison Ford as a corporate security expert forced to rob his own employer. Mayhem ensues: family kidnapped, the lamest bank security ever, and an incredibly tenuous connection between the title of the movie and the plot.

The Good: Using his daughter’s iPod as a USB drive…

The Bad: Using his daughter’s iPod as a USB drive… Oh wait. Completely unrealistic security controls. Also firing his secretary on the spot? Yeah, the scriptwriters have never worked in a bank.

WarGames

A rogue AI, a war-dialing hacker, cute teen romance, nukes and Cheyenne Mountain? What’s not to like about this movie? A computer hacker breaks into what he thinks is a game company but instead finds a back door into a top-secret NORAD system. From there, whilst trying to play a war game, he accidentally triggers the countdown to World War III and nuclear war. He then rushes to try to stop the AI launching its missiles and “winning” the game.

The Good: An actually reasonable depiction of social engineering, war dialing, and how a hacker compromised a system back in the day. Except the AI of course. Everyone knows the US government doesn’t have an AI running war games at NORAD…

The Bad: Actually there’s not much to hate about this film. I mean the AI theory (tic-tac-toe as learning tool) is kinda lame, but generally it’s good fun AND it has Ally Sheedy.

P.S. We don’t discuss the sequel - WarGames: The Dead Code. IT DID NOT HAPPEN. DID NOT HAPPEN.

Sneakers

What is it with buttons/ciphers/boxes that can magically hack through all encryption instantly? Yep, here’s another. This time, in a unique twist, a mathematical genius comes up with a box that allows all encryption to be overcome. Naturally it also magically interfaces automatically with every system ever developed.

Throw in a quirky team of hackers led by on-the-run super-hacker Robert Redford (BTW no one background checked this guy for 30 years? Really? Whatever…). The team’s line-up even sets the pattern for a number of future films: young kid looking to prove himself, cynical older operative, brilliant but disabled tech, conspiracy-obsessed oddball, etc, etc.

Sneakers though does have some street cred. Despite the magic cipher box, they use some nifty social engineering and much of the ‘tiger team’ hacking they perform at the start of the movie is reasonably feasible. They even hearken back to the glory days of phone phreaking with a Cap’n Crunch reference.

The Good: Cosmo on InfoWar - “There’s a war out there, old friend. A world war. And it’s not about who’s got the most bullets. It’s about who controls the information. What we see and hear, how we work, what we think… it’s all about the information!”

AntiTrust

Open source. Obviously most people know us because of the beaucoup bucks we make as rock-star developers but many people don’t realize it’s also mandatory to have our model-like good looks. This movie thankfully captures that and the purity of our code. Like some others in the list it’s not a true hacker film, more a glorified action movie/thriller, but some code is actually cut and they do introduce open source software.

The Good: Hmmm. Ummm. “Look! Over there! A large software company that really isn’t Microsoft… honest.” Probably vaguely raised public awareness that open source code exists.

The Bad: That anyone got paid for this film.

The Net

Reclusive but beautiful developer (giggles) discovers secret Internet conspiracy and is forced to go on the run. Against all odds do you think she’ll win through? Yeah I totally didn’t either. So apparently every website has a magic link (and you can bet about a thousand idiots added that icon to their sites right after the movie came out) that allows you to take control of it. Despite the magical decryption link this isn’t actually a hacking movie. The main character cuts code and the evil terrorists use the Internet to conduct their foul deeds but it’s largely a rather poor action movie.

The Good: Nothing.

The Bad: A little icon on every web page that unlocks the secrets of the Interwebs…

Crowbar is quietly getting more interesting

Crowbar is an interesting project that I’ve covered before. Born out of Dell’s cloud group, much of the initial buzz described it as an installer for the cloud era… “kickstart on steroids”, if you will.

Crowbar’s close association with the OpenStack project has further cemented its reputation as an installer to watch. But it’s Crowbar’s quiet potential as a stack management tool that is most interesting. Through the use of barclamps (Crowbar’s modules) you can tell Crowbar to build a full stack, from the BIOS config all the way up to your middleware and applications. John Willis, on an episode of DevOps Cafe, called it “Data Center as Code”.

Crowbar barclamps are also an interesting way for independent projects or vendors to ensure that their projects/products can be easily integrated into a custom platform (today this type of focus is usually in the context of making things work on OpenStack). Want to add a new component to your platform? Grab the barclamp and Crowbar will know how to do the rest. Or at least that is the promise. The project is still young and the community is still forming.

Leading open source software projects is new territory for Dell as a company, but the Crowbar team does seem committed and community-focused. I’ve heard some grumbles from developers that barclamp development and testing cycles can be a bit tedious due to the nature of what you are building. But there’s no reason to believe that those types of issues won’t get sorted out over time.

A couple of Crowbar-related videos are below:

The first video was made by my DTO Solutions colleague, Keith Hudgins, after he wrote a barclamp for Zenoss. It’s a short demo and tour that can give you a feel for Crowbar and Barclamps.

The next video is Barton George (Dell) interviewing Rob Hirschfeld (Dell). They start off talking about the Hadoop barclamp but quickly get into a broader discussion about Crowbar.


Devops, a Wicked Problem

One of the strongest pillars of devops (if not the strongest) is collaboration/communication. For my Devops Metrics talk at Velocity 2011 I researched how to prove that collaboration is a good thing: when discussing devops with people, some come to believe that it makes sense to collaborate more, while others think all this collaboration is overkill. Around that time I came across Design Thinking and read how it evolved from one person doing the design, to listening to user requirements, to participatory design. In the book Design Thinking: Understanding How Designers Think, Nigel Cross writes that design used to be a collaborative thing (like guilds trying to push their craft forward).

Symmetry of Ignorance

One of the concepts introduced was the symmetry of ignorance (PDF):

Complex design problems require more knowledge than any one single person can possess, and the knowledge relevant to a problem is often distributed and controversial. Rather than being a limiting factor, “symmetry of ignorance” can provide the foundation for social creativity. Bringing different points of view together and trying to create a shared understanding among all stakeholders can lead to new insights, new ideas, and new artifacts. Social creativity can be supported by new media that allow owners of problems to contribute to framing and solving these problems. These new media need to be designed from a meta-design perspective by creating environments in which stakeholders can act as designers and be more than consumers.

This sounds like systems thinking, and it reminded me of the knowledge divide within the devops problem space. When you spend time with each group/silo individually, they tend to think themselves superior to the other group: "ha, those devs don't know anything about the systems; ha, those ops don't know anything about coding". So it often seems more like a symmetry of arrogance. That arrogance symmetry reminded me of the saying "We judge others by their behavior, we judge ourselves by our intentions". We might think we know more or can do better, but that is often not visible in our actions.

This got me intrigued, and I wanted to explore the subject more for the next Cutter Summit 2012.

Wicked Problem

Part of design thinking, and this symmetry of ignorance, relates to the concept of wicked problems.

Rittel and Webber's (1973) formulation of wicked problems specifies ten characteristics:

  1. There is no definitive formulation of a wicked problem (defining wicked problems is itself a wicked problem).
  2. Wicked problems have no stopping rule.
  3. Solutions to wicked problems are not true-or-false, but better or worse.
  4. There is no immediate and no ultimate test of a solution to a wicked problem.
  5. Every solution to a wicked problem is a "one-shot operation"; because there is no opportunity to learn by trial and error, every attempt counts significantly.
  6. Wicked problems do not have an enumerable (or an exhaustively describable) set of potential solutions, nor is there a well-described set of permissible operations that may be incorporated into the plan.
  7. Every wicked problem is essentially unique.
  8. Every wicked problem can be considered to be a symptom of another problem.
  9. The existence of a discrepancy representing a wicked problem can be explained in numerous ways. The choice of explanation determines the nature of the problem's resolution.
  10. The planner has no right to be wrong (planners are liable for the consequences of the actions they generate).

I'll let you judge whether you think devops (or even "monitoring sucks" :)) is a wicked problem.

More readings to explore:

Cynefin

The whole discussion on what is or isn't a wicked problem reminded me of a talk by Dave Snowden. He helped create the Cynefin model.

The Cynefin framework has five domains. The first four domains are:

  1. Simple, in which the relationship between cause and effect is obvious to all, the approach is to Sense - Categorise - Respond and we can apply best practice.
  2. Complicated, in which the relationship between cause and effect requires analysis or some other form of investigation and/or the application of expert knowledge, the approach is to Sense - Analyze - Respond and we can apply good practice.
  3. Complex, in which the relationship between cause and effect can only be perceived in retrospect, but not in advance, the approach is to Probe - Sense - Respond and we can sense emergent practice.
  4. Chaotic, in which there is no relationship between cause and effect at systems level, the approach is to Act - Sense - Respond and we can discover novel practice.
  5. Disorder

Note this is a sense-making framework, not a categorisation framework: it's not always exact which of the spaces your problems fall into, but it gets you thinking about which solutions to apply to which problems. And it fits in nicely with other frameworks, as explained in A Tour of Adoption and Transformation models.

So devops, in my opinion, falls into the complex problem space.

A great video explaining it was recorded at ALE 2011:

He explains many things, but here are a few that resonated with me:

  • why in some problem spaces there is no best practice, only good practice
  • we have to create safe-fail environments
  • providing a solution to complex problems can be done by probing
  • the human factor makes the difference / we are not machines (automation)
  • the solution often looks easy once you have solved it, but you need to go through the process of discovery.

That last point reminded me of the Debt Metaphor by Ward Cunningham. @littleidea explained that Ward used a different concept of technical debt than most people use: he explains technical debt as the difference between the implementation and the ideal implementation in hindsight. Not because of bad implementation, or deliberate shortcuts, but because of new insights gathered during the discovery/problem-solving process.

More research can be found at:

The fact that problems don't always stay in, or neatly match, one of the locations on the diagram is nicely visualized by adding dimensions to the diagram (a thing that got lost in the initial publication).

To tackle complex problems he suggests using three principles of complexity-based management:

  1. Use fine grained objects: avoid "chunking"
  2. Distributed Cognition: the wisdom but not the foolishness of crowds
  3. Disintermediation: connecting decision makers with raw data

This could result in the Resilient Organisation.

Resilience engineering

Because in complex systems it's hard to predict the exact behavior, Dave Snowden also talks about going From Robustness to Resilience. It sounded a lot like the difference between MTBF and MTTR that John Allspaw explains in Outages, Post-Mortems and Human Error 101.

I had come across those articles before, but never put them in the light of the Snowden perspective. More to explore, then.

Silos and Resilience

The final document I'd like to highlight is about Reducing the impact of Organisational Silos on Resilience.

Stone quotes five questions suggested by Angela Drummond (a practitioner in the area of silo breaking and organisational change) to help executives identify and overcome silos.

  • “Does your organisation structure promote collaboration, or do silos exist?”
  • “Do you have collaboration in your culture and as part of your value system?”
  • “Do you have the IT infrastructure for effective collaboration?”
  • “Do you believe in collaboration? Do you model that belief?”
  • “Do you have a reward system for collaboration?”

Quoting from the article:

Resilience cannot be achieved in isolation of other units and organisations. In summary, there is a need to recognise:

  • the characteristics of silo formation, particularly in the creation of new organisational structures or as part of change management processes
  • a convergence of interests, taking account of the fact that “we are all in this together”. Efforts are needed to achieve seamless internal relationships at the intraorganisational level and a commitment to work with others to advance community resilience (perhaps with a judicious contribution from government) at the broader societal level
  • the case for collaboration. Gains are often possible by pooling ideas and resources (the total is greater than the sum of the parts)
  • the value of harnessing grass-root capability including through continuous knowledge-building and sharing learnings in a trusted environment
  • that cost-effectiveness calculations don’t easily take account of broad organisational or social needs and that the analysis may need supplementation if wide objectives are to be met

Leadership is the key to bringing these elements together. Leadership is needed to reduce and mitigate risks before crises occur.

It was fascinating to read that collaboration and resilience go hand in hand; breaking the silos is really a must there, and requires collaboration. The inter-company silo also fits in nicely with The Agile Executive - A new Context for Agile presentation on how we have come to rely on external services in a SaaS model: that will be another silo to tackle.

Final note

This is all research in progress, but it's exciting to see a lot of different concepts fit together nicely. I apologize that this isn't yet a completely polished train of thought, but it might be useful if you want to explore the subject further.

Convincing management that cooperation and collaboration was worth it

While searching around for something else, I came across this note I sent in late 2009 to the executive leadership of Yahoo’s Engineering organization. This was when I was leaving Flickr to work at Etsy. My intent in sending it was to be open with the rest of Yahoo about how things worked at Flickr, and why. I did this in the hope that other Yahoo properties could learn from that team’s process and culture, which we worked really hard at building and keeping.

The idea that Development and Operations could:

  • Share responsibility/accountability for availability and performance
  • Have an equal seat at the table when it came to application and infrastructure design, architecture, and emergency response
  • Build and maintain a deferential culture to each other when it came to domain expertise
  • Cultivate equanimity when it came to emergency response and post-mortem meetings

…wasn’t evenly distributed across other Yahoo properties, from my limited perspective.

But I knew (still know) lots of incredible engineers at Yahoo that weren’t being supported as they could be by their upper management. So sending this letter was driven by wanting to help their situation. Don’t get me wrong, not everything was rainbows and flowers at Flickr, but we certainly had a lot more of them than other Yahoo groups.

When I re-read this, I’m reminded that when I came to Etsy, I wasn’t entirely sure that any of these approaches would work in the Etsy Engineering environment. The engineering staff at Etsy was a lot larger than Flickr’s and continuous deployment was in its infancy when I got there. I can now happily report that 2 years later, these concepts not only solidified at Etsy, they evolved to accommodate a lot more than what challenged us at Flickr. I couldn’t be happier about how it’s turned out.

I’ll note that there’s nothing groundbreaking in this note I sent, and nothing that I hadn’t said publicly in a presentation or two around the same time.

This is the note I sent to the three layers of management above me in my org at Yahoo:

Subject: Why Flickr went from 73rd most popular Y! property in 2005 to the 6th, 5 years later.

Below are my thoughts about some of the reasons why Flickr has had success, from an Operations Engineering manager’s point of view.

When I say everyone below, I mean all of the groups and sub-groups within the Flickr property: Product, Customer Care, Development, Service Engineering, Abuse and Advocacy, Design, and Community Management.

Here are at least some of the reasons we had success:

    • Product included and respected everyone’s thoughts, in almost every feature and choice.
    • Everyone owned availability of the site, not just Ops.
    • Community management and customer service were involved early and often. In everything. If they weren’t, it was an oversight taken seriously, and would be fixed.
    • Development and Operations had zero divide when it came to availability and performance. No, really. They worked in concert, involving each other in their own affairs when it mattered, and trusting each other every step of the way. This culture was taught, not born.
    • I have never viewed Flickr Operations as firefighters, and have never considered Flickr Dev Engineering to be arsonists. (I have heard this analogy elsewhere in Yahoo.) The two teams are 100% equal partners, with absolute transparency. If anything, we had a problem with too much deference given between the two teams.
    • The site was able to evolve, change, and grow as fast as needed to be as long as it was made safe to do so. To be specific: code and config deploys. When it wasn’t safe, we slowed, and everyone was fine with that happening, knowing that the goal was to return to fast-as-we-need-to-be. See above about everyone owning availability.
    • Developers were able to see their work almost instantly in production. Institutionalized fear of degradation and outage ensured that changes were as safe as they needed to be. Developers and Ops engineers knew intuitively that the safety net you have is the one that you have built for yourself. When changes are small and frequent, the causes of degradation or outage due to code deploys are exceptionally transparent to all involved. (Re-read above about everyone owning availability.)
    • We never deployed “early and often” because:
      • it was a trend,
      • we wanted to brag,
      • or we thought we were better than anyone. (We did it because it was right for Flickr to do so.)
    • Everyone was made aware of any launches that had risks associated with them, and we worked on lists of things that could possibly go wrong, and what we would do in the event they did go wrong. Sometimes we missed things, and we had to think quickly, but those times were rare with new feature launches.
    • Flickr Ops had always had the “go or no-go” decision, as did other groups who could vote with respect to their preparedness. A significant part of my job was working towards saying “go”, not “no-go”. In fact, almost all of it.

Examples: the most boring (anti-climactic, from an operational perspective) launches ever

    • Flickr Video: I actually held the launch back by some hours until we could rectify a networking issue that I thought posed a risk to post-launch traffic. Other than that, it was a switch in the application that was turned from off to on. The feature’s code had been on prod servers for months in beta. See ‘dark launch’
    • Homepage redesign: Unprecedented amount of activity data being pulled onto the logged-in homepage, order of magnitude increase in the number of calls to backend databases. Why was it boring? Because it was dark launched 10 days earlier. The actual launch was a flip of the ‘on’ switch
    • People In Photos (aka, ‘people tagging’): Because the feature required data that we didn’t actually have yet, we couldn’t exactly dark launch it. It was a feature that had to be turned on, or off. Because of this, Flickr’s Architect wrote out a list of all of the parts of the feature that could cause load-related issues, what the likelihood of each was, how to turn those parts of the feature off, what customer care effect it might have, and what contingencies would probably require some community management involvement.

Dark Launches

When we already have the data on the backend needed to display for a new feature, we would ‘dark launch’, meaning that the code would make all of the back-end calls (i.e. the calls that bring load-related risk to the deploy) and simply throw the data away, not showing it to the user. We could then increase or decrease the percentage of traffic who made those calls in safety, since we never risked the user experience by showing them a new feature and then having to take it away because of load issues.

This increases everyone’s confidence almost to the point of apathy, as far as fear of load-related issues is concerned. I have no idea how many code deploys were made to production on any given day in the past 5 years (although I could find it on a graph easily), because for the most part I don’t care: those changes made in production have such a low chance of causing issues. When they have caused issues, everyone on the Flickr staff can find on a webpage when the change was made, who made the change, and exactly (line-by-line) what the change was.

In the case where we had confidence in the resource consumption of a feature, but not 100% confidence in functionality, the feature was turned on for staff only. I’d say that about 95% of the features we launched in those 5 years were turned on for staff long before they were turned on for the entire Flickr population. When we still didn’t feel 100% confident, we ramped up the percentage of Flickr members who could see and use the new feature slowly.
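To make the mechanics concrete, here is a hedged sketch in Ruby of that gating logic, combining the dark-launch discard with the percentage ramp. Flickr’s actual implementation isn’t shown in this letter; the helper names (feature_enabled?, fetch_activity_data, render_activity) and the percentage are invented for illustration.

DARK_LAUNCH_PERCENT = 25  # portion of real traffic that exercises the backend

# Deterministic per-user bucketing keeps each user in the same sample.
def dark_launched?(user)
  (user.id % 100) < DARK_LAUNCH_PERCENT
end

def render_homepage(user)
  if feature_enabled?(:activity_feed)        # the real launch: flip the switch
    render_activity(fetch_activity_data(user))
  elsif dark_launched?(user)
    fetch_activity_data(user)                # generate the backend load...
    # ...and throw the result away; the user never sees the feature
  end
end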

Config Flags

We have many pieces of Flickr that are encapsulated as ‘feature’ flags, which look as simple as: $cfg[disable_feature_video] = 0; This allows the site to be much more resilient to specific failures. If we have any degradation within a certain feature, we can simply turn that feature off in many cases, instead of taking the entire site down. These ‘flags’ have, in the past, been prioritized in conversations with Product, so there is an easy choice to make if something goes wrong and site uptime becomes opposed to feature uptime.

This is an extremely important point: Dark Launches and Config Flags were concepts and tools created by Flickr Development, not Flickr Operations, even though the end result of each points toward a typical Operations goal: stability and availability. This is a key distinction. These are initiatives made by Engineering leadership because devs feel protective of the availability of the site and respectful of Operations responsibilities, and because it’s just plain good engineering.

If the Flickr Operations had built these tools and approaches to keeping the site stable, I do not believe we would have the same amount of success.

There is more on this topic here: http://code.flickr.com/blog/2009/12/02/flipping-out/

Summary

Flickr Operations is in an enviable position in that they don’t have to convince anyone in the Flickr property that:

      1. Operations has ‘go or no-go’ decision-making power, along with every other subgroup.
      2. Spending time, effort, and money to ensure stable feature launches before they launch is the rule, not the exception.
      3. Continuous Deployment is better for the availability of the site.
      4. Flickr Operations should be involved as early as possible in the development phase of any project.

These things are taken for granted. Any other way would simply feel weird.

I have no idea if posting this letter helps anyone other than myself, but there you go.

Puppet and Flowdock

I’ve written a Puppet report processor that allows you to notify Flowdock - a nifty tool for team collaboration - of failed Puppet runs.

It requires the flowdock gem to be installed on your Puppet master:

$ sudo gem install flowdock

You can then install puppet-flowdock as a module in your Puppet master’s modulepath. Now update the flowdock_api_key variable in /etc/puppet/flowdock.yaml with your Flowdock API key.

Then enable pluginsync and reports on your master and clients in puppet.conf, including specifying the flowdock report processor:

[master]
report = true
reports = flowdock
pluginsync = true
[agent]
report = true
pluginsync = true

Finally, run the Puppet client to sync the report processor as a plugin, and hey presto, you’re logging failures to Flowdock.
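For the curious, the core of a report processor like this is small. Below is a simplified sketch of the shape it takes; the actual puppet-flowdock code may differ, the Flowdock calls follow the flowdock gem’s documented API, and the from address is invented.

# lib/puppet/reports/flowdock.rb (simplified sketch)
require 'puppet'
require 'yaml'
require 'flowdock'

Puppet::Reports.register_report(:flowdock) do
  desc "Notify a Flowdock team inbox of failed Puppet runs"

  def process
    # Only report failed runs; do nothing for changed/unchanged.
    return unless self.status == 'failed'

    config = YAML.load_file('/etc/puppet/flowdock.yaml')
    flow = Flowdock::Flow.new(
      :api_token => config['flowdock_api_key'],
      :source    => 'puppet',
      :from      => { :name => 'Puppet', :address => 'puppet@example.com' }
    )
    flow.push_to_team_inbox(
      :subject => "Puppet run for #{self.host} failed",
      :content => "Puppet failed on #{self.host} at #{Time.now}"
    )
  end
end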

Monitoring Wonderland Survey – Visualization

A picture tells more than a ...

Now that you've collected all the metrics you wanted (or even more), it's time to make them useful by visualizing them. Every self-respecting metrics tool provides a visualization of the data collected. Older tools tended to revolve around creating RRD graphics from the data. Newer applications leverage JavaScript or Flash frameworks to have the data updated in realtime and rendered by the browser. People are exploring new ways of visualizing large amounts of data efficiently. Good examples are Visualizing Device Utilization by Brendan Gregg, or the Multi User - Realtime heatmap using Nodejs.

Several interesting books have been written about visualization:

Dashboards written for specific metric tools

Graphite

Graphs are Graphite's killer feature, but there's always room for improvement:

Grockets - Realtime streaming graphite data via socket.io and node.js
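Part of what makes Graphite graphs so hackable is that every graph is just a URL, and the render endpoint can return raw JSON instead of a PNG for client-side rendering. A quick Ruby sketch (host and metric name are made up):

require 'net/http'
require 'json'
require 'uri'

# Ask Graphite for the last hour of a metric as JSON instead of an image.
uri = URI("http://graphite.example.com/render?target=servers.web01.loadavg&from=-1h&format=json")
series = JSON.parse(Net::HTTP.get(uri))

# The response is a list of {"target" => ..., "datapoints" => [[value, timestamp], ...]}
series.first["datapoints"].each do |value, ts|
  puts "#{Time.at(ts)} #{value}"
end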

OpenTSDB

Graphs in OpenTSDB are based on Gnuplot.

Ganglia

Collectd

Nagios

Nagios also has a way to visualize metrics in its UI.

Overall integration

With all these different systems creating graphs, the nice folks from Etsy have provided a way to navigate the different systems easily via their dashboard - https://github.com/etsy/dashboard

I also like the idea of Embeddable Graphs as http://explainum.com implements it.

Development frameworks for visualization

Generic data visualization

There are many JavaScript graphing libraries. Depending on how you need to visualize things, they provide you with different options. This first list is of more generic graphics libraries:

Time related libraries

To plot things many people now use:

For timeseries/timelines these libraries are useful:

And why not have JavaScript generate/read some RRD graphs:

Annotations of events in timeseries:

On your graphs you often want events annotated. These could range from plotting new Puppet runs, to tracking your releases, to everything you do in the process of managing your servers. This is what John Allspaw calls Ops Meta-Metrics.

These events are usually marked as vertical lines.
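One low-tech way to get those vertical lines in Graphite is to push a datapoint onto an 'events' series whenever you deploy, and render it with the drawAsInfinite() function. A Ruby sketch (hostname and metric name invented):

require 'socket'

# At deploy time, record the event as a regular Graphite datapoint.
sock = TCPSocket.new('graphite.example.com', 2003)
sock.puts("events.deploys.app 1 #{Time.now.to_i}")
sock.close

# Adding target=drawAsInfinite(events.deploys.app) to a graph URL then
# draws every recorded deploy as a vertical line across the graph.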

Dependencies graphs

One thing I kept wondering about: with all the metrics we store in these tools, we still store the relationships between them in our heads. I looked for tools that would link metrics or describe a dependency graph between them for navigation.

We could use Depgraph - a Ruby library to create dependency graphs, based on graphviz - to draw a dependency tree, but we obviously first have to define that tree. Something similar to the Nagios dependency model (without the strict host/service relationship, of course), as sketched below.
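If we did define it, the definition could be as small as a handful of edges. A hypothetical sketch with the ruby-graphviz gem, rather than Depgraph itself (metric names invented):

require 'graphviz'  # gem install ruby-graphviz

g   = GraphViz.new(:metric_deps, :type => :digraph)
db  = g.add_nodes('db.query_time')
app = g.add_nodes('app.response_time')
usr = g.add_nodes('user.checkouts')

# "db.query_time influences app.response_time influences user.checkouts"
g.add_edges(db, app)
g.add_edges(app, usr)

g.output(:png => 'metric-deps.png')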

Conclusion

With all the libraries to get data in and out, and the power of JavaScript graphing libraries, we should be able to create awesome visualizations of our metrics. This inspired me and @lusis to start thinking about creating a book on metrics/monitoring graphing patterns. Who knows ...

Monitoring Wonderland Survey – Moving up the stack Application and User metrics

While all the previously described metric systems have easy protocols, they tend to stay in sysadmin/operations land. But you should not stop there. There is a lot more to track than CPU, memory and disk metrics. This blogpost is about metrics further up the stack: at the application middleware, the application, and user usage.

To the cloud

Application Metrics

Maybe grumpy sysadmins have scared the developers and the business off to the cloud. It seems that the space of application metrics, whether it's Ruby, Java or PHP, is ruled today by New Relic. In a blogpost, New Relic describes serving about 20 billion metrics a day.

It allows for easy instrumentation of Ruby apps, but they also have support for PHP, Java, .NET, and Python.

Part of the secret of their success is how easily developers can get metrics from their application: by adding a few files and a token.
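For a Ruby app that typically boils down to something like the following (illustrative only; check New Relic's docs for the real config template):

# Gemfile: pull in the agent
gem 'newrelic_rpm'

# config/newrelic.yml then needs little more than your account token:
#
#   common: &default_settings
#     license_key: '<your license key>'
#     app_name: My Application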

Several other cloud monitoring vendors are stepping into the arena, and I really hope to see them grow the space and provide some competition:

Some other complementary services, popular amongst developers are:

Check this blogpost on Monitoring/Reporting with Signal, Pingdom, Proby, Graphite, Monit, Pagerduty and Airbrake to see how they make a powerful team.

User tracking Metrics - Cloud

Clicks, page views, etc ...

Besides the application metrics, there is one other major player in web metrics: Google Analytics.

I found several tools to get data out of it using the Google Analytics API.

With Google Analytics there is always a delay in getting your data.

If you want realtime statistics/metrics, check out Gaug.es - http://get.gaug.es:

A/B Testing

I haven't really gotten into this, but getting metrics out of A/B testing is well worth exploring.

Page render time

Another important thing to track is page render time. This is well explained in Real User Monitoring - Chapter 10 of Complete Web Monitoring - O'Reilly Media.

Again, New Relic provides RUM: Real User Monitoring. See How we provide real user monitoring: A quick technical review for more technical info.

Who needs a cloud anyway

Putting your metrics into the cloud can be very convenient, but it has downsides:

  • most tools don't have a way to redirect/replicate the metrics they collect internally
  • that makes it hard to correlate them with your internal metrics
  • it's easy to get metrics in, but hard to get the full/raw data out again
  • it depends on the internet, duh, and sometimes that fails :)
  • privacy concerns, or the sheer volume of metrics, can make it impossible to put them in the cloud

Application Metrics - Non - Cloud

In his epic talk Metrics, Metrics Everywhere, Coda Hale explains the importance of instrumenting your code with metrics. This looks very promising, as it is really driven from the developer world:

Java

Or you can always use JMX to get metrics from your application.

And with JMXTrans http://code.google.com/p/jmxtrans you can feed JMX information into Graphite, Ganglia, or Cacti/RRDtool.
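JMXTrans jobs are described in JSON. From memory, a config along these lines polls a JMX MBean and ships it to Graphite; hosts, ports and the exact schema should be checked against the project docs:

{
  "servers": [{
    "host": "localhost",
    "port": "1099",
    "queries": [{
      "obj": "java.lang:type=Memory",
      "attr": ["HeapMemoryUsage"],
      "outputWriters": [{
        "@class": "com.googlecode.jmxtrans.model.output.GraphiteWriter",
        "settings": { "host": "graphite.example.com", "port": 2003 }
      }]
    }]
  }]
}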

Other

Etsy style: StatsD

To collect various metrics, Etsy has created StatsD https://github.com/etsy/statsd - a network daemon for aggregating statistics (counters and timers), rolling them up, and then sending them to Graphite.

Clients have been written in many languages: PHP, Java, Ruby, etc.
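The wire protocol is simple enough that a client is a few lines of UDP. A minimal Ruby sketch (host, port and metric names illustrative):

require 'socket'

sock = UDPSocket.new

# Counter: increment 'logins' by one
sock.send('logins:1|c', 0, 'localhost', 8125)

# Timer: report a 320ms render time
sock.send('render_time:320|ms', 0, 'localhost', 8125)

# Sampled counter: sent 10% of the time, flagged so statsd scales it back up
sock.send('logins:1|c|@0.1', 0, 'localhost', 8125) if rand < 0.1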

Other companies have been raving about the benefits of StatsD; Shopify, for example, has completely integrated it into their environment.

It's incredible to see the power and simplicity of this; I've created a simple proof of concept to expose the StatsD metrics over ZeroMQ in this experimental fork.

MetricsD https://github.com/tritonrc/metricsd tries to marry Etsy's StatsD with the Coda Hale / Yammer Metrics library for the JVM and puts the data into Graphite. It should be drop-in compatible with Etsy's statsd, although it adds explicit support for meters (with the m type) and gauges (with the g type) and introduces the h (histogram) type as an alias for timers (ms).

User tracking - Non Cloud

Clicks, page views, etc ...

Here are some open source web analytics libraries. These are merely links; I haven't investigated them enough yet. Work in progress.

Another tool worth mentioning for tracking end users is HummingBird - http://hummingbirdstats.com/. It is NodeJS-based and allows for realtime web traffic visualization. To send metrics it has a very simple UDP protocol.

A/B Testing

At Arrrrcamp I saw a great presentation on A/B testing by Andrew Nesbitt (@teabass). Do watch the video to get inspired!

He pointed out several A/B testing frameworks:

And presented his own A/B Testing framework: Split - http://github.com/andrew/split

It would be interesting to integrate this further into traditional monitoring/metrics tools: viewing metrics per new version, per enabled flag, etc... In a nutshell: food for thought.

Page render time

For checking page render time, I could not really find open source alternatives.

There is a paper by Steve Souders about Episodes http://stevesouders.com/episodes/paper.php. Or you can track your Apache logs with Mod Log I/O.

Conclusion

It's exciting to see the crossover between development, operations and business. Up until now only New Relic has a very well integrated suite for all metrics. I hope the internal solutions catch up.

Now that we have all that data, it's time to talk about dashboards and visualization. On to the next blogpost.

If you are using other tools, have ideas, feel free to add them in the comments.

Graphite, JMXTrans, Ganglia, Logster, Collectd, say what ?

Given that @patrickdebois is working on improving data collection, I thought it would be a good idea to describe the setup I currently have hacked together.

(Something which can be used as a starting point to improve stuff, and I have to write documentation anyhow)

I currently have 3 sources and one target, which will eventually expand to at least another target and most probably more sources too.

The 3 sources are basically typical system data, which I collect using collectd. However, I'm using collectd-carbon from https://github.com/indygreg/collectd-carbon.git to send the data to Graphite.
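For anyone wanting to wire up yet another source: everything here ultimately speaks Carbon's plaintext protocol - one "metric.path value timestamp" line per datapoint, TCP port 2003 by default. A minimal Ruby sketch (hostname and metric invented):

require 'socket'

# Push a single datapoint straight into Carbon.
sock = TCPSocket.new('graphite.example.com', 2003)
sock.puts("servers.web01.load.shortterm 0.42 #{Time.now.to_i}")
sock.close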

I'm parsing the Apache and Tomcat logfiles with logster, currently sending them only to Graphite, but logster has an option to send them to Ganglia too.

And I'm using JMXTrans to collect JMX data from Java apps that have this data exposed and send it to Graphite. (JMXTrans also comes with a Ganglia target option.)

Rather than going in depth over the config, it's probably easier to point to a Vagrant box I built - https://github.com/KrisBuytaert/vagrant-graphite - which brings up a machine that does pretty much all of this on localhost.

Obviously it's still a work in progress and lots of classes will need to be parameterized and cleaned up. But it's a working setup, and not just on my machine ..