↓ Archives ↓

Category → complexity

Methods for relocating network connectivity

Methods for redirecting network connectivity

When a client talks to a service over a network, and the server providing the service fails, or the service needs to be moved for administrative reasons, what methods are available for redirecting network references to that service refer to a different server?

This post outlines the methods that I know of for doing this.  But note the word redirect - that word is the key to all these methods.  These different methods are ways of redirecting various layers in the networking stack.  So let's first look at what happens for a normal IPv4 network connection over ethernet, and what all layers are involved, and what all places there are to redirect (or reroute) the network traffic to another server.

For simplicity's sake, we will ignore load balancers - both in hardware form, and DNS-level load balancers, and we assume a modern "switched" network.

Information Routing

What happens when a client establishes a connection to a service on a server?

Here is a brief synopsis of how connections get established.

  1. The client is given or obtains from configuration information, bookmarks, etc. a name of the server running that service.
  2. The client consults a DNS server to translate the server/service name into a 32-bit IPv4 address.
  3. The client holds onto this name/IPv4 mapping in order to optimize future references to the server name.  DNS lookup libraries normally do this themselves, but some applications also perform their own address caching.
  4. The client then examines the IPv4 address, and determines which interface and gateway to send it out on on the basis of its local configuration and the IPv4 address itself.
  5. The client OS then sends out an ARP packet to determine the 48-bit Media Access Control (MAC) address of the gateway, or the server itself, if the client and server are on the same subnet.  It may have this MAC/IP correspondence cached from earlier packets it had received from the server.
  6. The client OS sends out the packet to the MAC address determined in step 5 over the interface selected in step 4.
  7. At some earlier time, the switch network will have "learned" which switch port the corresponds to the selected MAC address.  It does this by observing which port sends packets for that given MAC address.
  8. The switch network then routes the packet to the chosen MAC address on the subnet (this could either be the MAC address of the gateway or the server - as discussed earlier).
  9. If the server is on the same subnet as the client, it receives the packet and examines the packet to see if the destination IP address is one it provides.  If it does, then all is well.  If it does not, then the packet is dropped.    So ends the "same subnet" case.
  10. Assuming the server is not on the same subnet as the client...
  11. The gateway receives the packet, and examines its routing table to decide where to route the packet to.  This is determined by the routing protocol the gateway is running - for example, OSPF or BGP.
  12. The "network cloud" routes the packet to the "final gateway" on the same subnet as the destination server.  As before, this is determined by the various routing protocol(s) along the way from the first gateway to the last one.  (This explanation is similar to the "then a miracle occurs" in the middle of a math proof).
  13. The final gateway sends out an ARP packet to determine the MAC address of the destination server.  It is typically cached for a few minutes up to an hour.
  14. The final gateway then sends the packet out to the MAC address above over the selected interface based on the routing protocol it is running.
  15. The destination server receives the packet and examines the packet to see if the destination IP address is one it provides.  If it does, then all is well.  If it does not, then the packet is dropped.

There are several address transformations that transform from one conceptual address space into another lower level address space.  These are:

  • Translation from "conceptual knowledge" of the server to the DNS name of the server.
  • Translation from the DNS server name to the destination IPv4 address
  • Translation of the destination IPv4 address to the destination gateway using routing information.
  • Translation from the destination IPv4 address to the destination MAC address.
  • Translation from MAC address to destination switch port.

Each of these transformations is a place where a redirection can occur.

  • The conceptual knowledge layer can be redirected by telling all clients to switch to a new server name.
  • The DNS layer can be redirected by updating DNS entries.
  • The network routing layer can be redirected by updating routing information in the network and pushing out the new route information.
  • The IPv4->MAC layer can be redirected by updating the ARP information and forcing the various ARP caches to be updated.
  • The MAC->switch port can be redirected by updating MAC addresses and forcing the switch network to learn the new MAC->switch correspondence.

Subsequent sections present detailed explanations of how to perform these various kinds of redirections.

Conceptual Knowledge Layer

There is no universal automated to update the conceptual knowledge layer - nevertheless, server relocations are sometimes handled at this layer.  One can use automated client update tools to update client configuration files, one can use word-of-mouth, email, or any number of ad hoc tools.  This is the least commonly used method for redirection on failure.  Arguably, since it is hard to automate, it doesn't have much place on a blog on managing with automation.

DNS layer

Updating the DNS layer can be easily automated.  The advantages are - it's universal, and little or no prior preparation has to occur, and no server/network political boundaries have to be dealt with, and the two servers don't have to be on the same subnet.  The disadvantages are - not all clients use DNS addresses, Client OS DNS caching can interfere, Client software itself can interfere by caching the address outside DNS.  Even with Dynamic DNS, it can take minutes to hours for changes to propagate and the new server address become known (and usable) to all clients.  If the client application caches the address itself, then client applications have to be restarted.  This last subcase can be difficult to automate.

Network routing layer

If a server fails, routing can be used to redirect the traffic for the failed server to another server on a different physical access segment.  The advantages of this are - the two servers don't have to be on the same network segment, routing protocols are designed to deal with this kind of situation.  The disadvantages include - if the IP address is public, then  you have to move over at least 256 addresses at a time, there are often political boundaries making it hard for servers to automatically update network routing information, the additional routes for handling a large number of such movable addresses may slow down the routers involved.

ARP layer

When a server fails, another server can bring up the IP address of the dead service, update the ARP cache (typically using gratuitous ARPs - sometimes called ARP spoofing) and packets destined for the now-dead server go to the live one.  The advantages include:  IP address takeover can occur in less than a second, there is well-tested software for doing this, most organizations have a good method for allocating and managing additional IP addresses.  The disadvantages include:  The two servers have to be on the same network segment, some organizations lock down their network gear to make this "impossible" (which it doesn't - it just slows it down), and it typically increases the number of IP addresses needed by the servers and services.

MAC->switch port layer

MAC address takeover is a technique where a given network card is given multiple MAC addresses - one for an administrative address, and one for each group of independently-failable services.  Retraining the switches to understand which switch port services the given MAC address is accomplished by simply sending any IPv4 packet with the new MAC address.  The advantages include:  Takeover can be very fast and quite reliable.  The disadvantages include: the two servers have to be on the same network segment, some organizations lock down their network gear to make this "impossible" (which it doesn't - it just slows it down), and it typically increases the number of MAC addresses needed by the servers and services, organizations almost never have methods for allocating and managing MAC addresses like they do IP addresses.

Which Method is "Best"?

I have heard it said that when you ask an engineer a question the answer is always the same regardless of the question - "It Depends".  So it is here...

  • For servers on the same VLAN (or network segment) - IP address takeover (IP address spoofing) is the most common.
  • For servers on different VLANs or network segments:
    • Network route updating - if politically and technically feasible
    • DNS updating
    • Updating the conceptual knowledge layer
Note that there are some circumstances in which there is no easy answer.  You may have to change your network configuration, solve political problems, buy an external netblock, or execute various other combinations of uncomfortable or difficult steps.

Using virtualization to provide "HA at wholesale"

Traditionally, the way people have implemented high availability is by using a high-availability management package like Linux-HA[1], then configure it in detail for each application, file system mount, IP address and so on.  This traditional method works quite well, but can be a bit labor intensive - particularly when using custom or uncommon applications.  You may have to understand the structure of your applications, write some resource agents[2], debug them, and test them in detail.  In addition, every time you change your mount structure, or other details you've told your HA system, you have to be sure and update your HA configuration to match - or it might not fail over correctly the next time.

When you have good resource agents, your HA system will also recover from application failures - by restarting applications that have failed.  This is a good thing.  On the other hand, this is enough work that virtually no one runs all their applications in an HA configuration.  It's just too much work for most applications.  I call this traditional boutique-like method "HA at retail".  It works well, but it is a little costly to set up and maintain all the details just so.

With virtualization, another approach is possible, and (big surprise), I call it "HA at wholesale".  In this paradigm, instead of needing to write scripts for each type of application, you just have one resource agent - one for managing a virtual machine.  You also don't need to know the structure of the applications - the OS still starts them in whatever way it has been starting them all along.  Wow, this sounds good - less work, fewer chances for errors!  As expected, there is still no such thing as a free lunch here - you do wind up with some disadvantages.

For example, you can no longer easily detect the failure of an application.  In addition, if an application fails, the only thing you can do about it is reboot the entire virtual machine.   Inevitably, this takes longer than just restarting the failed application.

So, HA at wholesale has these properties:
  • Simple enough that you can implement it for every machine
  • Works well for hardware failures
  • When coupled with hardware predictive failure analysis[3] and smart HA software, outages can sometimes be completely avoided.
  • Can't easily detect or recover from application failures
  • The only thing you can do about any failure is reboot the virtual machine
HA at retail has these properties:
  • It is complex enough that you need to limit how broadly you apply it in your environment
  • Works well for hardware failures
  • It can easily detect and recover from application failures
  • Individual applications can easily be restarted - and don't require a reboot
HA software like Linux-HA[1] can manage either type of environment.  In an ideal world, one would like to be able to do both in the same software infrastructure.


[1] http://linux-ha.org/
[2] http://linux-ha.org/ResourceAgent
[3] http://www-05.ibm.com/hu/termekismertetok/xseries/dn/pfa.pdf

Watch that basket!

The computing industry has lots of trends, numerous buzzwords, and a number of hot topics.  Sometimes these are in conflict with each other, or at least start out that way...  But, in the end, there are often good ways to harmonize all these various things.

Let's wander into virtual machine territory again today.  If you have gone to the trouble to create a bunch of virtual machines, the chances are you hope to do a little server consolidation - because when that's properly done it can save you some money.

This sounds good, and indeed has lots of good things going for it.  It's buzzword compliant, it's green, it saves you green (money).  What's not to like?

To see what you might not like if this is all you do, let's take an example to make it obvious...

If you put all your virtual machines on one physical server, then if that server fails, you lose all your virtual machines.  If you put ten virtual machines on one server, then the impact of that server crashing is roughly ten times as great as if a single server crashed.    If you work at it, you might be able to consolidate the ten most critical virtual machines onto a single server - and bring your entire data center to a halt with just one crash - bringing a suddenly much more personal meaning to the term "shock and awe"

This is not typically what people are looking for in their data center - and could easily be one of those career-limiting mistakes that you'd like to avoid - unless you already have your next job lined up.

This falls under the "putting all your eggs into one basket" way of doing business.  This part of a famous quote - but not the whole quote.  Mark Twain said "Put all your eggs in the one basket and --- WATCH THAT BASKET"[1].  So, to follow Mark Twain's advice, we need to not just put our eggs into one basket, we also need to watch that basket.

As most of you already know, watching servers and services is most commonly done by high-availability software - something like Linux-HA[2].  A properly configured HA system will watch the basket for you, and keep the worst from happening to your basket, your servers or your career.

As you can see, doing virtualization for reasons of consolidation doesn't make much sense unless you also add management software (HA software or otherwise) to watch your basket of virtual machines for you.

In the end, it's easy to see that all these things are connected - virtualization, server consolidation, power savings (green computing), availability management, and you want to manage them all.

[1] http://herbison.com/herbison/broken_eggs_watch.html
[2] http://linux-ha.org/

A Complete Cluster Stack for Linux

Recently, I've had some folks ask me offline what exactly would a “complete” Linux cluster stack look like.  That's a good question, and this posting is intended to address that question.

So let's start with – what kind of cluster?  For the purposes of this posting, I'm primarily talking about a full-function high-availability enterprise-style cluster, not primarily a load balancing cluster, and not a high-performance scientific (Beowulf-style) cluster.

A few caveats before proceeding – much of what I'll reference below will be relative to the Linux-HA[1] framework, but the concepts are easily translated to any other clustering framework one might have in mind.

It's also worth noting that not every application, nor every configuration needs every component.   Adding unnecessary components adds complexity, and complexity is the enemy of reliability.

Many of the components (cluster filesystem, DLM) are primarily needed by cluster-aware applications.  Note that at this time (early 2008) very few applications are cluster-aware.

The Full Cluster Stack Exposed!

Below is a picture of the full cluster stack – which I'll describe in more detail later.  For the most part, the components higher up in the picture build on the components lower down in the picture.  To simplify the drawing, I didn't add all the who-uses-whom lines that one might want to make a detailed study of this subject.

Full_cluster_stack

 

Cluster Comm - Intracluster  communications

The most basic component any cluster needs is intracluster communications. There are a variety of different possibilities, but guaranteed packet delivery is a requirement.  Linux-HA has its own custom comm layer for doing this.  It's not perfect, but it works.  At one time in the past, we provided support for the AIS cluster APIs, and if you use OpenAIS today, then you can still have a reasonable cluster using Linux-HA and providing compatible support for the AIS protocols.  As will become clear, it's not a perfect configuration, but it's a reasonable one.  (Of course, like everything, it can always be improved even more)

Very large clusters (hundreds to tens of thousands of nodes) will likely require a different communication protocol, since most guaranteed delivery multicast protocols don't scale that high. 

Nevertheless, in an ideal world, all cluster components and cluster-aware applications would sit on top of the same set of communications protocols.

Membership – who's in the cluster?

Looking to the right of the cluster comm box on our architecture chart, you'll see the membership box.  The next basic function that a cluster has to provide is membership services.  Membership closely related to communication – since a simplistic view of membership is just who we can communicate with.  It is highly desirable that everyone in the cluster be able to communicate with everyone else. It's the job of the membership layer to provide this information to the cluster.

When your communication fails in weird ways, it's the job of the membership layer to present a view of the cluster that makes sense – in spite of the weird kinds of failures that might be going on.

If we eventually wind up with multiple kinds of communications methods, then we'll also have multiple ways of becoming a member.

Linux-HA (with or without OpenAIS) supports the AIS membership APIs.

Like I mentioned for communication, in an ideal world,  all cluster components would use exactly the same membership information.  However, it is important to note that the membership one uses must be computed using the communication method being used by the application.  So, unless every cluster-aware application uses both the common communication method and the common membership, it risks getting its membership out of sync with respect to its communication and other components using other communication methods.  In many cases, this can't be avoided.  Methods for coping with such discrepencies are discussed in more detail at the end of this post.

Fencing

Fencing is the ability to “disable” nodes not currently in our membership without their cooperation  Many of you will remember having discussed this in some detail in an earlier post[2]. As I explained in more detail there, fencing  is vital to ensure safe cluster operation

Our current implementation is STONITH[3]-based - STONITH == Shoot The Other Node In The Head

Quorum

Quorum is a topic we've talked about extensively in a couple of earlier posts [2] [4]. Quorum encapsulates the ability to determine whether the cluster can continue to operate safely or not

Quorum is tied closely to both fencing and membership.  In practice, as we discussed before, it is often highly desirable to implement multiple types of quorum.  Linux-HA currently provides multiple implementations and can provide more through plugins.  Like membership and communication, it is desirable for all cluster components to use the same quorum mechanism.  All the interesting and legal ways that quorum can interact with fencing and membership and the communication layer are too detailed for this posting.

Cluster Filesystems

Cluster filesystems allow multiple machines to sanely mount the same filesystem at the same time.  This is a great boon to parallel applications.  Cluster filesystems typically don't use the network or another server involved when doing bulk I/O. Each node mounting the filesystem is normally expected to have access to the data.  This typically requires a SAN.

Typically, this is done to improve performance, but convenience and manageability are common secondary goals  Cluster filesystems are related to, but distinct from, network filesystems like NFS and CIFS. On Linux, there are several cluster filesystems available including

  • Red Hat's GFS

  • Oracle's OCFS2

  • IBM's GPFS

  • etc.

Normally, when they're being used for performance reasons, cluster-aware applications are required.  You can't typically just run 'n' copies of your favorite cluster-ignorant application and have it work.  The filesystem won't scramble the data, but your application typically will.  It goes without saying that high-performance cluster filesystems run in the kernel – unlike all the other items we've talked about before.  Because of the high-performance, it's common for a cluster filesystem to have its own communication and membership code – not using the typically userland communications and membership code.  Since membership isn't high bandwidth or really low latency information, it is possible to feed membership from a user-space membership layer into the kernel.  Of course, then the membership and the communications layer are out of sync.  It is arguable whether this is an improvement or not.

Cluster filesystems typically need cluster lock managers – described in the next section.

DLM - Distributed Lock Manager

A DLM (Distributed Lock Manager) provides locking services across the cluster, and it's an interesting piece of code to implement them – particularly the error recovery. 

To some degree, DLMs are analogous to System V semaphores but – cluster-aware.  In addition, they provide much more sophisticated API and semantics.  Although DLM APIs are fairly well understood, there is no formal standard, so switching from one to another can be annoying.  Red Hat has a reasonable kernel-based DLM which they use with GFS.  DLMs commonly have their own separate communications and membership code.  The comments about getting membership from user-space and having them be potentially different from cluster filesytems also apply here.

Cluster Volume Managers

You might think that you really don't need a cluster-aware volume manager.  Sometimes you might be right.  More often, if you thought that, you'd be wrong.   A cluster volume manager is just like a regular volume manager – only cluster aware.  This is to keep different nodes from getting inconsistent views of the layout of a set of disks or volumes.   The current cluster-aware volume managers are EVMS and CLVM.  Only CLVM is expected to survive into the long term.

The big challenges for cluster volume managers are high-performance mirroring and snapshots.  These operations are potentially very difficult to implement right and fast.  Cluster-aware volume managers often have both kernel and user-space components.  The membership inconsistency issues here are similar to those for cluster filesystems and the DLM.

CRM - Cluster Resource Manager

Every HA cluster has something like a CRM, but they may divide up these functions differently.  Our CRM is a policy-based decision maker for what should run where – handling failed services and failed cluster nodes.

The CRM is similar to UNIX/Linux startup init scripts – it starts everything up – but across a cluster following some policies, and managing failures.

The Linux-HA CRM is arguably the best cluster resource manager around today – at least in terms of flexibility and power.  It has usability issues, and can be extended, but those are solvable.

The Linux-HA CRM function is largely divided between the PE and TE – which are described below.

PE - Policy Engine

The Policy Engine is a key component of the CRM and does two distinct things.

  • It determines what should run where (cluster layout)

  • It creates a graph of actions of how to get from the current state of affairs to the new desired state

This graph of actions is then given to the TE (described below).

The system would have more flexibility if he PE were split into two parts for these two functions, and supported plugins for the cluster layout function. 

It currently isn't aware of resource cost, nor of absolute resource limits and load balancing considerations, which complicate optimal placement.   Those would be good things to add to it in the future.  Having plugins for doing resource placement would also be a highly useful and desirable thing.

TE - Transition Engine

Receives a graph of actions to perform from the policy engine, then uses the LRM proxy to communicate with the LRMs to carry out the actions

Its main jobs are action sequencing, error detection and reporting

CIB - Cluster Information Base

The CIB manages information on cluster configuration and current status.  The cluster configuration includes the configuration and policies as defined by the system administrator.

Its key difficulty is to keep a consistent copy replicated across the cluster, resolving potential version differences.

All the data it manages is XML, and the CIB has a minimal knowledge of the structure of this XML.

LRM - Local Resource Manager

In the Linux-HA architecture, a local resource manager runs on every machine and carries out the tasks given to it. Everything that gets done gets carried out by the LRM.  Examples are:

  • start this resource

  • stop this resource

  • monitor this resource

  • migrate this resource

  • etc.

The LRM provides interface matching to the various kinds of resources through Resource Agents.  The Linux-HA LRM supports several classes of Resource Agents.

The LRM is not at all cluster-aware.  It can support an arbitrary number of clients, one of which is the LRM communications proxy (below).

LRM Communications Proxy

The LRM proxy communicates between the CRM and the LRMs on all the various machines.  This function is currently built into the CRM.  This architectural decision was based on expedience more than anything else.

To support larger clusters this needs to be separated out, made more scalable, and more flexible.  This would allow a large number of LRMs to be supported by a small number of LRM proxies.   In large systems, this would probably use the ClusterIP capability to provide load distribution (leveling) across multiple LRM proxies.

Init - Initialization and recovery

This code does really three things:

  • Sequences the startup of the cluster components

  • Recovers from component failures (restart or reboot)

  • Sequences the shutdown of all the various cluster components

This is currently provided by Linux-HA and bundled with the Linux-HA communications code.  This likely needs to be separated out to a separate proxy function (process) in the future.

Infrastructure

The Linux-HA infrastructure libraries (“clplumbing”) does a wide variety of things.  A few samples include:

  • Inter-Process Communication

  • Process management

  • Event management and scheduling

  • Many other miscellaneous functions

Surprisingly, these libraries amount to about 20K lines of code.

Quorum Daemon

The quorum daemon is an unusual daemon, because it's the only daemon we have that's intended to run outside the cluster proper.  It is instrumental in solving certain knotty quorum problems – especially for:

  • 2-node clusters (very common)

  • Split-site (disaster recovery) clusters

This was discussed extensively in previous postings [4] and [5].

Management Daemon

Provides Complete Authenticated Configuration and Status API.  This includes both information contained in the CIB, and also information about the communications configuration and so on.

The management daemon is used by the:

  • GUI

  • SNMP agent

  • CIM agent

Clients are authenticated using PAM, and all communications is via SSL, so its clients can safely be outside cluster, or even outside a firewall.  This daemon should provide different levels of authorization depending on the authenticated user, and should log its actions in a format suitable for Sarbanes-Oxley (SOX) auditing purposes.

GUI

Provides a Graphical User Interface providing configuration and status information.  It also supports creating and configuring the cluster.  Note that at the present time, there are a number of useful cluster configurations it cannot create.

CIM and SNMP agents

The CIM and SNMP agents provide CIM and SNMP management interfaces for systems management tools.  The CIM interface supports status updates and configuration changes, whereas the SNMP interfaces only report status.

Disadvantages of this architecture

For a variety of reasons, kernel space doesn't have access to user-space cluster communications or membership.

As a result, both the DLM and most cluster filesytems implements their own membership and communications.

This is in contradiction to the “ideal world” statements earlier.  This can result in some odd cases where one communication method is working in a particular case, but another method is not.  This results in differences in membership – which can have bad effects.

Why this might not be quite as bad as it seems

One reason why one might not worry about this as much as one might, is because it's a problem which one can't make go away.  A cluster system will always have to interface with software packages which do their own communication, and compute their own membership for a variety of usually good reasons.  As a result, this is a problem which we can't make go away.  Instead we have to deal with it effectively.  There are basically two cases to consider:

  1. The “Main” membership thinks that node X should not be in the cluster, whereas the “Other” membership thinks it should be.

  2. The “Other” membership thinks that node X should not be in the cluster, whereas the “Main” membership thinks it should be.

Let's take these two cases one at a time:

Case 1:

If the main membership thinks node X is not in the cluster, then it will simply not start any resources on node X.  This takes care of the problem.

Case 2:

If the “Other” membership discovers that a particular node should be dropped from its view of membership, and it can inform the CRM not to start its resources on that machine, then the local view of this membership from the perspective of the resources it deals with is effectively made to exclude these Other-errant nodes.  In the Linux-HA CRM this is easily done having the Other-resources write node attributes to cause those nodes to be excluded, and the rules would then be written to exclude those nodes from consideration for running Other-related resources.

Although Case 2 isn't pretty, it works, and no amount of wishing and hoping is likely to ever make this kind of problem go away in the general case - particularly when one involves proprietary applications  So, even if there is some membership discrepancy, it can is always possible to manage it appropriately assuming you can get a tiny bit of cooperation from the application.

References

[1] http://linux-ha.org/
[2] http://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html

[3] http://linux-ha.org/STONITH
[4] http://techthoughts.typepad.com/managing_computers/2007/10/more-about-quor.html
[5] http://techthoughts.typepad.com/managing_computers/2007/11/quorum-server-i.html

How Managed Virtualization (including HA) conflicts with System Management

Managed Virtualization Versus System Management

In an earlier post[1], I talked about a couple of kinds of virtualization, comparing two of them and highlighting their strengths.  This posting discusses how virtualization can confuse and confound conventional systems management - both automated and manual, and gives some thoughts on how to deal with it.

We all know that virtualization is a GoodThing(TM).  Therefore, it can't really have any disadvantages, can it?  <tongue-in-cheek-off> Unfortunately, it does have disadvantages.  The great strength of virtualization is its ability to break the ties between a service or operating system and the server which implements its service.  Many software systems and a good number of human beings find this confusing.  If I want to reboot a physical server, what services or operating systems will be disrupted by the reboot?

Conversely, if I want to do something to the machine that's running a particular service, which machine do I have to log into?  If you're running both service virtualization (conventional HA like Linux-HA[2]) on top of server virtualization (ala Xen or VMware), then you have a doubly difficult task - first you have to figure out which virtual machine is running a service, then you have to figure out which physical machine is running that particular virtual machine.

This can be really annoying and can easily result in system administrators[3] making mistakes either in the middle of the night, or when under pressure (which all sysadmins know is pretty much all the time).

Remember - Complexity is the Enemy of Reliability.   This is just another example of my favorite phrase at work.

And, if you want to have server monitoring software which tries to figure out whether a service is stopped and have it restart it, then it can also get confused by the fact that all these stupid servers and services are always moving around.  They just won't stay put!  Back in the olden days, you logged into a server and you edited the inittab, and you always knew what hardware it was running on and what server it was.  Now, with virtualization, and especially with virtualization management software, you never know what's where.

A Recipe for Chaos and Conflict

Your HA software and/or your virtualization management software can move things around on you.  Imagine that you have these four kinds of things in your data center:

  • High-Availability (HA/service-virtualization) management software

  • Virtualization management software

  • System management monitoring software

  • Human system administrators

This is a recipe for chaos, interspersed with the occasional career-limiting disaster. It's this kind of thing that leads system administrators to pull their hair out, and keep their resumes up to date.  None of these is bad by itself, in fact, each is a GoodThing(TM).  But they don't normally play well with each other. In typical myopic software design fashion, each of these layers is usually unaware of the other layers (except, of course for the last (human) layer - who has to make up for all the poor integration).

In addition, since the software layers typically aren't aware of all this wonderful virtualization going on, they can't really deal with the picture reliably.  They don't know what should be happening where, because it isn't fixed.  The various virtualization management packages keep changing things!

So, what's a body to do?  As far as I know, there are two basic options.

  1. Integrate the four layers of management with each other using things like CIM[4] and SNMP[5]

  2. Empower your HA software to also manage the server virtualization of your data center

Integration of Layers

Virtually every data center (sadly, pun intended) has a variety of server types and a variety of operating systems, and a variety of management software.  They mostly don't play well with each other.  Almost the only way to get them to play together - even if imperfectly - is to have them talk together using industry standard protocols.

Today, that means using SNMP or CIM.   Here is my personal view on the characteristics of these two protocols for your consideration.

  • SNMP - widely deployed - implemented in a truly compatible way, but far too weak for a job this hard.  SNMP is great for grabbing statistics, checking whether a server or router is up and what kind of load it is seeing in great detail.  Anything much beyond this, and the MIBs become 100% vendor-specific - meaning that cross-vendor integration breaks down - basically completely.  For HA clustering or virtualization management or worse yet the combination of the two - forget it.

  • CIM - widely deployed in expensive disk subsystems - but rarely deployed outside that.  It has newly developed models for virtualization and clustering, but like most standards they're mostly lowest-common-denominator standards, and unfortunately not widely deployed.  For example, Linux-HA[2] implements CIM, but unfortunately Linux-HA has tremendous power and capability which CIM can't begin to model.  So, this winds up being only possible to model using vendor-specific extensions - greatly weakening the possible integrations.

Now, I'm not saying that these two protocols are useless - far from it. Without open standards like CIM and SNMP, the prospect truly is hopeless.   But I am saying that integrating them in the typical-for-the-industry highly-heterogeneous data center is a challenge, and the more layers there are to integrate, the bigger the challenge.  Since standards necessarily trail industry practice, the more "bleeding edge" the topic (i.e., HA clustering or virtualization) and the more powerful the underlying tool (like Linux-HA), the greater the mismatch.

Here we have two bleeding edge topics and four layers.  Yikes!  Surely there must be some kind of alternative to this somewhat-unattractive mess.

Decrease The Layers and Let Them Manage Themselves

As I mentioned in my earlier virtualization posting, some HA packages (like Linux-HA) can also manage virtualization simultaneously.  So, one way of dealing with this is to let (or extend) your service virtualization product also manage your server virtualization.  One advantage of this approach is that service virtualization software (HA software) is comparatively mature technology, minimizing the risk.

Unfortunately, this doesn't yet go all the way in solving the problem either.  There are a few things that should change to make this really work well. These include

  • Support much larger HA clusters - hundreds to thousands of nodes.  In an ideal world, you'd really like fewer of these HA/virtualization clusters as you can get.  Today you'd typically have to have one of these clusters for every 8-32 physical servers - which makes an awfully lot of these things to manage in a data center containing hundreds or thousands of servers.

  • Integrate with many virtualization layers - Such a product would need to integrate with Xen, IBM System Z, IBM System P, Linux KVM, VMware, and future virtualization layers like the one promised by Microsoft.   This isn't rocket science, but by the time you're done, it will be some work.

  • Support monitoring and controlling services inside the virtual machine - Otherwise you haven't really integrated the two layers - and you wind up running some HA software inside some of the virtual machines.  Again, this isn't rocket science, but it will require some work[1] for each operating system you want to manage services for.

  • Integrate with provisioning systems - so that you can add and delete virtual machines and allocate disk to them and their applications with fewer possibilities for error, and more automation.

None of these items are technically difficult, and none of them are prohibitively expensive to implement.  Given that I'm the project leader for Linux-HA, and Linux-HA is one of the most capable HA products around, you might imagine that some of these thoughts are on my mind for our future  ;-).  Of course, that doesn't eliminate the necessity for integration with the remaining layers above, which is why Linux-HA implements both CIM and SNMP.  This allows the virtualization management infrastructure to actively and autonomically manage  servers and services, while letting it bubble up events (especially those it can't automatically recover from) to the management consoles and humans via protocols like SNMP and/or CIM.

Conclusions

Virtualization technologies add complexity to the data center along with the benefits they bring, and in the process may render the existing management facilities less than useful.  However, if HA and Virtualization management are performed by a single entity, and open standards like CIM and SNMP are used, systems can be active the problems can be minimized.

See Also

Preparing for Virtual Management http://www.itbusinessedge.com/blogs/dcc/?p=276

References

[1] http://techthoughts.typepad.com/managing_computers/2007/09/virtualization-.html
[2] http://linux-ha.org/
[3] http://linux-ha.org/SysAdmin
[4] http://www.dmtf.org/standards/cim/
[4] http://en.wikipedia.org/wiki/Common_Information_Model_%28computing%29
[5] http://en.wikipedia.org/wiki/Simple_Network_Management_Protocol

Availability, MTBF, MTTR and other bedtime tales

If we let A represent availability, then the simplest formula for availability is:
    A = Uptime/(Uptime + Downtime)

Of course, it's more interesting when you start looking at the things that influence uptime and downtime.  The most common measures that can be used in this way are MTBF and MTTR.

    MTBF is  Mean Time Between Failures
    MTTR is Mean Time To Repair   

 A = MTBF / (MTBF+MTTR)

One interesting observation you can make when reading this formula is that if you could instantly repair everything (MTTR = 0), then it wouldn't matter what the MTBF is - Availability would be 100% (1) all the time.

That's exactly what HA clustering tries to do.  It tries to make the MTTR as close to zero as it can by automatically (autonomically) switching in redundant components for failed components as fast as it can.   Depending on the application architecture and how fast failure can be detected and repaired, a given failure might not be observable by at all by a client of the service.  If it's not observable by the client, then in some sense it didn't happen at all.  This idea of viewing things from the client's perspective is an important one in a practical sense, and I'll talk about that some more later on.

It's important to realize that any given data center, or cluster provides many services, and not all of them are related to each other.  Failure of one component in the system may not cause failure of the system.  Indeed, good HA design eliminates single points of failure by introducing redundancy.  If you're going to try and calculate MTBF in a real-life (meaning complex) environment with redundancy and interrelated services, it's going to be very complicated to do.

    MTBFx is  Mean Time Between Failures for entity x
    MTTRx is Mean Time To Repair for entity x
    Ax is the Availability of entity x   

Ax = MTBFx / (MTBFx+MTTRx)

In practice, these measures (MTBFx and MTTRx) are hard to come by for nontrivial real systems - in fact, they're so tied in to application reliability and architecture, hardware architecture, deployment strategy, operational skill and training, and a whole host of other factors, that you can actually compute them only very very rarely.  So, why did I spend your time talking about it?  That's simple - although you probably won't compute them, you can learn some important things from these formulas, and you can see how mistakes you make in viewing these formulas might lead you to some wrong conclusions.

Let's get right into one example of a wrong conclusion you might draw from incorrectly applying these formulas.

Let's say we have a service which runs on a single machine, which you put onto a cluster composed of two computers with a certain individual MTBF (Mi) and you can fail over to the other computer ("repair") a computer in a certain repair time (Ri).  With two computers, they'll fail twice as often as a single computer, so the system MTBF becomes Mi/2.  If you compute the availability of the cluster, it then becomes:

    A = Mi/2 / (Mi/2+Ri)

Using this (incorrect) analysis for a 1000 node cluster performing the same service, the system MTBF becomes Mi/1000.

    A = Mi/1000 / (Mi/1000+Ri)

If you take the number of nodes in the cluster to the limit (approaching infinity), the Availability approaches zero.

    A = 0/(0+Ri) = 0/Ri = 0

This makes it appear that adding cluster nodes decreases availability.  Is this really true?  Of course not!  The mistake here is thinking that the service needed all those  cluster nodes to make it go.  If your service was a complicated interlocking scientific computation that would stop if any cluster node failed, then this model might be correct.  But if the other nodes were providing redundancy or unrelated services, then they would have no effect on MTBF of the service in question.  Of course, as they break, you'd have to repair them, which would mean replacing systems more and more often, which would be both annoying and expensive, but it wouldn't cause the service availability to go down.

To properly apply these formulas, even intuitively, you need to make sure you understand what your service is, how you define a failure, how the service components relate to each other, and what happens when one of them fails.  Here are a few rules of thumb for thinking about availability

  • Complexity is the enemy of reliability (MTTR).  This can take many forms
    • Complex software fails more often than simple software
    • Complex hardware fails more often than simple hardware
    • Software dependencies usually mean that if any component fails, the whole service fails
    • Configuration complexity lowers the chances of the configuration being correct
    • Complexity drastically increases the possibility of human error
      • What is complex software? - Software whose model of the universe doesn't match that of the staff who manage it.
  • Redundancy is the friend of availability - it allows for quick autonomic recovery - significantly improving MTTR.  Replication is another word for redundancy.
  • Good failure detection is vital - HA and other autonomic software can only recover from failures it detects.  Undetected failures have human-speed MTTR or worse, not autonomic-speed MTTR.  They can be worse than human-speed MTTR because the humans are surprised that it wasn't automatically recovered and they respond more slowly than normal.  In addition, the added complexity of correcting an autonomic service and trying to keep their fingers out of the gears may slow down their thought processes.
  • Non-essential components don't count - failure of inactive or non-essential components doesn't affect service availability.  These inactive components can be hardware (spare machines), or software (like administrative interfaces), or hardware only being used to run non-essential software.  More generally, for the purpose of calculating the availability of service X, non-essential components include anything not running service X or services essential to X.

The real world is much more complex than any simple rules of thumb like these, but these are certainly worth taking into account.