Methods for relocating network connectivity
Methods for redirecting network connectivity
When a client talks to a service over a network, and the server providing the service fails, or the service needs to be moved for administrative reasons, what methods are available for redirecting network references to that service so that they refer to a different server?
This post outlines the methods that I know of for doing this. But note the word redirect - that word is the key to all these methods. These different methods are ways of redirecting various layers in the networking stack. So let's first look at what happens for a normal IPv4 network connection over ethernet, and what all layers are involved, and what all places there are to redirect (or reroute) the network traffic to another server.
For simplicity's sake, we will ignore load balancers - both in hardware form, and DNS-level load balancers, and we assume a modern "switched" network.
Information Routing
What happens when a client establishes a connection to a service on a server?
Here is a brief synopsis of how connections get established.
- The client is given or obtains from configuration information, bookmarks, etc. a name of the server running that service.
- The client consults a DNS server to translate the server/service name into a 32-bit IPv4 address.
- The client holds onto this name/IPv4 mapping in order to optimize future references to the server name. DNS lookup libraries normally do this themselves, but some applications also perform their own address caching.
- The client then examines the IPv4 address, and determines which interface and gateway to send the packet out on, on the basis of its local configuration and the IPv4 address itself.
- The client OS then sends out an ARP packet to determine the 48-bit Media Access Control (MAC) address of the gateway, or the server itself, if the client and server are on the same subnet. It may have this MAC/IP correspondence cached from earlier packets it had received from the server.
- The client OS sends out the packet to the MAC address determined in step 5 over the interface selected in step 4.
- At some earlier time, the switch network will have "learned" which switch port corresponds to the selected MAC address. It does this by observing on which port packets with that source MAC address arrive.
- The switch network then routes the packet to the chosen MAC address on the subnet (this could either be the MAC address of the gateway or the server - as discussed earlier).
- If the server is on the same subnet as the client, it receives the packet and examines the packet to see if the destination IP address is one it provides. If it does, then all is well. If it does not, then the packet is dropped. So ends the "same subnet" case.
- Assuming the server is not on the same subnet as the client...
- The gateway receives the packet, and examines its routing table to decide where to route the packet to. This is determined by the routing protocol the gateway is running - for example, OSPF or BGP.
- The "network cloud" routes the packet to the "final gateway" on the same subnet as the destination server. As before, this is determined by the various routing protocol(s) along the way from the first gateway to the last one. (This explanation is similar to the "then a miracle occurs" in the middle of a math proof).
- The final gateway sends out an ARP packet to determine the MAC address of the destination server. It is typically cached for a few minutes up to an hour.
- The final gateway then sends the packet out to the MAC address above over the selected interface based on the routing protocol it is running.
- The destination server receives the packet and examines the packet to see if the destination IP address is one it provides. If it does, then all is well. If it does not, then the packet is dropped.
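The first few steps above can be sketched in a few lines of Python. This is only an illustration of the decisions the client makes, not how any real network stack is written; the function names and the example addresses (taken from the documentation ranges) are my own assumptions.

```python
import ipaddress
import socket


def resolve_ipv4(name):
    """Translate a server name into an IPv4 address using the system resolver."""
    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples;
    # for AF_INET the sockaddr is (address, port).
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
    return infos[0][4][0]


def arp_target(dst_ip, local_network, gateway_ip):
    """Return the IP address whose MAC address the client has to ARP for."""
    dst = ipaddress.ip_address(dst_ip)
    net = ipaddress.ip_network(local_network)
    # Same subnet: ARP for the server itself.
    # Different subnet: ARP for the gateway and let it route the packet.
    return str(dst) if dst in net else gateway_ip


# Hypothetical client on 192.0.2.0/24 with gateway 192.0.2.1.
print(arp_target("192.0.2.20", "192.0.2.0/24", "192.0.2.1"))
print(arp_target("198.51.100.7", "192.0.2.0/24", "192.0.2.1"))
```

resolve_ipv4() is not called here because it needs a working resolver; in real use its result would be fed into arp_target().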
There are several address transformations that transform from one conceptual address space into another lower level address space. These are:
- Translation from "conceptual knowledge" of the server to the DNS name of the server.
- Translation from the DNS server name to the destination IPv4 address
- Translation of the destination IPv4 address to the destination gateway using routing information.
- Translation from the destination IPv4 address to the destination MAC address.
- Translation from MAC address to destination switch port.
Each of these transformations is a place where a redirection can occur.
- The conceptual knowledge layer can be redirected by telling all clients to switch to a new server name.
- The DNS layer can be redirected by updating DNS entries.
- The network routing layer can be redirected by updating routing information in the network and pushing out the new route information.
- The IPv4->MAC layer can be redirected by updating the ARP information and forcing the various ARP caches to be updated.
- The MAC->switch port can be redirected by updating MAC addresses and forcing the switch network to learn the new MAC->switch correspondence.
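One way to picture these layers is as a chain of lookup tables, where redirecting means changing one entry in one table. The sketch below is a toy model under my own assumptions (same-subnet case only, so the routing layer is left out; all names, addresses and port numbers are made up).

```python
NAME = "service.example.com"
MAC_A = "02:00:00:00:00:0a"   # network card of the original server
MAC_B = "02:00:00:00:00:0b"   # network card of the standby server


def fresh_tables():
    """Build the three lookup tables for a healthy system."""
    dns = {NAME: "192.0.2.10"}                         # name -> IPv4
    arp = {"192.0.2.10": MAC_A, "192.0.2.11": MAC_B}   # IPv4 -> MAC
    switch = {MAC_A: 3, MAC_B: 7}                      # MAC  -> switch port
    return dns, arp, switch


def deliver(name, dns, arp, switch):
    """Follow the chain of translations and return (ip, mac, port)."""
    ip = dns[name]
    mac = arp[ip]
    port = switch[mac]
    return ip, mac, port


# Redirect at the DNS layer: the name now points at the standby's address.
dns, arp, switch = fresh_tables()
dns[NAME] = "192.0.2.11"
print(deliver(NAME, dns, arp, switch))

# Redirect at the IPv4->MAC layer: the standby takes over the IP address.
dns, arp, switch = fresh_tables()
arp["192.0.2.10"] = MAC_B
print(deliver(NAME, dns, arp, switch))

# Redirect at the MAC->switch port layer: the standby takes over the MAC.
dns, arp, switch = fresh_tables()
switch[MAC_A] = 7
print(deliver(NAME, dns, arp, switch))
```

In all three cases the packet ends up on port 7, where the standby server lives; what differs is which table had to change, and therefore which caches have to be flushed.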
Subsequent sections present detailed explanations of how to perform these various kinds of redirections.
Conceptual Knowledge Layer
There is no universal automated way to update the conceptual knowledge layer - nevertheless, server relocations are sometimes handled at this layer. One can use automated client update tools to update client configuration files, or one can use word-of-mouth, email, or any number of ad hoc tools. This is the least commonly used method for redirection on failure. Arguably, since it is hard to automate, it doesn't have much place on a blog on managing with automation.
DNS layer
Updating the DNS layer can be easily automated. The advantages are: it's universal, little or no prior preparation has to occur, no server/network political boundaries have to be dealt with, and the two servers don't have to be on the same subnet. The disadvantages are: not all clients use DNS names, client OS DNS caching can interfere, and client software itself can interfere by caching the address outside DNS. Even with Dynamic DNS, it can take minutes to hours for changes to propagate and the new server address to become known (and usable) to all clients. If the client application caches the address itself, then client applications have to be restarted. This last subcase can be difficult to automate.
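To see why caching gets in the way, here is a minimal sketch of a client-side cache with a time-to-live. It is not the behavior of any particular resolver library; the class name, the 300-second TTL and the fake clock are assumptions made for the illustration.

```python
class TtlCache:
    """A tiny name -> address cache that trusts entries until they expire."""

    def __init__(self, ttl_seconds, clock):
        self.ttl = ttl_seconds
        self.clock = clock          # callable returning the current time
        self.entries = {}           # name -> (address, expiry time)

    def lookup(self, name, resolver):
        now = self.clock()
        entry = self.entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]         # still cached: the resolver is not asked
        address = resolver(name)
        self.entries[name] = (address, now + self.ttl)
        return address


# Hypothetical DNS data and a fake clock we can move forward by hand.
records = {"service.example.com": "192.0.2.10"}
now = [0]
cache = TtlCache(300, lambda: now[0])

print(cache.lookup("service.example.com", records.get))   # original server

records["service.example.com"] = "192.0.2.11"             # failover in DNS
now[0] = 100
print(cache.lookup("service.example.com", records.get))   # stale answer

now[0] = 301
print(cache.lookup("service.example.com", records.get))   # new server at last
```

An application that caches the address forever behaves like this cache with an infinite TTL, which is why it has to be restarted.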
Network routing layer
If a server fails, routing can be used to redirect the traffic for the failed server to another server on a different physical access segment. The advantages of this are: the two servers don't have to be on the same network segment, and routing protocols are designed to deal with this kind of situation. The disadvantages include: if the IP address is public, then you have to move over at least 256 addresses at a time; there are often political boundaries making it hard for servers to automatically update network routing information; and the additional routes for handling a large number of such movable addresses may slow down the routers involved.
ARP layer
When a server fails, another server can bring up the IP address of the dead service, update the ARP caches (typically using gratuitous ARPs - sometimes called ARP spoofing), and packets destined for the now-dead server go to the live one. The advantages include: IP address takeover can occur in less than a second, there is well-tested software for doing this, and most organizations have a good method for allocating and managing additional IP addresses. The disadvantages include: the two servers have to be on the same network segment, some organizations lock down their network gear to make this "impossible" (which it isn't - it just slows things down), and it typically increases the number of IP addresses needed by the servers and services.
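For the curious, a gratuitous ARP is just an ordinary ARP packet in which the sender and target IP address are the same. The sketch below only builds the bytes of such a frame (the request form, broadcast to everyone); it doesn't send anything, since that needs a raw socket and root privileges. The function names and the example MAC and IP addresses are my own assumptions.

```python
import socket
import struct


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into its 6-byte form."""
    return bytes.fromhex(mac.replace(":", ""))


def gratuitous_arp(mac, ip):
    """Build an Ethernet frame announcing that `ip` now lives at `mac`."""
    sender_mac = mac_to_bytes(mac)
    sender_ip = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, our source, EtherType 0x0806 (ARP).
    ethernet = b"\xff" * 6 + sender_mac + struct.pack("!H", 0x0806)

    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, operation 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += sender_mac + sender_ip        # sender hardware and protocol address
    arp += b"\x00" * 6 + sender_ip       # target MAC zeroed, target IP = sender IP

    return ethernet + arp


frame = gratuitous_arp("02:00:00:00:00:0b", "192.0.2.10")
print(len(frame), frame.hex())
```

Every machine on the segment that sees this frame and already has an ARP cache entry for the IP address will normally replace the old MAC address with the new one.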
MAC->switch port layer
MAC address takeover is a technique where a given network card is given multiple MAC addresses - one for an administrative address, and one for each group of independently-failable services. Retraining the switches to understand which switch port services the given MAC address is accomplished by simply sending any IPv4 packet with the new MAC address. The advantages include: takeover can be very fast and quite reliable. The disadvantages include: the two servers have to be on the same network segment, some organizations lock down their network gear to make this "impossible" (which it isn't - it just slows things down), it typically increases the number of MAC addresses needed by the servers and services, and organizations almost never have methods for allocating and managing MAC addresses like they do IP addresses.
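The switch "retraining" can be modeled in a few lines. This is a toy learning switch under my own assumptions (a single switch, made-up MAC addresses and port numbers), not a description of any vendor's implementation.

```python
class LearningSwitch:
    """A toy switch that learns MAC -> port from the source of each frame."""

    def __init__(self):
        self.table = {}

    def receive(self, in_port, src_mac, dst_mac):
        # Learn (or re-learn) where the sender lives.
        self.table[src_mac] = in_port
        # Return the port to forward to, or None if the frame must be flooded.
        return self.table.get(dst_mac)


SERVICE_MAC = "02:00:00:00:00:99"   # hypothetical MAC address of the service
CLIENT_MAC = "02:00:00:00:00:01"
BROADCAST = "ff:ff:ff:ff:ff:ff"

switch = LearningSwitch()
switch.receive(1, SERVICE_MAC, BROADCAST)           # original server on port 1
print(switch.receive(5, CLIENT_MAC, SERVICE_MAC))   # forwarded to port 1

# Takeover: the standby on port 2 sends any frame using the service MAC.
switch.receive(2, SERVICE_MAC, BROADCAST)
print(switch.receive(5, CLIENT_MAC, SERVICE_MAC))   # now forwarded to port 2
```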
Which Method is "Best"?
I have heard it said that when you ask an engineer a question the answer is always the same regardless of the question - "It Depends". So it is here...
- For servers on the same VLAN (or network segment) - IP address takeover (IP address spoofing) is the most common.
- For servers on different VLANs or network segments:
  - Network route updating - if politically and technically feasible
  - DNS updating
  - Updating the conceptual knowledge layer
Using virtualization to provide "HA at wholesale"
When you have good resource agents, your HA system will also recover from application failures - by restarting applications that have failed. This is a good thing. On the other hand, this is enough work that virtually no one runs all their applications in an HA configuration. It's just too much work for most applications. I call this traditional boutique-like method "HA at retail". It works well, but it is a little costly to set up and maintain all the details just so.
With virtualization, another approach is possible, and (big surprise), I call it "HA at wholesale". In this paradigm, instead of needing to write scripts for each type of application, you just have one resource agent - one for managing a virtual machine. You also don't need to know the structure of the applications - the OS still starts them in whatever way it has been starting them all along. Wow, this sounds good - less work, fewer chances for errors! As expected, there is still no such thing as a free lunch here - you do wind up with some disadvantages.
For example, you can no longer easily detect the failure of an application. In addition, if an application fails, the only thing you can do about it is reboot the entire virtual machine. Inevitably, this takes longer than just restarting the failed application.
So, HA at wholesale has these properties:
- Simple enough that you can implement it for every machine
- Works well for hardware failures
- When coupled with hardware predictive failure analysis[3] and smart HA software, outages can sometimes be completely avoided.
- Can't easily detect or recover from application failures
- The only thing you can do about any failure is reboot the virtual machine
By contrast, HA at retail has these properties:
- It is complex enough that you need to limit how broadly you apply it in your environment
- Works well for hardware failures
- It can easily detect and recover from application failures
- Individual applications can easily be restarted - and don't require a reboot
[1] http://linux-ha.org/
[2] http://linux-ha.org/ResourceAgent
[3] http://www-05.ibm.com/hu/termekismertetok/xseries/dn/pfa.pdf
Watch that basket!
The computing industry has lots of trends, numerous buzzwords, and a number of hot topics. Sometimes these are in conflict with each other, or at least start out that way... But, in the end, there are often good ways to harmonize all these various things.
Let's wander into virtual machine territory again today. If you have gone to the trouble to create a bunch of virtual machines, the chances are you hope to do a little server consolidation - because when that's properly done it can save you some money.
This sounds good, and indeed has lots of good things going for it. It's buzzword compliant, it's green, it saves you green (money). What's not to like?
To see what you might not like if this is all you do, let's take an example to make it obvious...
If you put all your virtual machines on one physical server, then if that server fails, you lose all your virtual machines. If you put ten virtual machines on one server, then the impact of that server crashing is roughly ten times as great as if a single server crashed. If you work at it, you might be able to consolidate the ten most critical virtual machines onto a single server - and bring your entire data center to a halt with just one crash - bringing a suddenly much more personal meaning to the term "shock and awe".
This is not typically what people are looking for in their data center - and could easily be one of those career-limiting mistakes that you'd like to avoid - unless you already have your next job lined up.
This falls under the "putting all your eggs into one basket" way of doing business. This is part of a famous quote - but not the whole quote. Mark Twain said "Put all your eggs in the one basket and --- WATCH THAT BASKET"[1]. So, to follow Mark Twain's advice, we need to not just put our eggs into one basket, we also need to watch that basket.
As most of you already know, watching servers and services is most commonly done by high-availability software - something like Linux-HA[2]. A properly configured HA system will watch the basket for you, and keep the worst from happening to your basket, your servers or your career.
As you can see, doing virtualization for reasons of consolidation doesn't make much sense unless you also add management software (HA software or otherwise) to watch your basket of virtual machines for you.
In the end, it's easy to see that all these things are connected - virtualization, server consolidation, power savings (green computing), availability management, and you want to manage them all.
[1] http://herbison.com/herbison/broken_eggs_watch.html
[2] http://linux-ha.org/
Virtual machine snapshots considered (nearly) worthless…
With apologies to Edsger Dijkstra...
Usually when people talk about virtual machine snapshotting, they include with it snapshotting both the server and any filesystems it's directly connected to. Although this is more complex than just snapshotting the virtual machine, it isn't that hard.
This works in some very narrow technical sense for some few cases, but it involves loss of data in every case. If you take a checkpoint every 30 minutes (or every 5 or whatever), then all the updates made during that period of time, are lost when you restore this snapshot and its storage to a consistent (but old) state. This means that all the checks you deposited during that time, or all the bonuses your boss put you in for during that time, or the books you ordered, or whatever, are lost. Lost to the point that they probably have to be restored manually - to the tune of great customer dissatisfaction.
In addition, if this application has connections as a client or as a server to other servers or clients, then the application and its immediately mounted storage may now be consistent, but unless you take simultaneous snapshots of this virtual machine and all the world it is connecting with (some of which may be outside your enterprise), and then restore your entire world to this older state, there are likely to be many client/server connections which will no longer work - because the client and server are in mutually inconsistent states.
The worst case of this is if you have a Service Oriented Architecture, where any given server is only a small part of the overall service - every service has connections to something else all the time, and to make matters worse, the clients and/or servers are often outside your own enterprise.
And, of course, don't forget that you lost transactions in the process too. So, a reboot interval of 1 to 3 minutes sounds really good by comparison, because all you'll lose in that case is transactions that were not yet committed - which are many fewer than the number of transactions lost by backing up to the previous checkpoint.
As an example of a common special case where this obviously doesn't work, imagine that the server in question is a file server. So, you restore the virtual machine and all its storage (the file server) to some older state. Now all the connected applications which _thought_ they had committed some particular piece of work (a spreadsheet, a database transaction) - just had all that work undone. And, depending on the file server protocol and the application, bad things will happen - certainly loss of data, and probably some of the applications will create corrupt data - since updates they thought they'd made are now gone - unbeknownst to them. This corrupt data can cause any number of problems - inability to make further updates, cascading application crashes - these are all possibilities.
Or what if it's a client of a file server? The file server is a separate machine (possibly virtual, possibly real, possibly an appliance). Then you can't put its storage state back to a known state - without restoring all its clients back to the same consistent state - and if you somehow did, then _all_ of them now suffer data loss.
Not a very pretty picture.
There are some few cases where you can isolate the application from the "real world" and snapshot the whole "mini-enterprise" in a synchronous way. Those are mostly limited to large scale scientific applications. Given how hard it is to make them more available in any other way, this is a good thing. But, it's a practice with narrow applicability. After reading the paragraphs above, perhaps you can see why...
Quorum Server Illustrated – updated
In two earlier posts [1] [2], I gave brief descriptions of the quorum server which seem to have left as much confusion as they provided clarity. This post is only about the Linux-HA quorum server, and includes illustrations for clarity.
The Linux-HA Quorum API
In the Linux-HA quorum API, you can configure a number of quorum modules which are used as follows. If a quorum module returns HAVEQUORUM, then the cluster has quorum. If it returns NOQUORUM then the cluster does not have quorum. If a quorum module returns QUORUMTIE, then the next quorum module in the list is consulted. If the final module returns QUORUMTIE, then it is treated as a NOQUORUM event.
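The rule for combining modules is easy to express in code. What follows is a Python illustration of the logic described above, not the actual Linux-HA plugin interface; the function name and the idea of modelling each module as a plain callable are my own assumptions.

```python
HAVEQUORUM = "HAVEQUORUM"
NOQUORUM = "NOQUORUM"
QUORUMTIE = "QUORUMTIE"


def evaluate_quorum(modules):
    """Consult the configured quorum modules in order.

    Each module is a callable returning HAVEQUORUM, NOQUORUM or QUORUMTIE.
    A tie passes the decision on to the next module in the list.
    """
    for module in modules:
        result = module()
        if result != QUORUMTIE:
            return result
    # The final module returned a tie: treat it as a NOQUORUM event.
    # (An empty module list also ends up here; that choice is an assumption.)
    return NOQUORUM


# Voting module sees a tie, the second module breaks it in our favor.
print(evaluate_quorum([lambda: QUORUMTIE, lambda: HAVEQUORUM]))
# Every module reports a tie, so the cluster does not have quorum.
print(evaluate_quorum([lambda: QUORUMTIE, lambda: QUORUMTIE]))
```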
The quorum daemon is normally used in conjunction with the normal arithmetic voting quorum module, so that it is only consulted when the number of nodes currently in the cluster is exactly half the number of configured nodes in the system. So, it is worth noting that the quorum server will never be consulted if a cluster has an odd number of nodes.
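Here is a sketch of what such an arithmetic voting module might compute, and why an odd-sized cluster can never produce a tie. The function name and signature are assumptions for illustration only.

```python
HAVEQUORUM = "HAVEQUORUM"
NOQUORUM = "NOQUORUM"
QUORUMTIE = "QUORUMTIE"


def majority_vote(nodes_up, nodes_configured):
    """Plain arithmetic voting: more than half wins, exactly half is a tie."""
    if 2 * nodes_up > nodes_configured:
        return HAVEQUORUM
    if 2 * nodes_up == nodes_configured:
        return QUORUMTIE      # only now would the quorum server be asked
    return NOQUORUM


print(majority_vote(2, 2))    # everything up
print(majority_vote(1, 2))    # exactly half: tie
print([majority_vote(up, 3) for up in range(4)])   # odd cluster: never a tie
```

Since 2 * nodes_up is always even, it can never equal an odd nodes_configured, which is the arithmetic behind the remark about odd-sized clusters.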
Quorum Server Scenarios
Below, I'll go through the basic quorum server cases so you can see how all this works in more detail - with pictures, even!
Normal Situation - Everything up
In the picture above, everything is normal. The quorum server is up, and both sites are also up. Because the cluster has all its nodes up, the quorum server is irrelevant.
In the situation above, we show the "New Jersey" site as down. In this case, the conventional voting quorum has a tie (1/2 - exactly half of the nodes). In this case the quorum server is consulted. Since only New York is talking to the quorum server, the quorum server grants quorum to the New York site.
In the case above, the link between the sites has been lost, but both sites and the quorum server are all up. In this case, both New York and New Jersey contact the quorum server because each sees 1/2 nodes as being up - resulting in a tie condition.
In this case, the quorum server will choose one of the two sites to provide quorum to, and I assume in this case that New York was chosen. Because New Jersey wasn't granted quorum, it will shut its resources down.
What happens when the quorum server goes down?
That is the situation shown above. Because New York and New Jersey are both up, they have 2/2 votes and both provide service as they should. This illustrates the point that the quorum server is not a single point of failure.
Multiple Failures -> Loss of Service
In this final case, multiple failures have occurred - both New Jersey and the quorum server are down. In this case, New York doesn't have quorum, so it shuts down services and none are provided by any node in the cluster. Of course, this situation can be overridden in the cluster configuration by changing the quorum policy, but from an automated perspective, this is all that can be (should be) done.
Security Concerns
If you want to run your quorum server communications across networks which mig
Availability, MTBF, MTTR and other bedtime tales
If we let A represent availability, then the simplest formula for availability is:
A = Uptime/(Uptime + Downtime)
Of course, it's more interesting when you start looking at the things that influence uptime and downtime. The most common measures that can be used in this way are MTBF and MTTR.
MTBF is Mean Time Between Failures
MTTR is Mean Time To Repair
A = MTBF / (MTBF+MTTR)
One interesting observation you can make when reading this formula is that if you could instantly repair everything (MTTR = 0), then it wouldn't matter what the MTBF is - Availability would be 100% (1) all the time.
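A quick way to get a feel for the formula is to plug in some numbers. The values below are made up for illustration (think of them as hours); only the formula itself comes from the text.

```python
def availability(mtbf, mttr):
    """A = MTBF / (MTBF + MTTR); both arguments in the same time unit."""
    return mtbf / (mtbf + mttr)


# Hypothetical system that fails about every 1000 hours.
for mttr in (10.0, 1.0, 0.01, 0.0):
    print(f"MTTR = {mttr:>5} h  ->  A = {availability(1000.0, mttr):.6f}")
```

As the repair time shrinks, the availability climbs toward 1 no matter how often the system fails.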
That's exactly what HA clustering tries to do. It tries to make the MTTR as close to zero as it can by automatically (autonomically) switching in redundant components for failed components as fast as it can. Depending on the application architecture and how fast failure can be detected and repaired, a given failure might not be observable at all by a client of the service. If it's not observable by the client, then in some sense it didn't happen at all. This idea of viewing things from the client's perspective is an important one in a practical sense, and I'll talk about that some more later on.
It's important to realize that any given data center, or cluster provides many services, and not all of them are related to each other. Failure of one component in the system may not cause failure of the system. Indeed, good HA design eliminates single points of failure by introducing redundancy. If you're going to try and calculate MTBF in a real-life (meaning complex) environment with redundancy and interrelated services, it's going to be very complicated to do.
MTBFx is Mean Time Between Failures for entity x
MTTRx is Mean Time To Repair for entity x
Ax is the Availability of entity x
Ax = MTBFx / (MTBFx+MTTRx)
In practice, these measures (MTBFx and MTTRx) are hard to come by for nontrivial real systems - in fact, they're so tied in to application reliability and architecture, hardware architecture, deployment strategy, operational skill and training, and a whole host of other factors, that you can actually compute them only very very rarely. So, why did I spend your time talking about it? That's simple - although you probably won't compute them, you can learn some important things from these formulas, and you can see how mistakes you make in viewing these formulas might lead you to some wrong conclusions.
Let's get right into one example of a wrong conclusion you might draw from incorrectly applying these formulas.
Let's say we have a service which runs on a single machine, which you put onto a cluster composed of two computers, each with a certain individual MTBF (Mi), and you can fail over to the other computer ("repair" a computer) in a certain repair time (Ri). With two computers, they'll fail twice as often as a single computer, so the system MTBF becomes Mi/2. If you compute the availability of the cluster, it then becomes:
A = (Mi/2) / (Mi/2 + Ri)
Using this (incorrect) analysis for a 1000 node cluster performing the same service, the system MTBF becomes Mi/1000.
A = (Mi/1000) / (Mi/1000 + Ri)
If you take the number of nodes in the cluster to the limit (approaching infinity), the Availability approaches zero.
A = 0/(0+Ri) = 0/Ri = 0
This makes it appear that adding cluster nodes decreases availability. Is this really true? Of course not! The mistake here is thinking that the service needed all those cluster nodes to make it go. If your service was a complicated interlocking scientific computation that would stop if any cluster node failed, then this model might be correct. But if the other nodes were providing redundancy or unrelated services, then they would have no effect on MTBF of the service in question. Of course, as they break, you'd have to repair them, which would mean replacing systems more and more often, which would be both annoying and expensive, but it wouldn't cause the service availability to go down.
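To make the flawed reasoning concrete, here is the naive model as code. The node MTBF of 10000 hours and the failover time of 0.1 hours are invented numbers, and the function name is my own; the point is only the shape of the result.

```python
def naive_cluster_availability(node_mtbf, repair_time, nodes):
    """The INCORRECT model: every node failure counts as a service failure."""
    system_mtbf = node_mtbf / nodes
    return system_mtbf / (system_mtbf + repair_time)


MI = 10000.0   # hypothetical per-node MTBF, in hours
RI = 0.1       # hypothetical failover ("repair") time, in hours

for n in (1, 2, 1000, 10**6):
    print(f"{n:>8} nodes  ->  A = {naive_cluster_availability(MI, RI, n):.6f}")
```

The numbers drop steadily as nodes are added, which is exactly the wrong conclusion: this model is only right when the service really does stop whenever any one node fails.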
To properly apply these formulas, even intuitively, you need to make sure you understand what your service is, how you define a failure, how the service components relate to each other, and what happens when one of them fails. Here are a few rules of thumb for thinking about availability:
- Complexity is the enemy of reliability (MTBF). This can take many forms:
  - Complex software fails more often than simple software
  - Complex hardware fails more often than simple hardware
  - Software dependencies usually mean that if any component fails, the whole service fails
  - Configuration complexity lowers the chances of the configuration being correct
  - Complexity drastically increases the possibility of human error
  - What is complex software? - Software whose model of the universe doesn't match that of the staff who manage it.
- Redundancy is the friend of availability - it allows for quick autonomic recovery - significantly improving MTTR. Replication is another word for redundancy.
- Good failure detection is vital - HA and other autonomic software can only recover from failures it detects. Undetected failures have human-speed MTTR or worse, not autonomic-speed MTTR. They can be worse than human-speed MTTR because the humans are surprised that it wasn't automatically recovered and they respond more slowly than normal. In addition, the added complexity of correcting an autonomic service and trying to keep their fingers out of the gears may slow down their thought processes.
- Non-essential components don't count - failure of inactive or non-essential components doesn't affect service availability. These inactive components can be hardware (spare machines), or software (like administrative interfaces), or hardware only being used to run non-essential software. More generally, for the purpose of calculating the availability of service X, non-essential components include anything not running service X or services essential to X.
The real world is much more complex than any simple rules of thumb like these, but these are certainly worth taking into account.


