Category → quorum
A Complete Cluster Stack for Linux
Recently, I've had some folks ask me offline what exactly would a “complete” Linux cluster stack look like. That's a good question, and this posting is intended to address that question.
So let's start with – what kind of cluster? For the purposes of this posting, I'm primarily talking about a full-function high-availability enterprise-style cluster, not primarily a load balancing cluster, and not a high-performance scientific (Beowulf-style) cluster.
A few caveats before proceeding – much of what I'll reference below will be relative to the Linux-HA[1] framework, but the concepts are easily translated to any other clustering framework one might have in mind.
It's also worth noting that not every application, nor every configuration needs every component. Adding unnecessary components adds complexity, and complexity is the enemy of reliability.
Many of the components (cluster filesystem, DLM) are primarily needed by cluster-aware applications. Note that at this time (early 2008) very few applications are cluster-aware.
The Full Cluster Stack Exposed!
Below is a picture of the full cluster stack – which I'll describe in more detail later. For the most part, the components higher up in the picture build on the components lower down in the picture. To simplify the drawing, I didn't add all the who-uses-whom lines that one might want to make a detailed study of this subject.
Cluster Comm - Intracluster communications
The most
basic component any cluster needs is intracluster communications.
There are a variety of different possibilities, but guaranteed packet
delivery is a requirement. Linux-HA has its own custom comm layer
for doing this. It's not perfect, but it works. At one time in the
past, we provided support for the AIS cluster APIs, and if you use
OpenAIS today, then you can still have a reasonable cluster using
Linux-HA and providing compatible support for the AIS protocols. As
will become clear, it's not a perfect configuration, but it's a
reasonable one. (Of course, like everything, it can always be improved even more)
Very large clusters (hundreds to tens of thousands of nodes) will likely require a different communication protocol, since most guaranteed delivery multicast protocols don't scale that high.
Nevertheless, in an ideal world, all cluster components and cluster-aware applications would sit on top of the same set of communications protocols.
Membership – who's in the cluster?
Looking to the right of the cluster comm box on our architecture chart, you'll see the membership box. The next basic function that a cluster has to provide is membership services. Membership closely related to communication – since a simplistic view of membership is just who we can communicate with. It is highly desirable that everyone in the cluster be able to communicate with everyone else. It's the job of the membership layer to provide this information to the cluster.
When your communication fails in weird ways, it's the job of the membership layer to present a view of the cluster that makes sense – in spite of the weird kinds of failures that might be going on.
If we eventually wind up with multiple kinds of communications methods, then we'll also have multiple ways of becoming a member.
Linux-HA (with or without OpenAIS) supports the AIS membership APIs.
Like I
mentioned for communication, in an ideal world, all cluster
components would use exactly the same membership information. However, it is important to note that the membership one uses must be computed using the communication method being used by the application. So, unless every cluster-aware application uses both the common communication method and the common membership, it risks getting its membership out of sync with respect to its communication and other components using other communication methods. In many cases, this can't be avoided. Methods for coping with such discrepencies are discussed in more detail at the end of this post.
Fencing
Fencing is the ability to “disable” nodes not currently in our membership without their cooperation Many of you will remember having discussed this in some detail in an earlier post[2]. As I explained in more detail there, fencing is vital to ensure safe cluster operation
Our current implementation is STONITH[3]-based - STONITH == Shoot The Other Node In The Head
Quorum
Quorum is a topic we've talked about extensively in a couple of earlier posts [2] [4]. Quorum encapsulates the ability to determine whether the cluster can continue to operate safely or not
Quorum is tied closely to both fencing and membership. In practice, as we discussed before, it is often highly desirable to implement multiple types of quorum. Linux-HA currently provides multiple implementations and can provide more through plugins. Like membership and communication, it is desirable for all cluster components to use the same quorum mechanism. All the interesting and legal ways that quorum can interact with fencing and membership and the communication layer are too detailed for this posting.
Cluster Filesystems
Cluster filesystems allow multiple machines to sanely mount the same filesystem at the same time. This is a great boon to parallel applications. Cluster filesystems typically don't use the network or another server involved when doing bulk I/O. Each node mounting the filesystem is normally expected to have access to the data. This typically requires a SAN.
Typically, this is done to improve performance, but convenience and manageability are common secondary goals Cluster filesystems are related to, but distinct from, network filesystems like NFS and CIFS. On Linux, there are several cluster filesystems available including
Red Hat's GFS
Oracle's OCFS2
IBM's GPFS
etc.
Normally, when they're being used for performance reasons, cluster-aware applications are required. You can't typically just run 'n' copies of your favorite cluster-ignorant application and have it work. The filesystem won't scramble the data, but your application typically will. It goes without saying that high-performance cluster filesystems run in the kernel – unlike all the other items we've talked about before. Because of the high-performance, it's common for a cluster filesystem to have its own communication and membership code – not using the typically userland communications and membership code. Since membership isn't high bandwidth or really low latency information, it is possible to feed membership from a user-space membership layer into the kernel. Of course, then the membership and the communications layer are out of sync. It is arguable whether this is an improvement or not.
Cluster filesystems typically need cluster lock managers – described in the next section.
DLM - Distributed Lock Manager
A DLM (Distributed Lock Manager) provides locking services across the cluster, and it's an interesting piece of code to implement them – particularly the error recovery.
To some degree, DLMs are analogous to System V semaphores but – cluster-aware. In addition, they provide much more sophisticated API and semantics. Although DLM APIs are fairly well understood, there is no formal standard, so switching from one to another can be annoying. Red Hat has a reasonable kernel-based DLM which they use with GFS. DLMs commonly have their own separate communications and membership code. The comments about getting membership from user-space and having them be potentially different from cluster filesytems also apply here.
Cluster Volume Managers
You might think that you really don't need a cluster-aware volume manager. Sometimes you might be right. More often, if you thought that, you'd be wrong. A cluster volume manager is just like a regular volume manager – only cluster aware. This is to keep different nodes from getting inconsistent views of the layout of a set of disks or volumes. The current cluster-aware volume managers are EVMS and CLVM. Only CLVM is expected to survive into the long term.
The big challenges for cluster volume managers are high-performance mirroring and snapshots. These operations are potentially very difficult to implement right and fast. Cluster-aware volume managers often have both kernel and user-space components. The membership inconsistency issues here are similar to those for cluster filesystems and the DLM.
CRM - Cluster Resource Manager
Every HA cluster has something like a CRM, but they may divide up these functions differently. Our CRM is a policy-based decision maker for what should run where – handling failed services and failed cluster nodes.
The CRM is similar to UNIX/Linux startup init scripts – it starts everything up – but across a cluster following some policies, and managing failures.
The Linux-HA CRM is arguably the best cluster resource manager around today – at least in terms of flexibility and power. It has usability issues, and can be extended, but those are solvable.
The Linux-HA CRM function is largely divided between the PE and TE – which are described below.
PE - Policy Engine
The Policy Engine is a key component of the CRM and does two distinct things.
It determines what should run where (cluster layout)
It creates a graph of actions of how to get from the current state of affairs to the new desired state
This graph of actions is then given to the TE (described below).
The system would have more flexibility if he PE were split into two parts for these two functions, and supported plugins for the cluster layout function.
It currently isn't aware of resource cost, nor of absolute resource limits and load balancing considerations, which complicate optimal placement. Those would be good things to add to it in the future. Having plugins for doing resource placement would also be a highly useful and desirable thing.
TE - Transition Engine
Receives a graph of actions to perform from the policy engine, then uses the LRM proxy to communicate with the LRMs to carry out the actions
Its main jobs are action sequencing, error detection and reporting
CIB - Cluster Information Base
The CIB manages information on cluster configuration and current status. The cluster configuration includes the configuration and policies as defined by the system administrator.
Its key difficulty is to keep a consistent copy replicated across the cluster, resolving potential version differences.
All the data it manages is XML, and the CIB has a minimal knowledge of the structure of this XML.
LRM - Local Resource Manager
In the Linux-HA architecture, a local resource manager runs on every machine and carries out the tasks given to it. Everything that gets done gets carried out by the LRM. Examples are:
start this resource
stop this resource
monitor this resource
migrate this resource
etc.
The LRM provides interface matching to the various kinds of resources through Resource Agents. The Linux-HA LRM supports several classes of Resource Agents.
The LRM is not at all cluster-aware. It can support an arbitrary number of clients, one of which is the LRM communications proxy (below).
LRM Communications Proxy
The LRM proxy communicates between the CRM and the LRMs on all the various machines. This function is currently built into the CRM. This architectural decision was based on expedience more than anything else.
To support larger clusters this needs to be separated out, made more scalable, and more flexible. This would allow a large number of LRMs to be supported by a small number of LRM proxies. In large systems, this would probably use the ClusterIP capability to provide load distribution (leveling) across multiple LRM proxies.
Init - Initialization and recovery
This code does really three things:
Sequences the startup of the cluster components
Recovers from component failures (restart or reboot)
Sequences the shutdown of all the various cluster components
This is currently provided by Linux-HA and bundled with the Linux-HA communications code. This likely needs to be separated out to a separate proxy function (process) in the future.
Infrastructure
The Linux-HA infrastructure libraries (“clplumbing”) does a wide variety of things. A few samples include:
Inter-Process Communication
Process management
Event management and scheduling
Many other miscellaneous functions
Surprisingly, these libraries amount to about 20K lines of code.
Quorum Daemon
The quorum daemon is an unusual daemon, because it's the only daemon we have that's intended to run outside the cluster proper. It is instrumental in solving certain knotty quorum problems – especially for:
2-node clusters (very common)
Split-site (disaster recovery) clusters
This was discussed extensively in previous postings [4] and [5].
Management Daemon
Provides Complete Authenticated Configuration and Status API. This includes both information contained in the CIB, and also information about the communications configuration and so on.
The management daemon is used by the:
GUI
SNMP agent
CIM agent
Clients are authenticated using PAM, and all communications is via SSL, so its clients can safely be outside cluster, or even outside a firewall. This daemon should provide different levels of authorization depending on the authenticated user, and should log its actions in a format suitable for Sarbanes-Oxley (SOX) auditing purposes.
GUI
Provides a Graphical User Interface providing configuration and status information. It also supports creating and configuring the cluster. Note that at the present time, there are a number of useful cluster configurations it cannot create.
CIM and SNMP agents
The CIM and SNMP agents provide CIM and SNMP management interfaces for systems management tools. The CIM interface supports status updates and configuration changes, whereas the SNMP interfaces only report status.
Disadvantages of this architecture
For a
variety of reasons, kernel space doesn't have access to user-space
cluster communications or membership.
As a result, both the DLM and most cluster filesytems implements their own membership and communications.
This is in contradiction to the “ideal world” statements earlier. This can result in some odd cases where one communication method is working in a particular case, but another method is not. This results in differences in membership – which can have bad effects.
Why this might not be quite as bad as it seems
One reason why one might not worry about this as much as one might, is because it's a problem which one can't make go away. A cluster system will always have to interface with software packages which do their own communication, and compute their own membership for a variety of usually good reasons. As a result, this is a problem which we can't make go away. Instead we have to deal with it effectively. There are basically two cases to consider:
The “Main” membership thinks that node X should not be in the cluster, whereas the “Other” membership thinks it should be.
The “Other” membership thinks that node X should not be in the cluster, whereas the “Main” membership thinks it should be.
Let's take these two cases one at a time:
Case 1:
If the main membership thinks node X is not in the cluster, then it will simply not start any resources on node X. This takes care of the problem.
Case 2:
If the “Other” membership discovers that a particular node should be dropped from its view of membership, and it can inform the CRM not to start its resources on that machine, then the local view of this membership from the perspective of the resources it deals with is effectively made to exclude these Other-errant nodes. In the Linux-HA CRM this is easily done having the Other-resources write node attributes to cause those nodes to be excluded, and the rules would then be written to exclude those nodes from consideration for running Other-related resources.
Although Case 2 isn't pretty, it works, and no amount of wishing and hoping is likely to ever make this kind of problem go away in the general case - particularly when one involves proprietary applications So, even if there is some membership discrepancy, it can is always possible to manage it appropriately assuming you can get a tiny bit of cooperation from the application.
References
[1]
http://linux-ha.org/
[2]
http://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html
[3]
http://linux-ha.org/STONITH
[4]
http://techthoughts.typepad.com/managing_computers/2007/10/more-about-quor.html
[5]
http://techthoughts.typepad.com/managing_computers/2007/11/quorum-server-i.html
A Complete Cluster Stack for Linux
Recently, I've had some folks ask me offline what exactly would a “complete” Linux cluster stack look like. That's a good question, and this posting is intended to address that question.
So let's start with – what kind of cluster? For the purposes of this posting, I'm primarily talking about a full-function high-availability enterprise-style cluster, not primarily a load balancing cluster, and not a high-performance scientific (Beowulf-style) cluster.
A few caveats before proceeding – much of what I'll reference below will be relative to the Linux-HA[1] framework, but the concepts are easily translated to any other clustering framework one might have in mind.
It's also worth noting that not every application, nor every configuration needs every component. Adding unnecessary components adds complexity, and complexity is the enemy of reliability.
Many of the components (cluster filesystem, DLM) are primarily needed by cluster-aware applications. Note that at this time (early 2008) very few applications are cluster-aware.
The Full Cluster Stack Exposed!
Below is a picture of the full cluster stack – which I'll describe in more detail later. For the most part, the components higher up in the picture build on the components lower down in the picture. To simplify the drawing, I didn't add all the who-uses-whom lines that one might want to make a detailed study of this subject.
Cluster Comm - Intracluster communications
The most
basic component any cluster needs is intracluster communications.
There are a variety of different possibilities, but guaranteed packet
delivery is a requirement. Linux-HA has its own custom comm layer
for doing this. It's not perfect, but it works. At one time in the
past, we provided support for the AIS cluster APIs, and if you use
OpenAIS today, then you can still have a reasonable cluster using
Linux-HA and providing compatible support for the AIS protocols. As
will become clear, it's not a perfect configuration, but it's a
reasonable one. (Of course, like everything, it can always be improved even more)
Very large clusters (hundreds to tens of thousands of nodes) will likely require a different communication protocol, since most guaranteed delivery multicast protocols don't scale that high.
Nevertheless, in an ideal world, all cluster components and cluster-aware applications would sit on top of the same set of communications protocols.
Membership – who's in the cluster?
Looking to the right of the cluster comm box on our architecture chart, you'll see the membership box. The next basic function that a cluster has to provide is membership services. Membership closely related to communication – since a simplistic view of membership is just who we can communicate with. It is highly desirable that everyone in the cluster be able to communicate with everyone else. It's the job of the membership layer to provide this information to the cluster.
When your communication fails in weird ways, it's the job of the membership layer to present a view of the cluster that makes sense – in spite of the weird kinds of failures that might be going on.
If we eventually wind up with multiple kinds of communications methods, then we'll also have multiple ways of becoming a member.
Linux-HA (with or without OpenAIS) supports the AIS membership APIs.
Like I
mentioned for communication, in an ideal world, all cluster
components would use exactly the same membership information. However, it is important to note that the membership one uses must be computed using the communication method being used by the application. So, unless every cluster-aware application uses both the common communication method and the common membership, it risks getting its membership out of sync with respect to its communication and other components using other communication methods. In many cases, this can't be avoided. Methods for coping with such discrepencies are discussed in more detail at the end of this post.
Fencing
Fencing is the ability to “disable” nodes not currently in our membership without their cooperation Many of you will remember having discussed this in some detail in an earlier post[2]. As I explained in more detail there, fencing is vital to ensure safe cluster operation
Our current implementation is STONITH[3]-based - STONITH == Shoot The Other Node In The Head
Quorum
Quorum is a topic we've talked about extensively in a couple of earlier posts [2] [4]. Quorum encapsulates the ability to determine whether the cluster can continue to operate safely or not
Quorum is tied closely to both fencing and membership. In practice, as we discussed before, it is often highly desirable to implement multiple types of quorum. Linux-HA currently provides multiple implementations and can provide more through plugins. Like membership and communication, it is desirable for all cluster components to use the same quorum mechanism. All the interesting and legal ways that quorum can interact with fencing and membership and the communication layer are too detailed for this posting.
Cluster Filesystems
Cluster filesystems allow multiple machines to sanely mount the same filesystem at the same time. This is a great boon to parallel applications. Cluster filesystems typically don't use the network or another server involved when doing bulk I/O. Each node mounting the filesystem is normally expected to have access to the data. This typically requires a SAN.
Typically, this is done to improve performance, but convenience and manageability are common secondary goals Cluster filesystems are related to, but distinct from, network filesystems like NFS and CIFS. On Linux, there are several cluster filesystems available including
Red Hat's GFS
Oracle's OCFS2
IBM's GPFS
etc.
Normally, when they're being used for performance reasons, cluster-aware applications are required. You can't typically just run 'n' copies of your favorite cluster-ignorant application and have it work. The filesystem won't scramble the data, but your application typically will. It goes without saying that high-performance cluster filesystems run in the kernel – unlike all the other items we've talked about before. Because of the high-performance, it's common for a cluster filesystem to have its own communication and membership code – not using the typically userland communications and membership code. Since membership isn't high bandwidth or really low latency information, it is possible to feed membership from a user-space membership layer into the kernel. Of course, then the membership and the communications layer are out of sync. It is arguable whether this is an improvement or not.
Cluster filesystems typically need cluster lock managers – described in the next section.
DLM - Distributed Lock Manager
A DLM (Distributed Lock Manager) provides locking services across the cluster, and it's an interesting piece of code to implement them – particularly the error recovery.
To some degree, DLMs are analogous to System V semaphores but – cluster-aware. In addition, they provide much more sophisticated API and semantics. Although DLM APIs are fairly well understood, there is no formal standard, so switching from one to another can be annoying. Red Hat has a reasonable kernel-based DLM which they use with GFS. DLMs commonly have their own separate communications and membership code. The comments about getting membership from user-space and having them be potentially different from cluster filesytems also apply here.
Cluster Volume Managers
You might think that you really don't need a cluster-aware volume manager. Sometimes you might be right. More often, if you thought that, you'd be wrong. A cluster volume manager is just like a regular volume manager – only cluster aware. This is to keep different nodes from getting inconsistent views of the layout of a set of disks or volumes. The current cluster-aware volume managers are EVMS and CLVM. Only CLVM is expected to survive into the long term.
The big challenges for cluster volume managers are high-performance mirroring and snapshots. These operations are potentially very difficult to implement right and fast. Cluster-aware volume managers often have both kernel and user-space components. The membership inconsistency issues here are similar to those for cluster filesystems and the DLM.
CRM - Cluster Resource Manager
Every HA cluster has something like a CRM, but they may divide up these functions differently. Our CRM is a policy-based decision maker for what should run where – handling failed services and failed cluster nodes.
The CRM is similar to UNIX/Linux startup init scripts – it starts everything up – but across a cluster following some policies, and managing failures.
The Linux-HA CRM is arguably the best cluster resource manager around today – at least in terms of flexibility and power. It has usability issues, and can be extended, but those are solvable.
The Linux-HA CRM function is largely divided between the PE and TE – which are described below.
PE - Policy Engine
The Policy Engine is a key component of the CRM and does two distinct things.
It determines what should run where (cluster layout)
It creates a graph of actions of how to get from the current state of affairs to the new desired state
This graph of actions is then given to the TE (described below).
The system would have more flexibility if he PE were split into two parts for these two functions, and supported plugins for the cluster layout function.
It currently isn't aware of resource cost, nor of absolute resource limits and load balancing considerations, which complicate optimal placement. Those would be good things to add to it in the future. Having plugins for doing resource placement would also be a highly useful and desirable thing.
TE - Transition Engine
Receives a graph of actions to perform from the policy engine, then uses the LRM proxy to communicate with the LRMs to carry out the actions
Its main jobs are action sequencing, error detection and reporting
CIB - Cluster Information Base
The CIB manages information on cluster configuration and current status. The cluster configuration includes the configuration and policies as defined by the system administrator.
Its key difficulty is to keep a consistent copy replicated across the cluster, resolving potential version differences.
All the data it manages is XML, and the CIB has a minimal knowledge of the structure of this XML.
LRM - Local Resource Manager
In the Linux-HA architecture, a local resource manager runs on every machine and carries out the tasks given to it. Everything that gets done gets carried out by the LRM. Examples are:
start this resource
stop this resource
monitor this resource
migrate this resource
etc.
The LRM provides interface matching to the various kinds of resources through Resource Agents. The Linux-HA LRM supports several classes of Resource Agents.
The LRM is not at all cluster-aware. It can support an arbitrary number of clients, one of which is the LRM communications proxy (below).
LRM Communications Proxy
The LRM proxy communicates between the CRM and the LRMs on all the various machines. This function is currently built into the CRM. This architectural decision was based on expedience more than anything else.
To support larger clusters this needs to be separated out, made more scalable, and more flexible. This would allow a large number of LRMs to be supported by a small number of LRM proxies. In large systems, this would probably use the ClusterIP capability to provide load distribution (leveling) across multiple LRM proxies.
Init - Initialization and recovery
This code does really three things:
Sequences the startup of the cluster components
Recovers from component failures (restart or reboot)
Sequences the shutdown of all the various cluster components
This is currently provided by Linux-HA and bundled with the Linux-HA communications code. This likely needs to be separated out to a separate proxy function (process) in the future.
Infrastructure
The Linux-HA infrastructure libraries (“clplumbing”) does a wide variety of things. A few samples include:
Inter-Process Communication
Process management
Event management and scheduling
Many other miscellaneous functions
Surprisingly, these libraries amount to about 20K lines of code.
Quorum Daemon
The quorum daemon is an unusual daemon, because it's the only daemon we have that's intended to run outside the cluster proper. It is instrumental in solving certain knotty quorum problems – especially for:
2-node clusters (very common)
Split-site (disaster recovery) clusters
This was discussed extensively in previous postings [4] and [5].
Management Daemon
Provides Complete Authenticated Configuration and Status API. This includes both information contained in the CIB, and also information about the communications configuration and so on.
The management daemon is used by the:
GUI
SNMP agent
CIM agent
Clients are authenticated using PAM, and all communications is via SSL, so its clients can safely be outside cluster, or even outside a firewall. This daemon should provide different levels of authorization depending on the authenticated user, and should log its actions in a format suitable for Sarbanes-Oxley (SOX) auditing purposes.
GUI
Provides a Graphical User Interface providing configuration and status information. It also supports creating and configuring the cluster. Note that at the present time, there are a number of useful cluster configurations it cannot create.
CIM and SNMP agents
The CIM and SNMP agents provide CIM and SNMP management interfaces for systems management tools. The CIM interface supports status updates and configuration changes, whereas the SNMP interfaces only report status.
Disadvantages of this architecture
For a
variety of reasons, kernel space doesn't have access to user-space
cluster communications or membership.
As a result, both the DLM and most cluster filesytems implements their own membership and communications.
This is in contradiction to the “ideal world” statements earlier. This can result in some odd cases where one communication method is working in a particular case, but another method is not. This results in differences in membership – which can have bad effects.
Why this might not be quite as bad as it seems
One reason why one might not worry about this as much as one might, is because it's a problem which one can't make go away. A cluster system will always have to interface with software packages which do their own communication, and compute their own membership for a variety of usually good reasons. As a result, this is a problem which we can't make go away. Instead we have to deal with it effectively. There are basically two cases to consider:
The “Main” membership thinks that node X should not be in the cluster, whereas the “Other” membership thinks it should be.
The “Other” membership thinks that node X should not be in the cluster, whereas the “Main” membership thinks it should be.
Let's take these two cases one at a time:
Case 1:
If the main membership thinks node X is not in the cluster, then it will simply not start any resources on node X. This takes care of the problem.
Case 2:
If the “Other” membership discovers that a particular node should be dropped from its view of membership, and it can inform the CRM not to start its resources on that machine, then the local view of this membership from the perspective of the resources it deals with is effectively made to exclude these Other-errant nodes. In the Linux-HA CRM this is easily done having the Other-resources write node attributes to cause those nodes to be excluded, and the rules would then be written to exclude those nodes from consideration for running Other-related resources.
Although Case 2 isn't pretty, it works, and no amount of wishing and hoping is likely to ever make this kind of problem go away in the general case - particularly when one involves proprietary applications So, even if there is some membership discrepancy, it can is always possible to manage it appropriately assuming you can get a tiny bit of cooperation from the application.
References
[1]
http://linux-ha.org/
[2]
http://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html
[3]
http://linux-ha.org/STONITH
[4]
http://techthoughts.typepad.com/managing_computers/2007/10/more-about-quor.html
[5]
http://techthoughts.typepad.com/managing_computers/2007/11/quorum-server-i.html
Quorum Server Illustrated – updated
In two earlier posts [1] [2], I gave brief descriptions of the quorum server which seem to have left as much confusion as they provided clarity. This post is only about the Linux-HA quorum server, and includes illustrations for clarity.
The Linux-HA Quorum API
In the Linux-HA quorum API, you can configure a number of quorum modules which are used as follows. If a quorum module returns HAVEQUORUM, then the cluster has quorum. If it returns NOQUORUM then the cluster does not have quorum. If a quorum module returns QUORUMTIE, then the next quorum module in the list is consulted. If the final module returns QUORUMTIE, then it is treated as a NOQUORUM event.
The quorum daemon is normally used in conjunction with the nomal arithmetic voting quorum module, so that it is only consulted when the number of nodes in the cluster is exactly half the number of configured modules in the system. So, it is worth noting that the quorum server will never be consulted if a cluster has an odd number of nodes.
Quorum Server Scenarios
Below, I'll go through the basic quorum server cases so you can see how all this works in more detail - with pictures, even!
Normal Situation - Everything up
In the picture above, everything is normal. The quorum server is up, and both sites are also up. Because the cluster has all its nodes up, the quorum server is irrelevant.
In the situation above, we show the "New Jersey" site as down. In this case, the conventional voting quorum has a tie (1/2 - exactly half of the nodes). In this case the quourm server is consulted. Since only New York is talking to the quorum server, the quorum server grants quorum to the New York site.
In the case above, the link between the sites has been lost, but both sites and the quorum server are all up. In this case, both New York and New Jersey contact the quorum server because each sees 1/2 nodes as being up - resulting in a tie condition.
In this case, the quorum server will choose one of the two sites to provide quorum to, and I assume in this case that New York was chosen. Because New Jersey wasn't granted quorum, it will shut its resources down.
What happens when the quorum server goes down?
That is the situation shown above. Because New York and New Jersey are both up, they have 2/2 votes and both provide service as they should. This illustrates the point that the quorum server is not a single point of failure.
Multiple Failures -> Loss of Service
In this final case, multiple failures have occurred - both New Jersey and the quorum server are down. In this case, New York doesn't have quorum, so it shuts down services and none are provide by any node in the cluster. Of course, this situation can be overridden in the cluster configuration by changing the quorum policy, but from an automated perspective, this is all that can be (should be) done.
Security Concerns
If you want to run your quorum server communications across networks which mig
Availability, MTBF, MTTR and other bedtime tales
If we let A represent availability, then the simplest formula for availability is:
A = Uptime/(Uptime + Downtime)
Of course, it's more interesting when you start looking at the things that influence uptime and downtime. The most common measures that can be used in this way are MTBF and MTTR.
MTBF is Mean Time Between Failures
MTTR is Mean Time To Repair
A = MTBF / (MTBF+MTTR)
One interesting observation you can make when reading this formula is that if you could instantly repair everything (MTTR = 0), then it wouldn't matter what the MTBF is - Availability would be 100% (1) all the time.
That's exactly what HA clustering tries to do. It tries to make the MTTR as close to zero as it can by automatically (autonomically) switching in redundant components for failed components as fast as it can. Depending on the application architecture and how fast failure can be detected and repaired, a given failure might not be observable by at all by a client of the service. If it's not observable by the client, then in some sense it didn't happen at all. This idea of viewing things from the client's perspective is an important one in a practical sense, and I'll talk about that some more later on.
It's important to realize that any given data center, or cluster provides many services, and not all of them are related to each other. Failure of one component in the system may not cause failure of the system. Indeed, good HA design eliminates single points of failure by introducing redundancy. If you're going to try and calculate MTBF in a real-life (meaning complex) environment with redundancy and interrelated services, it's going to be very complicated to do.
MTBFx is Mean Time Between Failures for entity x
MTTRx is Mean Time To Repair for entity x
Ax is the Availability of entity x
Ax = MTBFx / (MTBFx+MTTRx)
In practice, these measures (MTBFx and MTTRx) are hard to come by for nontrivial real systems - in fact, they're so tied in to application reliability and architecture, hardware architecture, deployment strategy, operational skill and training, and a whole host of other factors, that you can actually compute them only very very rarely. So, why did I spend your time talking about it? That's simple - although you probably won't compute them, you can learn some important things from these formulas, and you can see how mistakes you make in viewing these formulas might lead you to some wrong conclusions.
Let's get right into one example of a wrong conclusion you might draw from incorrectly applying these formulas.
Let's say we have a service which runs on a single machine, which you put onto a cluster composed of two computers with a certain individual MTBF (Mi) and you can fail over to the other computer ("repair") a computer in a certain repair time (Ri). With two computers, they'll fail twice as often as a single computer, so the system MTBF becomes Mi/2. If you compute the availability of the cluster, it then becomes:
A = Mi/2 / (Mi/2+Ri)
Using this (incorrect) analysis for a 1000 node cluster performing the same service, the system MTBF becomes Mi/1000.
A = Mi/1000 / (Mi/1000+Ri)
If you take the number of nodes in the cluster to the limit (approaching infinity), the Availability approaches zero.
A = 0/(0+Ri) = 0/Ri = 0
This makes it appear that adding cluster nodes decreases availability. Is this really true? Of course not! The mistake here is thinking that the service needed all those cluster nodes to make it go. If your service was a complicated interlocking scientific computation that would stop if any cluster node failed, then this model might be correct. But if the other nodes were providing redundancy or unrelated services, then they would have no effect on MTBF of the service in question. Of course, as they break, you'd have to repair them, which would mean replacing systems more and more often, which would be both annoying and expensive, but it wouldn't cause the service availability to go down.
To properly apply these formulas, even intuitively, you need to make sure you understand what your service is, how you define a failure, how the service components relate to each other, and what happens when one of them fails. Here are a few rules of thumb for thinking about availability
- Complexity is the enemy of reliability (MTTR). This can take many forms
- Complex software fails more often than simple software
- Complex hardware fails more often than simple hardware
- Software dependencies usually mean that if any component fails, the whole service fails
- Configuration complexity lowers the chances of the configuration being correct
- Complexity drastically increases the possibility of human error
- What is complex software? - Software whose model of the universe doesn't match that of the staff who manage it.
- Redundancy is the friend of availability - it allows for quick autonomic recovery - significantly improving MTTR. Replication is another word for redundancy.
- Good failure detection is vital - HA and other autonomic software can only recover from failures it detects. Undetected failures have human-speed MTTR or worse, not autonomic-speed MTTR. They can be worse than human-speed MTTR because the humans are surprised that it wasn't automatically recovered and they respond more slowly than normal. In addition, the added complexity of correcting an autonomic service and trying to keep their fingers out of the gears may slow down their thought processes.
- Non-essential components don't count - failure of inactive or non-essential components doesn't affect service availability. These inactive components can be hardware (spare machines), or software (like administrative interfaces), or hardware only being used to run non-essential software. More generally, for the purpose of calculating the availability of service X, non-essential components include anything not running service X or services essential to X.
The real world is much more complex than any simple rules of thumb like these, but these are certainly worth taking into account.



