Converged Networking: Fabric Fault Tolerance and High Availability

Mike Lyons
Sep 22, 2020 · 7 min read


Mike Lyons, FBCS, Distinguished Engineer and Master Inventor, Worldwide Client Technology Engagement, GTS Infrastructure Services, IBM

, Executive Architect, IS Offerings and CTO, GTS Delivery & Integrated Operations, IBM

Fabric Fault Tolerance and High Availability

The term “High Availability” is a big part of any IT services conversation, and unfortunately it is often assumed to be easy to achieve when in reality it is not well understood. The issue stems from the language: “High” is both a relative and an emotive term, and consequently it is meaningless from an Architectural, Operational or Engineering viewpoint. When we set out to build a data centre fabric that meets such a requirement, we need to be clear about the terminology and the specifications. A better concept to work with is “Fault Tolerance”.

Many sites are willing to absorb a small amount of downtime with high availability rather than pay the much higher cost of providing fault tolerance. The difference between fault tolerance and high availability is this: a fault-tolerant environment has no service interruption but a significantly higher cost, while a highly available environment has a minimal service interruption (High Availability versus Fault Tolerance, n.d.).

Since a key objective of a data centre is to provide continuous availability for the applications that it hosts, it is important to understand the definition of availability and the operational assumptions that underpin it.

Figure 1: Applications A/A and A/P Deployment (Architecting Highly Available Cloud Solutions — IBM Garage Practices, n.d.)

Firstly, any complex system with multiple components will eventually fail if enough components fail and are not repaired. Preventative maintenance is like fixing your spare tyre: ideally you do it before you find yourself sitting by the road with a flat, thinking “I was meaning to get that spare tyre fixed.”

To design for High Availability is, in principle, to design for Fault Tolerance. Our hypothetical leaf-spine architecture uses the following design principles:

- The system will have no single point of failure (SPOF), which is to say a fault in a single component will not cause the system to fail.

- Any single component can be repaired and reactivated without disruption to the overall system.

- The repair time of a failed component is short enough so the likelihood of any other component failing while the first is repaired and reactivated is very low.

This last point is the critical one. In well-designed complex systems it is possible to predict how the failure of any given single component will impact the system at large, but it becomes challenging to predict how any two or three failures might impact the system. If the number of components in the system is large, the sheer number of combinations of possible compound failures becomes impractical to model.

Statistics play a big part in the method we use to design for fault tolerance. We need to select components for our system with a known probability of failure so we can plan how we use them and how we repair or replace them in the event they do fail.

Mean Time Between Failures and how it affects Availability

The reliability of the system is an aggregate of individual component MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair) values. Critically, the repair time has an operational element that assumes spare parts are close at hand and repairs are done without delay.

The availability of a system is often defined with a contractual Service Level Agreement or SLA, such as 99.99%, which is usually the proportion of uptime in a given time period, typically a month in most IT services contracts. Presuming that no component outages occur in a system, it will deliver 100% for that month. Systems with better than 99% uptime are considered fault tolerant. As the availability percentage approaches 100%, you move into high availability networks. The closer you get to 100% uptime, the more expensive this availability gets (Planning for Network Availability, n.d.).
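
To put these percentages into concrete terms, here is a minimal sketch (in Python, and assuming the 30-day, 720-hour month used later in this article) of the downtime budget each SLA level allows:

```python
# Illustration only: downtime allowed per month by a given availability SLA,
# assuming a 30-day (720-hour) month.
HOURS_PER_MONTH = 30 * 24

for sla in (0.99, 0.999, 0.9999, 0.99999):
    downtime_minutes = (1 - sla) * HOURS_PER_MONTH * 60
    print(f"{sla:.3%} SLA allows ~{downtime_minutes:.1f} minutes of downtime per month")
```

At 99.99% the budget is only around four minutes a month, which is why the repair-time assumptions discussed below matter so much.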

For a single component, availability can be measured as the total up time in a month divided by the total time in the month. If we assume that the down time was caused by the time taken to repair a fault, then the availability of the component, expressed as a proportion, is:

A=1-Tr/(Tu+Tr)

Where Tu is the up time in a month and Tr is the repair or down time.

While the actual failure of a single component is not possible to predict, we typically rely on the probability of failure derived from the statistical average of a large population, as published by the vendor. This is usually expressed as the MTBF, as it relates only to the component itself. IT devices such as switches typically have MTBF values of between 50,000 and 200,000 hours. In the case of some of their leaf and spine switches, Arista and Cisco publish numbers as high as 500,000 hours.

If we assume a reasonable time to repair a given component, for example 24 hours, then over the likely life of the component the average availability can be predicted to be:

A=1-MTTR/(MTBF+MTTR)

If we take the case where MTBF is 100,000 hours and MTTR is 24 hours, then over the life of the component we get an average availability of:

A=1-24/(100,000+24) = 99.976%
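
As a quick check on that arithmetic, here is a minimal Python sketch of the same formula, using the example figures above:

```python
def component_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run average availability: A = 1 - MTTR / (MTBF + MTTR)."""
    return 1 - mttr_hours / (mtbf_hours + mttr_hours)

# The example above: MTBF = 100,000 hours, MTTR = 24 hours
print(f"{component_availability(100_000, 24):.3%}")  # 99.976%
```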

However, since SLAs are measured monthly (Am), a failure that results in a 24-hour outage of the system in a 720-hour month would only achieve 96.7% for that month.

Am = (30x24-24)/(30x24) = 0.967 = 96.7%
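
The contrast between the long-run average and a single bad month is easy to reproduce; a minimal sketch, assuming one 24-hour outage in a 30-day month:

```python
HOURS_PER_MONTH = 30 * 24   # 720 hours
OUTAGE_HOURS = 24           # one failure, repaired in 24 hours

monthly_availability = (HOURS_PER_MONTH - OUTAGE_HOURS) / HOURS_PER_MONTH
print(f"{monthly_availability:.1%}")  # 96.7%, well below a 99.99% SLA
```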

In the leaf and spine architecture we have been discussing, we avoid this problem by using redundancy at each level of the infrastructure so that any single component failure can be tolerated for the time taken to repair or, more realistically, replace it. More importantly, the probability of a compound failure is kept very low when the replacement time is very small compared to the MTBF.

In data centre environments, most network equipment can be configured to fail over automatically, fast enough to be effectively non-disruptive; for example, the four-spine leaf-spine fabric simply continues to function even if one switch is completely powered off. This is a precise demonstration of the concept of fault tolerance.

Now, assuming that this automatic redundancy is in place, the availability of a pair of switches is given by:

A=1-((MTTR/(MTTR+MTBF))*(MTTR/(MTTR+MTBF)))

Here the availability is 1 minus the very low likelihood of both devices failing at the same time.

So if we take the earlier example of an MTBF of 100,000 hours and an MTTR of 24 hours, we get an overall availability of:

A=99.9999942%
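
A minimal Python sketch of the pair calculation, using the same example figures:

```python
def pair_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A redundant pair is unavailable only when both devices are down at once."""
    unavailability = mttr_hours / (mtbf_hours + mttr_hours)
    return 1 - unavailability ** 2

print(f"{pair_availability(100_000, 24):.7%}")  # 99.9999942%
```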

While this availability figure is better than that of a single component, the more important point is that the design can tolerate either component failing, and the system keeps working while the repair is performed.

A system-level availability is calculated by imagining the pathway through the system as a sequence of redundant function blocks, each with its own unique availability. Such a system has an overall availability of:

Asys=Aa*Ab*Ac*Ad

An important consideration is that the more sequential blocks in the system, the lower the overall availability, as each individual figure will be less than 1 (Reliability Modelling, n.d.).

Now, applied to the leaf-spine architecture, the pathway from a server to the edge of the fabric is Leaf-Spine-Leaf-Edge, so the overall availability is:

Af=Ae*Al*As*Al

Assuming that every device has an MTBF of 100,000 hours and the same 24-hour replacement time, this yields an overall availability of 99.9999827% and tolerance to the failure of any single spine, leaf or edge switch.

Figure 2: Availability model for a 4-spine fabric
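
That 99.9999827% figure can be reproduced with a short sketch of the path model. How the blocks are composed is my reading of the 4-spine design (an assumption, since the detail sits in Figure 2): the edge and the two leaf hops are modelled as redundant pairs, and the spine hop as a block of four spines, any one of which is sufficient.

```python
MTBF, MTTR = 100_000.0, 24.0
u = MTTR / (MTBF + MTTR)        # per-device unavailability

pair = 1 - u ** 2               # edge or leaf hop: both devices must fail
spine_block = 1 - u ** 4        # spine hop: all four spines must fail

# Af = Ae * Al * As * Al
fabric_availability = pair * pair * spine_block * pair
print(f"{fabric_availability:.7%}")  # 99.9999827%
```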

While this sounds impressive and far exceeds the typical SLA of 99.99% availability, if we change the replacement time parameter to 48 hours the result becomes 99.9999309%, and if it were 7 days the number comes down to 99.9991561%.
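
Sweeping the repair time through the same model reproduces that sensitivity (again assuming the block structure sketched above):

```python
MTBF = 100_000.0

for mttr in (24.0, 48.0, 7 * 24.0):            # 1 day, 2 days, 7 days
    u = mttr / (MTBF + mttr)                    # per-device unavailability
    pair, spine_block = 1 - u ** 2, 1 - u ** 4
    fabric = pair * pair * spine_block * pair   # edge, leaf, spine, leaf
    print(f"MTTR {mttr:5.0f} h -> {fabric:.7%}")
```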

The critical takeaway from this is that the ability to promptly repair or replace a failed component is the most important factor in delivering “High Availability.” We need to have ready access to spare components and people available to replace them within our designed MTTR, or we could find ourselves on the metaphorical side of the road with a flat tyre.

The longer we delay the repair, the higher the likelihood that we will have a compound failure, and this is particularly true if the fault was the result of some environmental factor such as power or cooling.

Conclusion

As we have shown here, it is entirely possible to engineer a data centre fabric capable of meeting the contractual SLAs of an IT services engagement, but it is the operations team that ultimately makes the system highly available. In fact, the most likely cause of an outage will be human factors, or as I like to call them, “fat finger errors”, and this is where the true value of Software Defined automation comes in.

References

Architecting highly available cloud solutions — IBM Garage Practices. (n.d.). Retrieved September 21, 2020, from https://www.ibm.com/garage/method/practices/run/cloud-platform-for-ha

High availability versus fault tolerance. (n.d.). Retrieved September 21, 2020, from https://www.ibm.com/support/knowledgecenter/en/SSPHQG_7.2/concept/ha_concepts_fault.html

Planning for network availability. (n.d.). Retrieved September 21, 2020, from https://www.ibm.com/support/knowledgecenter/en/POWER5/iphae_p5/highavailability.html

Reliability Modelling. (n.d.). Retrieved September 22, 2020, from https://en.wikipedia.org/wiki/Reliability_engineering#Reliability_modeling


Mike Lyons

Mike is a Distinguished Engineer with Kyndryl and has a lifelong interest in the transport of information.