Converged Networking: Over-Subscription Ratio
Mike Lyons, IBM Executive Architect
Clea Zolotow, IBM Distinguished Engineer
Introduction
An important consideration in designing data centre networks is the over-subscription ratio. This is defined as the maximum amount of traffic a switch can receive from endpoints (i.e., servers directly attached to the switch), divided by the maximum amount of traffic it can send on the uplinks to the rest of the data centre network. In the case of a leaf-spine network, the leaf switch uplinks connect to each of the spine switches, which is key to how we manage over-subscription.
In three-tier DC LAN environments a 3:1 oversubscription ratio was typical for a traditional access switch; however, this didn't take into account oversubscription in the upper tiers of the network. This is mainly because the number of switch hops between any two access ports in the data centre could vary considerably, so it was complex to model the traffic flows or calculate the total oversubscription.
As we noted in an earlier blog, the flatter leaf-spine topology means the number of hops through the network between any two leaf switches is always the same, so calculating the oversubscription ratio becomes much easier.
Over-Subscription Ratio Calculation
The oversubscription ratio can be calculated using the formula:
(Pn x Ps):(Un x Us)
where Pn is the number of connected leaf ports, Ps is the speed of those ports, Un is the number of spine uplinks, and Us is their speed.
As an example, imagine a leaf switch with 48 x 10Gb/s ports for attaching endpoints, giving 480Gb/s of downlink port capacity. If this leaf is connected to 4 spine switches at 40Gb/s each, it has a total uplink capacity of 160Gb/s. So the oversubscription ratio is:
480:160, and dividing both sides by 160 we get 3:1.
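As a quick sanity check, the same calculation can be written as a minimal Python sketch; the port counts and speeds below are simply those of the worked example above.

```python
def oversubscription_ratio(pn, ps, un, us):
    """Return (Pn x Ps) / (Un x Us), so 3.0 means a 3:1 ratio."""
    return (pn * ps) / (un * us)

# Worked example: 48 x 10Gb/s leaf ports, 4 x 40Gb/s uplinks to the spines.
ratio = oversubscription_ratio(pn=48, ps=10, un=4, us=40)
print(f"{ratio:g}:1")  # prints 3:1
```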
The flatter leaf-spine topology makes this ratio both easier to calculate and easier to manage. Since it is calculated separately for each switch it isn't a static number across the fabric, and with multiple spine switches it is possible to clearly predict the impact on network capacity in the event of a spine switch failure.
The Scotch Bucket Theory
To understand why oversubscription is important, picture a bucket with a hole in the bottom which you use to fill a series of bottles with a liquid that you value and don't want to spill (no judgement). For each bottle you stop the flow of the liquid while you swap in the next bottle, and so on. Now imagine this valuable liquid is being poured into the bucket from above, from a source you can't stop. If the rate of liquid pouring into the bucket is less than the average rate at which the liquid is released into the bottles, then you have little chance of the bucket overflowing unless you block the hole at the bottom for too long.
The bigger the bucket the longer you are able to block the hole. If the inflow rate is higher than the average outflow rate then eventually the bucket will fill up and spill your precious liquid (again no judgement). In this situation you can say the capacity of the filling process has been oversubscribed.
This analogy is described in queueing theory as the leaky bucket algorithm, and it is the basis for just about every type of network Quality of Service (QoS) (Leaky Bucket Analogy — Leaky Bucket — Wikipedia, n.d.). The technique is central to managing network bottlenecks where, for instance, a Local Area Network that is mostly 1Gb/s Ethernet is connected to a WAN service of, say, 100Mb/s (an example of 10:1 oversubscription), so the site router needs to shape the traffic to the WAN by queuing (or buffering) it and sometimes discarding (or spilling) packets. If, however, the WAN circuit were 1Gb/s, the router is quite likely to be able to forward all the traffic from the LAN with very little risk of dropping packets.
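To make the analogy concrete, here is a minimal leaky-bucket sketch: bursty LAN traffic pours into a buffer (the bucket) while it drains towards the WAN circuit (the hole), and anything that overflows the buffer is dropped. The rates, buffer depth and traffic pattern are illustrative assumptions, not figures from any particular router.

```python
WAN_RATE = 100        # Mb drained per one-second tick (the hole in the bucket)
BUFFER_DEPTH = 50     # Mb of buffering on the router (the size of the bucket)
offered = [200, 0, 300, 0, 0, 100]   # assumed bursty LAN traffic per tick, in Mb

queued = 0.0
dropped = 0.0
for burst in offered:
    queued += burst - WAN_RATE         # the burst pours in while the hole drains
    if queued > BUFFER_DEPTH:          # the bucket overflows...
        dropped += queued - BUFFER_DEPTH
        queued = BUFFER_DEPTH          # ...and the excess is "spilled" (dropped)
    queued = max(0.0, queued)          # the bucket can't drain below empty

print(f"Dropped {dropped:.0f} Mb of the {sum(offered)} Mb offered")
```

Re-running the same sketch with WAN_RATE set to 1000 (a 1Gb/s circuit, i.e. no oversubscription) forwards every burst with nothing dropped, which is exactly the point made above.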
Oversubscription, Buffering and Failure Scenarios
In the case of our leaf-spine topology it is preferable to keep the oversubscription ratio close to 1:1. This ensures that the probability of a leaf switch not being able to buffer every packet it must forward is almost zero. It also works in the other direction, so a ratio of 1:1 represents a symmetrical ability for the leaf switches to forward uplink and downlink traffic with no packet loss.
The oversubscription ratio during a spine failure scenario can be taken into account with an N+1 capacity design. For example, the capacity of a 2-spine network drops by half during a single spine outage, but only drops by one quarter with a 4-switch spine (see the sketch after the list below). So, when designing the fabric we need to consider the following:
· The number of spine switches required depends on the overall bandwidth required by the fabric and the number of leaf switches, including during failures and repair time.
· The number of leaf switches depends on the number of servers and other devices connected to the fabric.
· The number of uplinks from a leaf switch to the spine. These are usually limited to 4 or 6, but at high speed such as 40 or 100Gb/s. This determines the maximum number of spine switches you could have in your fabric, since leaf switches must connect to every spine switch (up to 4 in this discussion).
· The number of ports you have in your spine switches. This determines the number of leaf switches you can have (up to 64 in the case of a fixed-configuration spine with 64 ports).
· All spine switches must implement ECMP (Equal Cost Multi-Path) packet forwarding identically so that traffic is evenly distributed across the spine switches. This may require the same model type and software on each spine, and is particularly important when replacing a spine switch.
· Leaf pairs should be the same device type; however, a variety of leaf pair types can be used across the fabric. The leaf switch model depends on the devices attached to it (e.g. a storage leaf may require more buffering than a compute leaf, although the way the storage is shared across the network plays a big part in this, but more on that another time).
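To put numbers on the spine-failure point above, the following sketch compares a 2-spine and a 4-spine design during a single spine outage. It assumes 48 x 10Gb/s leaf ports and one 100Gb/s uplink per spine; substitute your own port counts and speeds.

```python
def ratio(leaf_ports, port_speed, spines, uplink_speed):
    """Oversubscription ratio for one leaf: downlink capacity / uplink capacity."""
    return (leaf_ports * port_speed) / (spines * uplink_speed)

for total_spines in (2, 4):
    normal = ratio(48, 10, total_spines, 100)
    failed = ratio(48, 10, total_spines - 1, 100)   # one spine out of service
    print(f"{total_spines} spines: {normal:.1f}:1 normally, "
          f"{failed:.1f}:1 with a spine down")
```

The 2-spine design loses half of its uplink capacity when a spine fails, while the 4-spine design loses only a quarter, which is the N+1 argument made above.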
The following table illustrates how we can design for a specific over-subscription ratio; ratios less than 1:1 are depicted in green.
As you can see from the table, oversubscription is closest to 1:1 when the total uplink and downlink capacities are matched. For example, take a Cisco Nexus switch such as the N9364C, a fixed-configuration spine switch with 64 ports, so each spine can host 64 leaf switches. If we choose a 48-port leaf switch capable of 10Gb/s on each port, with 100Gb/s uplinks to 4 spines, we get an oversubscription of 1.2:1; however, if we reserve 8 ports on each leaf (for management etc.) and use only 40 for 10Gb/s traffic, then we get exactly 1:1 for the production ports.
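A short sketch like the following can generate this kind of design comparison. The leaf-port combinations are assumptions chosen to include the 1.2:1 and 1:1 cases from the example, with 4 x 100Gb/s uplinks throughout.

```python
UPLINKS, UPLINK_SPEED = 4, 100          # 4 spines at 100Gb/s, as in the example
UPLINK_CAPACITY = UPLINKS * UPLINK_SPEED

print(f"{'leaf ports':>10} {'speed':>6} {'downlink':>9} {'uplink':>7} {'ratio':>7}")
for ports, speed in [(48, 10), (40, 10), (48, 25), (16, 25)]:
    downlink = ports * speed
    print(f"{ports:>10} {speed:>5}G {downlink:>8}G {UPLINK_CAPACITY:>6}G "
          f"{downlink / UPLINK_CAPACITY:>5.2f}:1")
```

As the 40 x 10Gb/s and 16 x 25Gb/s rows show, quite different port mixes can land on the same 1:1 ratio, which leads into the conclusion below.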
Conclusion
While working through the planning of a potential data-centre deployment it quickly becomes clear that maintaining 1:1 across the fabric is driven by the number and speed of the server ports we host on each leaf. It doesn't matter whether we use a combination of 10, 25 or 40Gb/s connections to the servers, as long as the total connected capacity doesn't oversubscribe the leaf switch's uplinks to the spine.
Another important conclusion you can draw is that the design of the network must be closely coordinated with the planning for the compute resources and the network interfaces they have. This life-cycle planning becomes critical, particularly when we plan to provide storage over the Ethernet fabric rather than build a separate storage fabric. There are a lot of advantages to be gained by keeping all of the technologies aligned, but more on that later…
References
Leaf-Spine Topology Construction Examples. (n.d.). Retrieved September 1, 2020, from http://www.fiberopticshare.com/leaf-spine-topology-construction-examples.html
Leaky Bucket Algorithm in Computer Networks — Webeduclick. (n.d.). Retrieved September 1, 2020, from https://webeduclick.com/leaky-bucket-algorithm-in-computer-networks/
Leaky bucket analogy — Leaky bucket — Wikipedia. (n.d.). Retrieved September 1, 2020, from https://en.wikipedia.org/wiki/Leaky_bucket#/media/File:Leaky_bucket_analogy.JPG
Oversubscription. (n.d.). Retrieved September 1, 2020, from https://docs.vmware.com/en/VMware-Validated-Design/4.0/com.vmware.vvd.sddc-design.doc/GUID-0ED17BA7-1A17-437E-B5B0-2F5E2DBD4327.html