The previous article, “Collocation and number of nodes in a Red Hat Ceph cluster”, explained the ground rules of using collocation in a Ceph cluster under the simplifying assumption that no additional failure domains had to be respected or planned for. Whether replication or Erasure Coding is used as the redundancy scheme was also not part of the picture so far; this article brings both aspects into the discussion.
Top-level and sub-level failure domains
The redundancy scheme of Ceph's data management can spread data in a way that respects failure domains. The hierarchy of failure domains is described in the official upstream documentation. Failure domains are called “buckets”, such as “row” for a row of racks in an on-premises data center. Although bucket types exist for pretty much all common kinds of failure domains, the names can stand in for any other failure domain one might have in one's data center – only the hierarchy of the bucket types matters, not whether their names reflect the real world. For example, if an air conditioning system is central to a set of different rooms, the bucket type “room” might not be sufficient while “datacenter” might be too wide. In that case, “room” could be used for a specific room, “datacenter” could reflect the set of rooms covered by the air conditioning system, and “zone” could cover the real data center. In the explanations that follow, I continue to use the term failure domain; it can be any of these, and in some special setups I will point out options to consider.
The failure domains can be used to store the data reliably. The redundancy scheme, replication or Erasure Coding, is essential here: the minimum number of failure domains for distributing the replicas should match the number of replicas of our data pools, whereas for an Erasure Coding scheme of k+m chunks the minimum should be k+m+1 failure domains – one per chunk, plus one spare.
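These minimum counts can be captured in a small sketch (the function names are my own, for illustration only, not part of any Ceph API):

```python
def min_failure_domains_replicated(size):
    """Replication: one top-level failure domain per replica."""
    return size

def min_failure_domains_ec(k, m):
    """Erasure Coding: one failure domain per chunk, plus one spare
    so all k+m chunks can still be written after a domain failure."""
    return k + m + 1

print(min_failure_domains_replicated(3))  # 3 replicas -> 3 domains
print(min_failure_domains_ec(4, 2))       # 4+2 scheme -> 7 domains
```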
Replication is the easiest way
Placing the replicas is easy: the replicas we decided to use for a pool should be stored into different top-level failure domains. If we had 3 nodes and decided to use 3 replicas, any OSD within a node could be used to store its replica – the failure domain would be the node, and if one node failed there would still be 2 additional replicas available.
Suppose there were 6 nodes in 3 different failure domains beyond the nodes themselves, such as a configuration with 2 nodes per rack in 3 different racks. OSDs could become unavailable because of a media failure, a node failure, a network failure, or a rack failure – with different numbers of OSDs affected depending on the exact nature of the impact. A network failure limited to the rack could be considered a rack failure, whereas a network failure neither limited to nor directly affecting the rack would be a second failure that might hit another failure domain – another rack or just another node. Although a node in a rack can be redundantly connected, it still has the OS or the chassis as a single point of failure; likewise, the rack itself might impact all or some of its nodes if it fails, or might in turn be impacted by the actual infrastructure layout of power, cooling, and so on. The higher-level structures that other failure domains depend on must be recognized by the admin to properly define the structure of failure domains to respect. The domains on the highest level I will call “top-level failure domains”; all others that exist underneath I will call “sub-level failure domains”. The drawing below might help to illustrate this:
By focusing on the top-level failure domains when distributing the replicas, we avoid losing more than one replica to a single incident within a failure domain structure. However, in some structures the top-level failure domains might not be all we need to look at, and sometimes a physical top-level domain turns out to be overruled by a virtual one – for instance, when a specific sub-level domain is the highest level allowed for a certain workload, while the sibling sub-level domains within the same top-level failure domain are not:
In the above drawing, there are 3 racks. The first rack, Rack 1, might have nodes deployed that can all be used for the first replica; its top-level failure domain is Rack 1. The second rack, Rack 2, has two sets of nodes: one set, represented by Node 2, connected to network switch “net 2”, and the other set, represented by Node 3. Only the set of Node 2 should be used for our pool, and it has its own network switch connecting it to the cluster; hence the failure domain for this pool is not Rack 2 but the set of Node 2. The last rack, Rack 3, has two sets of nodes, Node 4 and Node 5, but all of them can be used to place our 3rd replica and all are connected through the same switch. That switch impacts the availability of all nodes inside the rack in the same way, so we can take the rack itself as the 3rd top-level failure domain, and any of its node sets can take our replica. Although Rack 2 is a physical failure domain, it is further structured by the network switches, so its sets of nodes belong to an additional physical failure domain at a lower level. The failure domains to use for placing the 3 replicas properly and reliably would thus be “Rack 1”, “net 2”, and “Rack 3”. This example is a little artificial, but it illustrates the degrees of freedom to understand when defining the placement of roles for different use cases.
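The example can be sketched as a toy placement model – the domain and node names mirror the drawing and are purely illustrative, not a CRUSH implementation:

```python
import random

# Toy model of the example: one entry per independent top-level domain.
failure_domains = {
    "Rack 1": ["node1a", "node1b"],  # whole rack usable for the pool
    "net 2":  ["node2a", "node2b"],  # sub-level domain promoted to top level
    "Rack 3": ["node4", "node5"],    # shared switch: the rack is the domain
}

def place_replicas(domains, replicas=3):
    """Pick one node from each of `replicas` distinct failure domains."""
    chosen = random.sample(sorted(domains), replicas)
    return {domain: random.choice(domains[domain]) for domain in chosen}

placement = place_replicas(failure_domains)
assert len(placement) == 3  # each replica lands in its own domain
```

Losing any single domain in this model still leaves two replicas intact, which is exactly the property the drawing is meant to convey.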
Distributing the replicas across more top-level failure domains can be done without any additional impact if the networking resources are sufficient. In addition, the failure domains might not necessarily be equally configured: One data center room might have only one rack whereas the two other available ones might have 2 or 3 racks and so on. The important part here would be to understand the possible failure scenarios and then carefully determine the top-level failure domains to use for crafting a proper data distribution for data redundancy.
The art of reliable Erasure Coding
When using Erasure Coding, the schemes supported with Red Hat Ceph Storage are k+m = {4+2, 8+3, 8+4}, with k being the number of data chunks and m the number of coding chunks. With the scheme (k+m) = (4+2) we end up with 6 different chunks. Data objects are not stored as whole objects but are split into k data chunks of nearly equal size. To write a data object properly in the Erasure Coding scheme, all k+m chunks must be stored on different OSDs; to read it back completely, at least k chunks of any kind must be available – not necessarily all of them data chunks. Needing 4 chunks for a successful read, the absolute minimum number of nodes over which to distribute the chunks would be (4+2)/2 = 3, with 2 chunks per node. If one of these nodes failed, 4 chunks would still be available. However, if one node failed and, in addition, an OSD holding one of the remaining chunks failed, no sufficient number of chunks could be read and no data could be delivered. Considering that media can fail entirely independently of other issues within the cluster, placing more than one chunk on the same node is strongly discouraged and not supported. Furthermore, even in larger clusters that take no failure domains other than nodes into account, nodes can fail randomly for different reasons, so all chunks should always be written: if a data object was successfully written to k+m nodes, 6 nodes in our example, one node and one additional medium could fail and we would still be able to read the data back properly. If we had written only 5 of the 6 chunks, and a set of 2 additional nodes then went down while the previously failed one recovered, we would have only 3 valid chunks available and could not recreate the missing ones.
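The read-availability argument can be checked with a few lines of arithmetic (a sketch; the chunk counts follow the 4+2 example above with the unsupported 2-chunks-per-node layout):

```python
K, M = 4, 2                       # 4 data chunks, 2 coding chunks

def readable(surviving_chunks, k=K):
    """An EC object is readable if at least k of its k+m chunks survive."""
    return surviving_chunks >= k

chunks_per_node = (K + M) // 2    # 2 chunks per node on the 3-node minimum
after_node_failure = (K + M) - chunks_per_node      # 4 chunks remain
print(readable(after_node_failure))                 # True: still readable
after_extra_osd_failure = after_node_failure - 1    # 3 chunks remain
print(readable(after_extra_osd_failure))            # False: data unreadable
```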
We would need to repair at least one of the recently failed nodes to be able to read the data properly and to recreate the missing chunks. The minimum supported number of nodes for Erasure Coding is therefore k+m+1: the extra node allows all chunks to always be written even while one node is down – assuming that sufficient free capacity is available within the cluster to allow proper self-healing, too.
Using Erasure Coding across failure domains implies that the same rules can still be followed even if a failure domain becomes unavailable. For clusters that are not full and have enough headroom on the media, the number of top-level failure domains should therefore be k+m+1 as a minimum. Assuming no shared infrastructure impacts the top-level failure domains jointly, more top-level failure domains could also be used. A 4+2 Erasure Coding scheme would require at least 7 failure domains that are equally independent from each other. Those could be 7 racks or 7 rows in a data center, assuming nothing else impacts the availability of the nodes in the different rows even if shared resources for power, cooling, etc. are used. This is a possible scenario and doable in most larger environments, but finding 7 independent data centers with fully independent network connections between all of them might be difficult; in most environments, this leads to using only one data center. An illustration of the requirements mapped onto racks in a single data center shows this:
When considering a rather full cluster with nodes in different failure domains for use with Erasure Coding, enough space should remain in the surviving nodes of a top-level failure domain after a node failure to still be able to always write all chunks. In addition, the number of nodes per top-level failure domain should be one more than required – in small configurations to provide a spare node covering a node failure, in larger configurations to provide enough free capacity. For full consumption of the available space, no additional node needs to be provided once a top-level failure domain has more than 19 nodes:
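One way to read the “more than 19 nodes” rule of thumb (my interpretation, not an official formula): losing one node removes 1/n of a failure domain's raw capacity, and at n = 20 that drops to 5%, small enough to be absorbed by the free capacity a prudently operated cluster keeps anyway:

```python
def capacity_lost_by_one_node(nodes_per_domain):
    """Fraction of a failure domain's raw capacity lost with one node down."""
    return 1 / nodes_per_domain

for n in (6, 10, 20):
    print(f"{n} nodes: {capacity_lost_by_one_node(n):.1%} lost")
```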
Although the explanation so far has focused on the top-level failure domains as the main lever for proper data distribution, sub-level failure domains must not additionally limit the placement of the data; placement must remain possible on any of the nodes, even across different sub-level failure domains. If a sub-level failure domain must be used exclusively for a pool that is not allowed to store data in the other sub-level failure domains, this sub-level failure domain automatically becomes a top-level failure domain for this part of the failure domain structure.
For the OSDs, this might be easy now, but assessing the placement of the different other roles onto nodes that always belong to a certain level of the failure domain structure needs more careful planning – and perhaps different sizing for the different nodes.
In this article, I introduced different ideas for distributing data for resiliency across real failure domains. In the next article, I will expand this resiliency idea to the additional roles in the cluster. Thanks for reading so far!
