Explaining routes to a proper design of Ceph clusters, this article continues a series of examples on collocation with respect to failure domains. It reflects recent versions of Red Hat Ceph Storage up to version 6; newer versions of Ceph might slightly change the rules and will need double-checking for possibly enriched placement options.
The example of Case F (from Case B but across 3 domains)
In this example, the setup to explore is based on case B from “Collocation and number of nodes in a Red Hat Ceph cluster” – with the difference that we now have 3 failure domains. Here, too, the simple formula for Sum_nodes_min from the initial article cannot be used. The task is to minimize the number of nodes required while, again, properly distributing the roles for redundancy.
To recap, the list of roles and configurations is as follows:
- 5 MONs,
- CephFS file system 1 with 1 active and 1 passive MDS,
- CephFS file system 2 with 2 active and 1 passive MDS,
- 2 different sets of redundant iSCSI gateways,
- 2 zones for RGW realm One with 2 instances each,
- 3 zones for a multi-site replicated RGW realm – two of them with different networks – and 4 RGWs per zone for higher throughput,
- The dashboard should be hosted by a VM.
Explaining the requirements and dependencies
The first task is to decide on the number of monitors: the requested 5 monitors cannot be evenly distributed across 3 failure domains. One of the three will host only a single monitor while the other two contain two monitor roles each. This looks odd at first, but technically it works: if one of the failure domains hosting 2 monitor roles becomes unavailable, the remaining 3 monitors can still form a proper quorum. If only the failure domain with the single monitor fails, there are also sufficient monitor instances available for the quorum.
In contrast, when aiming for an equal distribution of monitor instances across the three failure domains, the next reasonable number would be 9, because the intermediate option of 6 monitors is not something usually desired. Still, even with 6 monitors – fairly unusual in an ordinary cluster – the failure conditions would match the requirement that a majority of the deployed monitor roles form a working quorum: 4 monitors out of 6 deployed, with 2 monitors in each of the two surviving failure domains. A split environment, which definitely should stop access to the cluster, could only arise from a more complex failure scenario. If the failure domains are, for example, a set of racks, a whole rack would have to fail at the same time as a network separation between the two remaining racks occurs; or, in another scenario, one rack fails and an additional monitor inside one of the surviving racks fails as well. Such cases are no longer a single failure but a set of independent failures occurring at the same time or shortly after one another.
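The majority rule behind this reasoning can be sketched in a few lines of Python (an illustration only, not tooling from the article): a monitor distribution survives the loss of a single failure domain if the remaining monitors still form a majority of all deployed monitors.

```python
# Illustration of the quorum argument (not tooling from the article):
# a monitor distribution across failure domains survives the loss of
# any single failure domain if the remaining monitors still form a
# majority of all deployed monitors.
def survives_single_fd_loss(mons_per_fd):
    total = sum(mons_per_fd)
    quorum = total // 2 + 1  # majority of deployed monitors
    return all(total - lost >= quorum for lost in mons_per_fd)

print(survives_single_fd_loss([2, 2, 1]))  # 5 MONs as 2+2+1 -> True
print(survives_single_fd_loss([2, 2, 2]))  # 6 MONs as 2+2+2 -> True
print(survives_single_fd_loss([3, 3]))     # 6 MONs in only 2 FDs -> False
```

The last call also shows why spreading monitors across only two failure domains cannot survive the loss of either one.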
A set of only 3 failure domains does not provide a good environment for using Erasure Coding as the redundancy scheme for any of the data. The natural match in such a case is replica 3 or, if the reduced reliability is well understood, at minimum replica 2.
The only special roles are the iSCSI gateways, which require a dedicated node per instance. Since separate nodes need to be planned for those, this is the starting point for finding the proper distribution.
The MDS instances required per file system can easily span two or three failure domains. Since the file systems have no dependency on each other, the MDS placement can utilize any of the failure domains.
Assigning the RGW instances properly while completely fulfilling the stated requirements might become tricky. The first 2 zones, for realm One with two instances each, have no further requirements, and their RGW instances can simply be distributed across different failure domains.
The RGW distribution for the 3 zones with 4 RGW instances each – required for higher throughput – can be achieved in different ways. If each zone should use all available failure domains, it would have one failure domain with 2 instances and 2 failure domains with one instance each. This could also mean that some of the nodes need two client-side networks while others need only one. With networks separated for security, this could cause issues in the configuration design.
An alternative RGW configuration design could leverage the fact that only two of the zones would need different networks. It is not stated whether the networks also differ between those two zones, but they could. There is some fuzziness in the requirements here, and a refinement should be considered prior to the design.
Calculation for the failure domains
Different options are available for placing the roles, and the placement of the first roles will potentially influence the final solution. Where the assignment starts is not that important, but it should address the most difficult roles first.
Using 5 monitor instances, those could be distributed as in the table below:
| role type | FD1 | FD2 | FD3 |
| --- | --- | --- | --- |
| MON | 2 | 2 | 1 |
Next come the iSCSI gateways, since those require individual nodes. When distributing the 4 instances in pairs, one might consider using all 3 failure domains instead of only 2, so that at least one iSCSI gateway pair stays redundant if a failure domain becomes unavailable. This is a viable approach but brings additional challenges: one failure domain would host two instances from different pairs, which might require different access networks and therefore a special network configuration for that failure domain only. Instead, to avoid additional troubleshooting effort in case of a failure, the iSCSI gateways can be placed into only 2 of the failure domains, leaving the 3rd untouched. The 3rd failure domain then needs no iSCSI access network at all, while the other two will perhaps have two access networks each – but with identical network configurations, whereas in the first variant all three failure domains would be configured differently.
In my example configuration, I decide to go with the second option, choosing the second and the third failure domains to host the iSCSI gateways. The second failure domain already needs 2 nodes for the monitors placed there, while the 3rd failure domain so far requires only one node.
| role type | FD1 | FD2 | FD3 |
| --- | --- | --- | --- |
| MON | 2 | 2 | 1 |
| iSCSI | 0 | 2 | 2 |
Moving over to the RGWs: assume, as the refinement to be made, that the two network-separated zones of the multi-site configuration share a common network that differs from the network of the 3rd zone. Two failure domains would then carry this common network, but at least one of them must additionally provide network access to the 3rd zone. So when going for redundancy, a mix of networks in at least one failure domain must be tolerated.
With 4 instances per zone to provide redundancy and throughput, each zone can be split into two pairs spread across two failure domains. For 3 different zones, this gives 6 pairs of 2 RGW instances each.
| role type | FD1 | FD2 | FD3 |
| --- | --- | --- | --- |
| MON | 2 | 2 | 1 |
| iSCSI | 0 | 2 | 2 |
| RGW | 2+2 | 2+2 | 2+2 |
The 2 zones of realm One also require their instance pairs to be distributed. Before assigning them, a first calculation helps to understand the number of nodes actually needed.
As an indication, the example calculation for FD1 would be:

FD1: Sum_nodes_min = 0 + 0
       + max( ROUNDUP( (2 + [2+2] + [0] + 0) / 2 ),
              max( 2, (4), (0) ) )
     = 0 + max( ROUNDUP(6/2), max(2, 4, 0) )
     = 0 + 4
     = 4
For FD2, however, even more nodes are required:

FD2: Sum_nodes_min = 2 + 0
       + max( ROUNDUP( (2 + [2+2] + [0] + 0) / 2 ),
              max( 2, (4), (0) ) )
     = 2 + max( ROUNDUP(6/2), max(2, 4, 0) )
     = 2 + 4
     = 6
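To make the arithmetic reproducible, here is a minimal Python sketch of the calculation as I read it (the helper name and decomposition are mine, not from the series): dedicated iSCSI nodes come first, then the larger of two bounds on the scale-out roles – the ROUNDUP term from collocating at most 2 scale-out roles per node, and the largest single role count, since a node hosts at most one instance of each role type.

```python
import math

# Sketch of Sum_nodes_min per failure domain (my reading, hypothetical
# helper): iSCSI gateways need dedicated nodes; the remaining roles are
# scale-out, with at most 2 of them collocated per node and at most one
# instance of a given role type per node.
def nodes_min(iscsi_nodes, scale_out_counts):
    total = sum(scale_out_counts)
    return iscsi_nodes + max(math.ceil(total / 2), max(scale_out_counts))

# FD1: no iSCSI, 2 MONs, 2+2 RGWs, no MDS assigned yet
print(nodes_min(0, [2, 2 + 2, 0]))  # -> 4

# FD2: the same scale-out mix plus 2 dedicated iSCSI nodes
print(nodes_min(2, [2, 2 + 2, 0]))  # -> 6
```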
Checking the actual assignment in those configurations is crucial for the following steps when the goal is to keep the number of nodes for the distribution of the essential roles at a minimum. No additional checks might be needed, however, in cluster sizes where enough nodes are available anyway due to the desired capacity.
The next table summarizes the actual number of nodes in the interim solution:
| role type | FD1 | FD2 | FD3 |
| --- | --- | --- | --- |
| MON | 2 | 2 | 1 |
| iSCSI | 0 | 2 | 2 |
| RGW | 2+2 | 2+2 | 2+2 |
| interim sum: | 4 | 6 | 6 |
Given this actual need, and to distribute the remaining RGW instances for redundancy while reducing the need for additional networks across the failure domains, the realm One zones can be assigned to the 1st and the 3rd failure domains:
| role type | FD1 | FD2 | FD3 |
| --- | --- | --- | --- |
| MON | 2 | 2 | 1 |
| iSCSI | 0 | 2 | 2 |
| RGW | 2+2+2 | 2+2 | 2+2+2 |
| interim sum: | 6 | 6 | 8 |
The last assignments to be done are the sets of MDS instances. These are scale-out daemons and, per the collocation rules, can be collocated with MONs or RGWs. Since 5 MDS instances need to be assigned, the distribution could either use all failure domains – at least one instance per failure domain and 2 in two of them – or check for the best match based on the scale-out roles that need to be assigned already.
While at first glance it might seem obvious that FD3 – already the failure domain with the most nodes – should not accommodate additional roles, checking the actual distribution of scale-out roles and possible empty “slots” is the better approach. In FD1, the RGW instances alone already require 6 nodes; based on the collocation rules, some of those nodes can accommodate another scale-out role, like an MDS, without adding more nodes. FD2 has 2 nodes with an empty “slot”, and FD3 has 6 nodes with potential collocation, 5 of them with empty “slots”. This suggests assigning 2 MDS instances – one active and one passive – to FD1, an active MDS to FD2, and the remaining 2 MDS instances to FD3. This way, each active MDS can fail and a passive one in another failure domain takes over. Since there are two different file systems, the active MDS instances of the second file system should be in FD1 and FD3, with the passive one in FD2.
| role type | FD1 | FD2 | FD3 |
| --- | --- | --- | --- |
| MON | 2 | 2 | 1 |
| iSCSI | 0 | 2 | 2 |
| RGW | 2+2+2 | 2+2 | 2+2+2 |
| MDS | 2 | 1 | 2 |
| sum: | 6 | 6 | 8 |
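The per-failure-domain sums in the table can be cross-checked with a small helper. This is a sketch based on my reading of the collocation rules – dedicated nodes for iSCSI gateways, at most 2 scale-out roles per node, and at most one instance of a given role type per node – not an official formula:

```python
import math

# Sketch, assuming these collocation rules: iSCSI gateways occupy
# dedicated nodes; all other roles are scale-out, with at most 2
# scale-out roles per node and at most 1 instance per role type per node.
def fd_nodes(iscsi, mon, rgw, mds):
    scale_out = mon + rgw + mds
    # Nodes are bounded by the 2-roles-per-node limit and by the
    # largest single role count.
    return iscsi + max(math.ceil(scale_out / 2), mon, rgw, mds)

print(fd_nodes(iscsi=0, mon=2, rgw=6, mds=2))  # FD1 -> 6
print(fd_nodes(iscsi=2, mon=2, rgw=4, mds=1))  # FD2 -> 6
print(fd_nodes(iscsi=2, mon=1, rgw=6, mds=2))  # FD3 -> 8
```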
Resulting configuration
This assignment does not change the number of nodes required, so the node count stays at the minimum for the chosen configuration constraints. I deliberately took the structured approach of confining identical network connection needs to as few failure domains as possible.
When deciding how to fill the nodes of the failure domains, one needs to keep in mind that the need for capacity might dictate a different number than the minimum number of nodes. In many configurations the roles dictate the number of nodes rather than the number of storage media. In most cases, such designs end up with an equal number of nodes across the failure domains.
Another approach, ignoring the challenges of mixing the network access paths within the failure domains, could potentially reduce the number of nodes required further – which might not matter if enough nodes are needed anyway for providing the capacity:
| role type | FD1 | FD2 | FD3 |
| --- | --- | --- | --- |
| MON | 2 | 2 | 1 |
| iSCSI | 0 | 2 | 2 |
| RGW | 2+2+1 | 2+2+1+1 | 2+2+1 |
| MDS | 2 | 1 | 2 |
| sum: | 5 | 8 | 7 |
Counting the sums, however, both distributions end up at 20 nodes in total, so in this particular example the relaxed network separation does not actually save nodes: the 16 RGW instances alone already require 16 nodes next to the 4 dedicated iSCSI nodes. Still, in smaller environments, separating all the network paths across the failure domains might not be that important – however, it may be.
In this article I covered a more complex configuration with replication in scope. Because of the more complex dependencies to respect, the calculation should follow an iterative process to find potential savings. It is also important to understand the additional infrastructure requirements and maintenance efforts. The 3 failure domains can be understood as 3 different racks in a data center, or they could be three different DCs.
What started as a fairly clear configuration revealed some degrees of freedom but also complexity; further configurations will be discussed in the coming articles. Thanks for reading so far and stay tuned!
