Collocation example – A simple example with 4 failure domains and replication of 3



Providing more failure domains for a distributed system can mitigate many placement problems when specific roles must be redundant. In this article, part of a series of examples, I explore how to determine the minimum number of nodes required when a fairly well understood configuration is transferred into an environment with 4 failure domains.

The example of Case H (from Case B but across 4 failure domains)

With the same set of requirements, the environment to use now has 4 equal failure domains. This is somewhat uncommon but illustrates further options when multiple equally rated failure domains, such as racks or fire zones, are available for installing the Ceph cluster.

Again, the simple formula for Sum_nodes_min from the initial article “Collocation and number of nodes in a Red Hat Ceph cluster” cannot be used. However, as in the previous example, I aim to use as few nodes as possible.

The list of roles and configurations is as follows:

  • 5 MONs, 
  • CephFS file system 1 with 1 active and 1 passive MDS, 
  • CephFS file system 2 with 2 active and 1 passive MDS, 
  • 2 different sets of redundant iSCSI gateways, 
  • 2 zones for RGW realm One with 2 instances each, 
  • 3 zones for a multi-site replicated RGW realm, two of them with different networks, each zone with 4 RGW instances for higher throughput,
  • The dashboard should be hosted by a VM.

Explaining the requirements and dependencies

There are now 4 main failure domains within the same network perimeter. Each of those failure domains is connected to the others with the same redundancy and bandwidth, which makes the failure domains equal with regard to the impact an outage has on the cluster.

Maintaining a proper quorum of the monitors is also key in a structure with 4 failure domains. With 5 monitor instances to place, an even distribution would leave one failure domain with 2 instances. There is another option: one could use only 3 of the failure domains for the monitor instances. Using only 2 failure domains is not an option, since that would require a tie-breaker monitor instance, which in turn would require all OSDs to reside within those 2 failure domains; this would not make use of the 4 failure domains provided. The other two options are both valid: if any one of the domains fails, the remaining failure domains still hold the majority of the deployed monitor instances and provide a proper quorum for the monitor cluster.
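The quorum argument can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and the 2-2-1 and 3-2 splits are my assumed examples of a 3-domain and a 2-domain layout, which the text above does not spell out.

```python
def survives_single_domain_loss(mons_per_domain):
    """Return True if a strict majority of all monitors remains
    after the loss of any one failure domain."""
    total = sum(mons_per_domain)
    majority = total // 2 + 1
    return all(total - lost >= majority for lost in mons_per_domain)


# Candidate layouts for 5 monitors across FD1..FD4 (assumed examples).
layouts = {
    "4 domains (2-1-1-1)": [2, 1, 1, 1],
    "3 domains (2-2-1)": [2, 2, 1, 0],
    "2 domains (3-2)": [3, 2, 0, 0],
}

for name, layout in layouts.items():
    print(name, survives_single_domain_loss(layout))
# 4 domains (2-1-1-1) True
# 3 domains (2-2-1) True
# 2 domains (3-2) False
```

The 3-2 split fails because losing the domain with 3 monitors leaves only 2 of 5, which is below the majority of 3.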

The example focuses on replica 3 for data redundancy, as stated in the title. Each replica must reside in a different failure domain, but since there are 4 failure domains, the placement could be 1-2-3, 2-3-4, 1-3-4, or 1-2-4. All those placements are valid and leave a sufficient number of replicas in the case of an outage of one failure domain.
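As a small illustration, the valid placements can be enumerated as the 3-element combinations of the 4 failure domains. The variable names are hypothetical; the sketch only counts replicas and does not model CRUSH itself.

```python
from itertools import combinations

failure_domains = [1, 2, 3, 4]
replica_count = 3

# Every way to put 3 replicas into 3 distinct failure domains.
placements = list(combinations(failure_domains, replica_count))
print(placements)
# [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4)]

# Worst-case number of replicas left after losing any one failure domain.
worst_case = {
    placement: min(
        sum(1 for fd in placement if fd != failed)
        for failed in failure_domains
    )
    for placement in placements
}
print(worst_case)
```

Each placement keeps 2 of its 3 replicas in the worst case, because a single failure domain never holds more than one replica.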

Each iSCSI gateway instance requires its own node. For redundancy, the two nodes of a pair should not reside within the same failure domain.

The RGW and MDS instances should be spread across the main failure domains as well, keeping the same scheme of having the instances distributed across at least 2 failure domains for redundancy.

The RGW instances could be distributed using different approaches. The two instances per zone of realm One should each be placed into one of the failure domains. The 4 instances per zone of the multi-site realm might be placed differently: the 4 failure domains might be used to provide 4 different access paths to the instances. The requirements state that two of the zones use different networks. This needs refinement, since it could mean either one additional network shared by those two zones or a separate network for each of them. In this example configuration, I will focus on the latter, ending up with two different networks for those 2 zones.

Calculation for the failure domains

The 5 monitor instances described above could be distributed as in the table below. The 5th monitor is not used as a tie-breaker in this design; all monitors are equal.

role type   FD1     FD2     FD3     FD4
MON         2       1       1       1

iSCSI gateways require individual nodes. Distributing the 4 instances across 4 failure domains works best with one instance per failure domain.

role type   FD1     FD2     FD3     FD4
MON         2       1       1       1
iSCSI       1       1       1       1

For the distribution of the 2*2 and 3*4 RGW instances I choose the following approach: since each of the separate networks might be available only in a subset of the failure domains, I assume that each of the two zones requiring a special network uses a pair of failure domains. RGW instances for the first of these zones go into FD1 and FD2, while those of the second go into FD3 and FD4:

role type   FD1     FD2     FD3     FD4
MON         2       1       1       1
iSCSI       1       1       1       1
RGW         2       2       2       2

The other RGW instances are distributed according to the free slots available for collocation: the 4 RGW instances of the 3rd zone in the multi-site realm go across all failure domains, while the first zone of realm One goes into FD1 and FD2 and the second zone into FD3 and FD4.

role type   FD1     FD2     FD3     FD4
MON         2       1       1       1
iSCSI       1       1       1       1
RGW         2+1+1   2+1+1   2+1+1   2+1+1
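The RGW row can be reproduced by tallying the zone placements described above. The zone labels in this sketch are hypothetical names chosen for illustration; only the instance counts and failure domains come from the text.

```python
from collections import Counter

# zone label (hypothetical) -> RGW instances per failure domain
rgw_zones = {
    "multisite-zone-1": {"FD1": 2, "FD2": 2},  # first special network
    "multisite-zone-2": {"FD3": 2, "FD4": 2},  # second special network
    "multisite-zone-3": {"FD1": 1, "FD2": 1, "FD3": 1, "FD4": 1},
    "realm-one-zone-1": {"FD1": 1, "FD2": 1},
    "realm-one-zone-2": {"FD3": 1, "FD4": 1},
}

per_domain = Counter()
for placement in rgw_zones.values():
    per_domain.update(placement)  # adds the per-domain counts

print(dict(per_domain))
# {'FD1': 4, 'FD2': 4, 'FD3': 4, 'FD4': 4}

# Every zone should span at least 2 failure domains for redundancy.
spans_two_domains = all(len(p) >= 2 for p in rgw_zones.values())
print(spans_two_domains)
# True
```

The total is 16 instances, matching the 2*2 + 3*4 instances from the requirements.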

For the distribution of the MDS, a similar scheme can be used. However, since we aim for the minimum number of nodes, we should check the number of “free slots” for scale-out daemons. If no “slot” were available, adding an MDS per failure domain would add as many as 4 nodes to the cluster. In this case, the number of scale-out roles is 6 in the first failure domain but only 5 in the other failure domains. One might conclude that FD2 to FD4 could each accommodate one MDS in the remaining “slot”, but that a failure domain would then need an additional node for the 4th and 5th MDS. This is not quite right, since the number of nodes required is dominated by the number of RGW instances: every failure domain already needs 4 nodes for the RGW instances, and those nodes can accommodate the monitors while still providing room for at least one MDS. A resulting combination that requires no additional nodes could look like this:

role type   FD1     FD2     FD3     FD4
MON         2       1       1       1
iSCSI       1       1       1       1
RGW         2+1+1   2+1+1   2+1+1   2+1+1
MDS         1       1+1     1       1

As all the instances are already placed, the calculation would be:

FD1: Sum_nodes_min = 1 + 0 +
                                          max ( ROUNDUP(
                                                  ( 2 + [2+1+1]
                                                    + [1] + 0
                                                  ) / 2),
                                                  max ( 2, (2+1+1), (1) )
                                            )
                            = 1 + max ( ROUNDUP (7/2), max (2, 4, 1) )
                            = 1 + 4 = 5

FD2: Sum_nodes_min = 1 + 0 +
                                          max ( ROUNDUP(
                                                  ( 1 + [2+1+1]
                                                    + [1+1] + 0
                                                  ) / 2),
                                                  max ( 1, (2+1+1), (1+1) )
                                            )
                            = 1 + max ( ROUNDUP (7/2), max (1, 4, 2) )
                            = 1 + 4 = 5

and for FD3 and FD4:

FD3-4: Sum_nodes_min = 1 + 0 +
                                          max ( ROUNDUP(
                                                  ( 1 + [2+1+1]
                                                    + [1] + 0
                                                  ) / 2),
                                                  max ( 1, (2+1+1), (1) )
                                            )
                            = 1 + max ( ROUNDUP (6/2), max (1, 4, 1) )
                            = 1 + 4 = 5
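The three calculations can be reproduced with a small helper. This is a sketch of the formula as I read it from the worked numbers above: the leading terms (1 + 0) are passed as a count of dedicated nodes, the divisor 2 is taken to mean two collocated scale-out daemons per node, and the inner max reflects that instances of the same role do not share a node. The function and parameter names are hypothetical.

```python
import math


def sum_nodes_min(dedicated_nodes, scale_out_counts, daemons_per_node=2):
    """Minimum nodes for one failure domain.

    dedicated_nodes:  nodes for roles that cannot be collocated (here: iSCSI)
    scale_out_counts: instances per collocatable role, e.g. MON, RGW, MDS
    """
    by_density = math.ceil(sum(scale_out_counts.values()) / daemons_per_node)
    by_largest_role = max(scale_out_counts.values())
    return dedicated_nodes + max(by_density, by_largest_role)


domains = {
    "FD1": {"MON": 2, "RGW": 2 + 1 + 1, "MDS": 1},
    "FD2": {"MON": 1, "RGW": 2 + 1 + 1, "MDS": 1 + 1},
    "FD3": {"MON": 1, "RGW": 2 + 1 + 1, "MDS": 1},
    "FD4": {"MON": 1, "RGW": 2 + 1 + 1, "MDS": 1},
}

results = {fd: sum_nodes_min(1, counts) for fd, counts in domains.items()}
print(results)
# {'FD1': 5, 'FD2': 5, 'FD3': 5, 'FD4': 5}
```

In all four failure domains the 4 RGW instances set the inner maximum, which is why the node count does not change between FD1, FD2, and FD3-4.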

In this case, all the failure domains would have an equal number of nodes and the following configuration could be used:

role type   FD1     FD2     FD3     FD4
MON         2       1       1       1
iSCSI       1       1       1       1
RGW         2+1+1   2+1+1   2+1+1   2+1+1
MDS         1       1+1     1       1
sum:        5       5       5       5

Resulting configuration

The minimum number of nodes per failure domain is mainly dictated by the number of RGW instances required, because they need to run with one instance per node. Here too, if the environment could provide VMs for this task instead, the number of required nodes per main failure domain would come down to only 3 nodes each in FD1 and FD2 and 2 nodes each in FD3 and FD4.
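The VM variant can be checked in the same way. The sketch below assumes my reading of the Sum_nodes_min calculation (one dedicated iSCSI node plus the larger of "scale-out daemons divided by 2, rounded up" and "largest single role count") and simply leaves the RGW instances out of the count; the names are hypothetical.

```python
import math


def nodes_without_rgw(mon, mds, iscsi_nodes=1, daemons_per_node=2):
    """Minimum nodes for one failure domain when RGW runs on VMs."""
    by_density = math.ceil((mon + mds) / daemons_per_node)
    by_largest_role = max(mon, mds)
    return iscsi_nodes + max(by_density, by_largest_role)


vm_results = {
    "FD1": nodes_without_rgw(mon=2, mds=1),
    "FD2": nodes_without_rgw(mon=1, mds=2),
    "FD3": nodes_without_rgw(mon=1, mds=1),
    "FD4": nodes_without_rgw(mon=1, mds=1),
}
print(vm_results)
# {'FD1': 3, 'FD2': 3, 'FD3': 2, 'FD4': 2}
```

Under these assumptions the cluster would need 10 hardware nodes instead of 20.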


Interestingly, the additional failure domains in this article did not reduce the number of nodes, because the numerous RGW instances still have to be spread across individual nodes. But as in the previous example, the number of required hardware nodes can be dramatically reduced if the RGW instances can be provided by an existing virtualization solution.

In the next articles, the examples will introduce Erasure Coding as one of the data redundancy schemes that are frequently used for RGW use cases. Thanks for reading so far and stay tuned!

