Collocation examples – Simple one: 3 failure domains – replication of 3



This article continues the foundational discussion of calculating the number of nodes for collocation in Ceph, started in “Collocation and number of nodes in a Red Hat Ceph cluster”, and the advanced themes of respecting failure domains covered in the subsequent articles. It opens a series of examples that illustrate how to compute the number of nodes required for clusters with different configuration requirements and different failure domain structures. The cases build on the previous examples, which did not respect failure domains. This first one covers a simple case to show the mechanisms.

The example of Case E

A simple cluster should be created with 3 MONs, 2 MDS, 2 iSCSI gateways, and 2 different zones for RGW, with the dashboard server installed on the cluster nodes. Because the two RGW zones should be served by redundant instances, we need 4 RGW instances, two per zone. Each zone represents one use case for us. This is exactly the example called Case A from “Collocation and number of nodes in a Red Hat Ceph cluster” – with the difference that we now have 3 failure domains.

The simple formula for Sum_nodes_min from the initial article cannot be used as it is, because we need to distribute the roles for redundancy; applying it blindly would end up with a wild mix of roles and instances per top-level failure domain. An intelligent distribution is needed instead, one that limits the number of nodes while maintaining the redundancy of the roles as well as room to scale out with additional role instances. I’ll go through each role separately and explain the reasoning behind the options.

Explaining the requirements and dependencies

First, MONs and MGRs will stay together but need to be distributed across the 3 failure domains. Since there are 3 MONs, each failure domain should get one, so that if one failure domain becomes unavailable, the remaining cluster can continue based on a proper MON quorum.

The 2 iSCSI gateways are special: each instance needs its own additional node. To distribute the 2 gateways, one option is to place each gateway into a different failure domain out of the three available. I could work this out mathematically, but in a configuration this small it is fairly obvious where the iSCSI gateways should go. With no additional locality requirements, there is full freedom for the placement, and I can optimize it based on the requirements of the other roles.

There are 4 RGW instances to distribute. Since these are not just 4 interchangeable instances but two pairs for two different use cases, the two instances of a pair must not be placed into the same failure domain. It would be fine to place two RGW instances from different pairs in the same failure domain, even on the same node.

The MDS instances should be placed into different failure domains.

Calculation for the failure domains

Now I can start the calculation of the minimum number of nodes per failure domain. Since I have some degree of freedom in placing the scale-out roles, I need to pick a specific distribution for the example – based on the exclusive placement of the iSCSI gateways:

FD1: 1 MON/MGR, 1 MDS, 1+1 RGW
FD2: 1 MON/MGR, 1 MDS, 1 RGW, dashboard, 1 iSCSI gateway
FD3: 1 MON/MGR, 1 RGW, 1 iSCSI gateway
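As a cross-check, the placement constraints described above can be encoded in a small Python sketch. The role names and the dictionary structure are purely illustrative, not part of any Ceph tooling:

```python
# Placement per failure domain; role names are illustrative.
# "rgw_zone1"/"rgw_zone2" stand for the two RGW pairs (one per zone).
placement = {
    "FD1": ["mon_mgr", "mds", "rgw_zone1", "rgw_zone2"],
    "FD2": ["mon_mgr", "mds", "rgw_zone1", "dashboard", "iscsi"],
    "FD3": ["mon_mgr", "rgw_zone2", "iscsi"],
}

def check(placement):
    # One MON/MGR per failure domain keeps the quorum if one domain fails.
    assert all("mon_mgr" in roles for roles in placement.values())
    # Each RGW zone needs two instances, in different failure domains.
    for zone in ("rgw_zone1", "rgw_zone2"):
        assert sum(roles.count(zone) for roles in placement.values()) == 2
        assert all(roles.count(zone) <= 1 for roles in placement.values())
    # The two MDS and the two iSCSI gateways go into different domains.
    for role in ("mds", "iscsi"):
        assert sum(roles.count(role) for roles in placement.values()) == 2
        assert all(roles.count(role) <= 1 for roles in placement.values())

check(placement)  # raises AssertionError if a constraint is violated
```

A distribution that violated a constraint – for instance both iSCSI gateways in FD2 – would fail one of the assertions.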

The calculation per failure domain will be:

FD1: Sum_nodes_min = 0 + 0 + max( ROUNDUP( (1 + [1+1] + [1] + 0) / 2 ),
                                  max( 1, (1+1), (1) ) )
                   = 0 + max( ROUNDUP(4/2), max(1, 2, 1) )
                   = 0 + 2 = 2

FD2: Sum_nodes_min = 1 + 0 + max( ROUNDUP( (1 + [1] + [1] + 1) / 2 ),
                                  max( 1, (1), (1) ) )
                   = 1 + max( ROUNDUP(4/2), max(1, 1, 1) )
                   = 1 + 2 = 3

FD3: Sum_nodes_min = 1 + 0 + max( ROUNDUP( (1 + [1] + [0] + 0) / 2 ),
                                  max( 1, (1), (0) ) )
                   = 1 + max( ROUNDUP(2/2), max(1, 1, 0) )
                   = 1 + 1 = 2
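The per-failure-domain calculation can also be written as a short Python helper. The function and parameter names are my own, and the divisor of 2 is the collocation factor used above (at most two collocated role instances per node):

```python
import math

def nodes_min(dedicated_nodes, role_counts, collocation_factor=2):
    """Minimum number of nodes for one failure domain.

    dedicated_nodes    -- nodes that cannot collocate (e.g. one per iSCSI gateway)
    role_counts        -- instance count per collocatable role in this domain
    collocation_factor -- maximum collocated role instances per node
    """
    total = sum(role_counts)
    # A node hosts at most `collocation_factor` role instances,
    # but never two instances of the same role.
    shared = max(math.ceil(total / collocation_factor), max(role_counts))
    return dedicated_nodes + shared

# Role counts per domain: [MON/MGR, RGW, MDS, dashboard]
fd1 = nodes_min(0, [1, 2, 1, 0])  # no dedicated iSCSI node
fd2 = nodes_min(1, [1, 1, 1, 1])  # one dedicated iSCSI node
fd3 = nodes_min(1, [1, 1, 0, 0])  # one dedicated iSCSI node
print(fd1, fd2, fd3)  # 2 3 2
```

The results match the hand calculation: 2, 3, and 2 nodes for FD1, FD2, and FD3.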

Resulting configuration

In this configuration I’ll need 2 nodes for the first failure domain, 3 nodes for the second, and again 2 for the third – 7 nodes in total. This gives me the following installation scheme:

I have a different number of nodes in the different failure domains, and the target configuration for distributing OSDs across nodes should respect this in the design. In most cases, using replica 3, it is hard to distribute media across a different number of nodes, because the number of drive slots tends to be the same across the selected chassis – whereas with a different number of slots per node this might not be a problem, provided the network bandwidth is sufficient. One is also free to not populate all slots equally within the supplied nodes.

For an initial cluster design, however, using similar nodes is recommended, and thus I would rather add nodes to get an equal distribution of the OSDs. Of course, this costs more in terms of chassis, space, network ports, and power & cooling – but it keeps smaller clusters supportable. In addition, it provides room for replacement instances of the roles if a node in a certain failure domain fails and cannot be repaired in time: a new instance can then be started as a replacement on a node not already constrained by the placement rules for collocation. Planning for this from the start means adding more resources – CPU and memory – than the designed roles actually need, so that those replacement instances can be placed.

My final design, irrespective of the number of OSDs needed for capacity and performance, would look like:

As a variation of this design, the roles inside the failure domains could be placed differently. For instance, one of the RGW instances could go onto node 3, and so on.

In the picture above, nodes 3 and 9 can be used to start replacement instances for any of the roles, even an iSCSI gateway instance. If I had chosen to place the RGW instances onto nodes 3 and 9, only MON/MGR, MDS, and dashboard replacement instances could be started there – but neither an rbd-mirror, nor an RGW, nor an iSCSI gateway instance.


This article, the first in a series of collocation examples that respect failure domains, covered a fairly simple case, but one that can easily be compared with the initial examples calculated without failure domains in place. Additional, more complex examples and variations will follow. Thanks for reading so far and stay tuned!

