Advanced Collocation – Placement of the Additional Roles



The previous article, “Advanced Collocation – Data Placement Constraints”, carried the discussion of collocating roles in Ceph clusters forward to the first additional constraint: data placement using either replication or Erasure Coding as the redundancy scheme. There I introduced the additional failure domains to respect. In this article, the additional roles, such as MON, MDS, and others, will be covered, with special considerations to plan for when selecting their placement.

Scaling the roles across a structure of nodes

As the number of nodes grows, more room becomes available to place the additional roles onto some of the nodes in a collocated manner. However, it is not only the scale of the cluster that limits the placement of the additional roles; the failure domains impact the placement as well, and this additional element must be considered to arrive at a viable design.

Failure domains might have a simple structure in which all the contained nodes are equal. This is the easiest case for determining the possible placement across the failure domains while still maintaining the rules of collocation: distribute enough role instances across the different failure domains to provide the required redundancy, and then apply the rules for calculating the required number of nodes for collocation as explained in the previous article “Collocation and number of nodes in a Red Hat Ceph cluster”.
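
This even spread can be illustrated in a few lines of Python. The following is a minimal sketch, assuming three equally sized failure domains with illustrative names; it distributes role instances round-robin and counts how many survive the loss of any single domain:

```python
from collections import Counter

def spread(domains, count):
    """Round-robin placement of `count` role instances across failure domains."""
    return Counter(domains[i % len(domains)] for i in range(count))

def survivors(placement, failed):
    """Instances still running after one failure domain is lost."""
    return sum(n for domain, n in placement.items() if domain != failed)

# Three equal failure domains, three instances: one per domain.
placement = spread(["dc1", "dc2", "dc3"], 3)
print(dict(placement))              # {'dc1': 1, 'dc2': 1, 'dc3': 1}
print(survivors(placement, "dc1"))  # 2 instances remain after losing dc1
```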

The exact number of instances one would choose in the real world, however, depends on the structure of the cluster and perhaps on the required bandwidth: with two failure domains, at least one instance per failure domain makes sense. An additional instance per failure domain would add more bandwidth and could help when the instances of one failure domain are unavailable; this could be a requirement for RGW when a similar bandwidth must be maintained even through a data center failure. Further considerations might come into play: placing three instances properly into two failure domains can be challenging but is possible. If one failure domain is the main path for the workloads, having at least one additional instance in the other failure domain preserves access to the data. However, the bandwidth available to the clients might be limited once a switch-over to the latter instance is needed.

In the drawing above, the left side illustrates the placement with plain scale. However, with a limited number of nodes, even in this approach the dashboard instance already prevents having an equal number of RGW instances in each of the failure domains. The right side shows a main/secondary structure. If the main DC fails, the cluster would not be operable: the data, based on replica 3, would be preserved, but the remaining RGW instance is almost useless because no traffic will be served due to the missing quorum for the monitors. Not all placements work out well, and understanding these additional dependencies is required for planning. Note also that the dashboard is missing on the right side of the drawing; it could be placed somewhere external to the cluster itself. The same strategy could have been applied on the left side to achieve an equal number of RGW instances on all nodes.
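
The quorum problem on the right side can be made concrete with a small check. This is a minimal sketch, assuming a hypothetical 3+2 MON split between the main and the secondary data center:

```python
# Hypothetical MON placement: three MONs in the main DC, two in the secondary.
mons = {"main": 3, "secondary": 2}

def quorum_survives(mon_placement, failed_domain):
    total = sum(mon_placement.values())
    surviving = total - mon_placement.get(failed_domain, 0)
    return surviving > total // 2  # a MON quorum needs a strict majority

print(quorum_survives(mons, "secondary"))  # True: 3 of 5 MONs remain
print(quorum_survives(mons, "main"))       # False: 2 of 5 MONs remain
```

With only two failure domains, one of them always holds the majority of the MONs, so losing that domain always costs the quorum; this is exactly why the surviving RGW instance cannot serve traffic.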

Distributing a limited number of instances per role

In contrast, there might be only a limited number of instances supported per cluster: the number of volumes presented by any single iSCSI gateway instance is limited, so the total number of volumes must be watched carefully. Three iSCSI gateways would make perfect sense if there are at least three failure domains, but only two or four gateways per cluster are officially supported. In this example, one could either go with two iSCSI gateways, further limiting the number of volumes, or pick one failure domain for two gateways while planning only one iSCSI gateway for each of the other two failure domains, as sketched below.
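
The following is a minimal sketch of the second option, assuming hypothetical gateway and failure domain names: four gateways placed 2+1+1, with the resulting asymmetry of a domain outage made visible:

```python
# Hypothetical 2+1+1 placement of the four supported iSCSI gateways.
placement = {"dc1": ["gw1", "gw2"], "dc2": ["gw3"], "dc3": ["gw4"]}

def remaining_gateways(placement, failed_domain):
    """Gateways still reachable after one failure domain is lost."""
    return [gw for domain, gws in placement.items()
            if domain != failed_domain for gw in gws]

# Losing dc1 costs half of the gateways (and their volume capacity),
# losing dc2 or dc3 only a quarter: an asymmetry to assess upfront.
for dc in placement:
    print(dc, "->", remaining_gateways(placement, dc))
```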

Placing MDS instances might likewise involve planning the location and number of daemons based on their roles: multiple MDS might serve the same use case and file system, and these might comprise active and passive MDS. A passive MDS can take over the workload from an active MDS in case of a failure, while an active MDS cannot act as a stand-in. Placing enough additional passive MDS might therefore be desired. In addition, there might be further structure based on the different ranks that apply to the MDS instances. In all cases, the failure domains provide the structure here as well and might drive the need for additional instances.

In the drawing below, only two active MDS might be needed. To provide a failover for all active MDS, two passive MDS are also required. The two active MDS are placed into failure domain 1, and the two passive MDS are distributed across failure domains 2 and 3. In case of an outage of failure domain 1, the two active MDS can be covered by the two passive MDS, illustrated by the MDS with the gray background. The passive MDS could also be placed inside the same failure domain as one of the active MDS, or both passive MDS could go into failure domain 2 or 3, but never more than two instances should end up in any single failure domain.
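
The coverage argument can be checked programmatically. This is a minimal sketch, assuming the hypothetical layout from the drawing with two active MDS in failure domain 1 and one passive MDS each in failure domains 2 and 3:

```python
# Hypothetical layout: two active MDS in fd1, one passive each in fd2/fd3.
mds = {
    "fd1": {"active": 2, "passive": 0},
    "fd2": {"active": 0, "passive": 1},
    "fd3": {"active": 0, "passive": 1},
}

def covered_after_failure(layout, failed):
    """Can the passive MDS outside the failed domain cover all active
    MDS lost with it?"""
    lost_active = layout[failed]["active"]
    free_passive = sum(v["passive"] for d, v in layout.items() if d != failed)
    return free_passive >= lost_active

for fd in mds:
    print(fd, covered_after_failure(mds, fd))
# fd1 True, fd2 True, fd3 True: every single-domain outage is covered.
```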

Sub-level failure domains and access locality

With additional sub-level failure domains to respect in the structure, the decisions might deviate further: if an unlimited number of instances were possible (which is not the case at any scale), all instances for a given use case could be planned into all the sub-level failure domains. This, in turn, is limited by the number of nodes required and hence might not be desirable.

The case of a limited number of possible instances per role was explained before, but within the sub-level failure domains, too, one needs to select the placement of the instances while understanding the possible impact on availability if a sub-level or top-level domain becomes unavailable. A risk assessment based on the given failure domain structure is necessary to determine the optimal placement.
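
Such a risk assessment can be automated by enumerating the failure scenarios. This is a minimal sketch, assuming a hypothetical two-level structure of data centers containing racks, with one gateway instance per occupied rack:

```python
# Hypothetical two-level structure: (DC, rack) -> number of instances.
placement = {
    ("dc1", "rack1"): 1, ("dc1", "rack2"): 1,
    ("dc2", "rack1"): 1, ("dc2", "rack2"): 0,
}

def survivors_after(placement, failed):
    """Instances surviving the loss of a rack ('dc', 'rack') or of a
    whole DC ('dc',)."""
    return sum(n for loc, n in placement.items()
               if loc[:len(failed)] != failed)

# Enumerate top-level and sub-level failure scenarios.
scenarios = [("dc1",), ("dc2",)] + list(placement)
for failed in scenarios:
    print(failed, "->", survivors_after(placement, failed), "instances left")
```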

In general, the placement of the redundant role instances can be chosen arbitrarily. For all daemons except the MONs, however, locality of the roles might benefit client access performance if increased latency exists between the failure domains, which is usually the case when the failure domains are separated by a reasonable but significant distance. If client access can avoid crossing this distance to reach the data access gateway (iSCSI, RGW) or service (MDS), better latency can be achieved as long as the path avoiding it remains usable. The gateways could therefore be placed near the clients, avoiding crossing the boundaries between failure domains.

For iSCSI, a single mapping always uses a single session to only one of the two iSCSI gateways; over time, both might be used by the same client, but only after a failover to the other one. If the active sessions can be crafted such that a local client accesses the local iSCSI gateway by default, clients benefit from this lower latency. In some environments, this can be a hard requirement for certain workloads and would further influence the placement of the instances across failure domains, perhaps requiring additional nodes.
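
A minimal sketch of such locality-aware access, assuming hypothetical gateway and failure domain names: the client prefers a gateway in its own failure domain and falls back to a remote one only when no local gateway is healthy:

```python
# Hypothetical gateways and their failure domains.
GATEWAY_DOMAIN = {"gwA": "dc1", "gwB": "dc2"}

def pick_gateway(client_domain, healthy_gateways):
    """Prefer a gateway in the client's own failure domain; otherwise
    fall back to any healthy remote gateway."""
    local = [gw for gw in healthy_gateways
             if GATEWAY_DOMAIN[gw] == client_domain]
    return (local or healthy_gateways)[0]

print(pick_gateway("dc1", ["gwA", "gwB"]))  # gwA: local, lower latency
print(pick_gateway("dc1", ["gwB"]))         # gwB: remote failover path
```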


In this article, the last formal elements for the calculation of the number of nodes in collocation with respect to failure domains have been introduced. These constraints and the possible design decisions will be applied in examples illustrating the calculation process, starting with the next article. Thanks for reading so far and stay tuned!

