Collocation and number of nodes in a Red Hat Ceph cluster



For a few versions now, Red Hat Ceph Storage has supported the collocation of instances of different roles on the same nodes. The instances of those roles cannot be placed together at random but must follow certain rules. These rules are described in the documentation and in a knowledge base article and provide guidelines to follow for a fully supported cluster design. The possible configurations have frequently been a source of discussion, and this post tries to help people new to Ceph cluster design get it right from the start.

Before Red Hat Ceph Storage 4, the default deployment used a distinct machine for each instance, except for the OSDs. This scheme was demanding on hardware, even though some of the roles didn't require many compute resources. It avoided interference between roles and prevented resource shortages, but at much higher cost for the separation across nodes. The newer approach uses containers on the cluster nodes for resource management as well as isolation between the different roles. Based on testing and customer experience, not all roles turned out to be suitable neighbours for all others.

The foundation of using collocation

In general, the motivation for combining different roles on the same nodes is saving hardware and infrastructure costs. Every node requires additional space in the data centre, additional ports in network switches, and additional effort for cooling and power, so the idea is to place as many additional roles as possible on the already beefy OSD nodes managing the storage media.

All kinds of roles can be located on the same nodes as the OSDs. These roles are: the monitors (MON), the managers (MGR), the Object Storage Gateway (RGW, aka RADOS gateway), metadata servers for CephFS (MDS), the dashboard server, the iSCSI gateways (iSCSI), the NFS servers for CephFS (NFS/cephfs), NFS servers for RGW (NFS/rgw), and the rbd-mirror daemons. This is the list for Red Hat Ceph Storage 5; the iSCSI gateway is no longer part of the product as of version 6.

All of those roles usually run multiple instances to provide higher availability than a single instance could. Some have a required minimum for production, like the monitors, which need 3 instances. The only role without redundancy as of now is the dashboard server. Other roles, like the RGW, might be fine with only 2 instances but may require more for scalability. There are also roles with a limited number of instances allowed per cluster, like the iSCSI gateway.

Special handling for a few roles

Some of the roles listed above are somewhat unpredictable and may impact other roles running on the same nodes. The roles allowed to be collocated with OSDs but with no other roles on the same node are:

  • iSCSI gateway
  • rbd-mirror

Some roles provide additional services to another role. These are the NFS gateways, which supplement the CephFS and RGW services. They are seen as extensions of those roles and even benefit from close coupling on the same nodes. They therefore run alongside the roles they support:

  • NFS gateway for CephFS – alongside with metadata server (MDS)
  • NFS gateway for RGW – alongside with the RGW instances

There is one additional role that is recommended to be deployed alongside another one: the manager (MGR) instances should be deployed on the same nodes as the monitors (MON).

The ground rules

Any role can be deployed on individual, distinct nodes. No additional role instance needs to be deployed on an OSD node; it is still possible to use, as in the traditional model, a distinct node per role instance.

All nodes with collocated role instances can carry as many OSDs as are supported. Other rules and dependencies limit the number of OSDs per node: additional restrictions come from the resources that can be allocated to the OSDs, namely the number of CPU cores, the amount of memory, and the available network bandwidth.

Whatever the maximum number of OSDs per node turns out to be, additional CPU, memory, and network resources must remain available to the operating system, and every additional role instance adds its own resource needs. Planning collocation therefore requires special care to supply sufficient resources to all parts: the operating system, all the OSDs, and all the additional roles.

Instances of roles that allow no other neighbours on the same node can be deployed on any OSD node. An additional instance of the same role is not allowed on that node. Taking an iSCSI gateway as an example, one can install one instance on one node, but the second instance must go on another node. That other node, however, also cannot run any other role instances besides the OSDs.

Roles that allow other neighbours on the same node can run alongside other role instances, in addition to the OSDs. The limitation is that no node may run more than 2 additional role instances from this list. A further restriction is that one cannot run 2 instances of the same role on the same node. Although it was found that 2 RGW instances on the same node provide better throughput than 2 instances on different nodes, the current rules don't allow this.
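The rules so far can be sketched as a small placement check. This is only an illustrative sketch: the role names and the per-node role list are my own shorthand, not part of any Ceph tooling.

```python
# Illustrative sketch of the collocation rules above; role names and the
# node layout structure are hypothetical, not a real Ceph API.
SOLO_ROLES = {"iscsi_gw", "rbd_mirror"}                   # no neighbours allowed
SCALE_OUT_ROLES = {"mon_mgr", "mds", "rgw", "dashboard"}  # friendly roles

def check_node(roles):
    """Validate the non-OSD role instances placed on one node.

    `roles` lists the additional role instances on the node; OSDs are
    implicit and always allowed alongside.
    """
    solo = [r for r in roles if r in SOLO_ROLES]
    if solo and len(roles) > 1:
        return False  # a solo role allows no other neighbour on its node
    # at most 2 additional instances per node, and never 2 of the same role
    if len(roles) > 2 or len(set(roles)) != len(roles):
        return False
    return True

print(check_node(["mon_mgr", "rgw"]))   # True: two different scale-out roles
print(check_node(["rgw", "rgw"]))       # False: same role twice on one node
print(check_node(["iscsi_gw", "mds"]))  # False: the iSCSI gateway needs privacy
```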

Leaving the divas in their privacy – one node for each diva

The two kinds of roles that do not allow another neighbour on the same node are the iSCSI gateway and the rbd-mirror.

The number of iSCSI gateways supported is between 2 and 4. Additional gateways are not allowed in the same cluster.

Instances of rbd-mirror need their own nodes as well and must be separated. Until recently, the number of instances per cluster was limited; with Red Hat Ceph Storage 6, more instances are supported for scalability.

Both kinds of special roles would require a separate node to run an individual instance. The number of required nodes for these kinds of role instances equals exactly the number of role instances needed:

Sum_d = #iSCSI_gw + #rbd_mirror

Scaling for more

The initial list of roles contains several kinds that can be deployed alongside other role instances on the same nodes. These are usually called the "scale-out" daemons: the MON/MGR couple, MDS, RGW, and the dashboard server.

Each role can have a defined minimum number of instances as well as a reasonable number of additional instances for other purposes, such as scalability.

Monitors should number a minimum of 3 in any production cluster. More monitors can be used, and a common count is 5. For larger clusters, even more monitors might be a good idea, as there are more instances to serve. The recommended maximum is currently 9 monitors. As the Ceph rules state, an odd number of monitors is required in a stable cluster. The latter side note reflects a best practice that should be followed, but in some situations you might temporarily have an even number of monitors, especially when adding and removing monitors to adapt to a new failure domain structure. In any configuration scheme, since you cannot place two monitor instances on the same node, the minimum number of nodes in the cluster equals at least the number of monitors deployed.

Managers (MGR) should be deployed alongside the monitor instances. The recommended number of MGR instances is 3 to be able to always find a surviving instance to control the cluster. However, since those would run alongside the monitor instances, no additional nodes are needed.

RGW instances should number a minimum of 2 to avoid all clients losing their connection when the node running the single RGW instance becomes unreachable or the instance runs into trouble. Since the S3 protocol is stateless and a single gateway can only provide limited throughput, multiple RGW instances can be added to the cluster to grow the available bandwidth for S3 access. Those additional instances then need additional nodes, since one cannot place two RGW instances on the same node.

Additional scaling might be required if the RGW instances must be separated for some reason. This could be the case if different zones should be serviced, another realm with a totally separate namespace is needed, or network access must be separated. All these separations require additional sets of RGW instances, each usually configured with a minimum of 2 instances for redundancy, or more for higher bandwidth.

Each CephFS filesystem configured in the cluster should have at least one active metadata server (MDS) and one passive MDS. Additional active MDS instances could be desired if the load on one active MDS is too high or too many files require additional space for caching the metadata. Every additional active MDS raises the likelihood that one of them fails, so additional passive MDS instances might be desired as stand-ins. All these individual instances must be placed on different nodes.

The dashboard server is currently the only role that has just a single instance. This instance also needs to be placed on a node. Alternatively, since the dashboard server is a single instance only, hosting it on a more reliable structure than a single node could be an option: a highly available VM could do the trick, and no node would then be needed for the dashboard server role in the Ceph cluster itself.

The NFS daemons for CephFS and RGW are additional services and should run alongside their base roles. The NFS services thus require no additional nodes, although one could decide to separate them from their base roles. Sometimes it might be desired to deploy fewer NFS service instances than base role instances; however, this does not change the number of needed nodes as long as the service is deployed alongside a role instance.

Friendly roles save resources – one node for two

All the roles counted as "scale-out daemons" above can share a node with a different neighbour. However, a maximum of two different additional role instances is allowed on each node, regardless of the number of OSDs deployed there.

The overall number of nodes required is now easy to find using simple calculations:

  • the minimum number of nodes is half of all the mentioned role instances without special handling, rounded up: for example, 3 MONs, 2 RGWs, and 2 MDS give 7 instances => 4 nodes, with 3 nodes carrying two instances each and one node carrying one;
  • the minimum number of nodes is also at least the largest count of role instances of a single kind: for example, 6 RGW instances require at least 6 nodes, each carrying one RGW instance;
  • the actual minimum is the larger of these two numbers.

This gives the formula as:

Sum_s = max ( ROUNDUP( ( #MONs
                         + [#RGWs_use_case_1 + .. + #RGWs_use_case_N]
                         + [#MDSs_use_case_1 + .. + #MDSs_use_case_M]
                         + dashboard_server
                       ) / 2 ),
              max ( #MONs,
                    (#RGWs_use_case_1 + .. + #RGWs_use_case_N),
                    (#MDSs_use_case_1 + .. + #MDSs_use_case_M)
                  )
            )

The pure math, of course, might not be sufficient for a proper design; it gives only the minimum number of nodes under the plain assumption of a flat failure domain structure, with the nodes alone being the only failure domain. With rack structures, network switch hierarchies, power or cooling grid structures, etc., the number of nodes might grow. This will be discussed in a later post, after the discussion of the basic minimum calculation is complete.

Summing it up for the minimum number of nodes

The minimum number of nodes to place all the role instances properly will be the sum of the special nodes for our divas plus the number of nodes required for our “scale-out daemons”:

Sum_nodes_min = Sum_d + Sum_s

and finally (for now):

Sum_nodes_min = #iSCSI_gw
                + #rbd_mirror
                + max ( ROUNDUP( ( #MONs
                                   + [#RGWs_use_case_1 + .. + #RGWs_use_case_N]
                                   + [#MDSs_use_case_1 + .. + #MDSs_use_case_M]
                                   + dashboard_server
                                 ) / 2 ),
                        max ( #MONs,
                              (#RGWs_use_case_1 + .. + #RGWs_use_case_N),
                              (#MDSs_use_case_1 + .. + #MDSs_use_case_M)
                            )
                      )
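The formula can also be written as a few lines of Python. This is only a sketch of the calculation described above; the function and parameter names are my own, not from any Ceph tooling.

```python
import math

def min_nodes(mons, rgw_per_use_case, mds_per_use_case,
              dashboard_on_cluster, iscsi_gws, rbd_mirrors):
    """Minimum node count per the formula above (flat failure domain only).

    `rgw_per_use_case` and `mds_per_use_case` are lists with the instance
    count per use case, e.g. one entry per RGW zone or per CephFS filesystem.
    """
    # the divas get one dedicated node per instance
    sum_d = iscsi_gws + rbd_mirrors
    rgws, mdss = sum(rgw_per_use_case), sum(mds_per_use_case)
    dashboard = 1 if dashboard_on_cluster else 0
    # at most 2 scale-out instances per node, and never 2 of the same kind
    sum_s = max(math.ceil((mons + rgws + mdss + dashboard) / 2),
                max(mons, rgws, mdss))
    return sum_d + sum_s

# 3 MONs, two RGW zones with 2 instances each, one CephFS with 2 MDS,
# dashboard on the cluster, 2 iSCSI gateways
print(min_nodes(3, [2, 2], [2], True, 2, 0))  # 7
```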

Recall that the dashboard server might not be installed on a Ceph cluster node at all.

Example minimum configurations

Case A: A simple cluster with 3 MONs, 2 MDS, 2 iSCSI gateways, 2 different zones for RGW, and the dashboard server installed on the cluster nodes. Because the two RGW zones should be supported with redundant instances, we need 4 instances, two per zone. Each zone represents a use case for us. The calculation would be:

Sum_nodes_min = 2 + 0
                + max ( ROUNDUP( ( 3 + [2 + 2] + [2] + 1 ) / 2 ),
                        max ( 3, (2 + 2), (2) )
                      )
              = 2 + max ( ROUNDUP(10/2), max(3, 4, 2) )
              = 2 + 5
              = 7

A viable deployment could look like:

  • 3 nodes having MON/MGR + RGW instances,
  • one node has one MDS and one RGW instance,
  • one node has the other MDS plus the dashboard server,
  • and two nodes have one iSCSI gateway per node.

The example drawing below uses only 2 OSDs per node, but there could be more:

Case B: A complex cluster should be created with: 

  • 5 MONs, 
  • CephFS filesystem 1 with 1 active and 1 passive MDS, 
  • CephFS filesystem 2 with 2 active and one passive MDS, 
  • 2 different sets of redundant iSCSI gateways, 
  • 2 zones for RGW realm One with 2 instances each, 
  • 3 zones for multi-site replicated RGW realm Two with different networks and 4 RGW instances each for higher throughput,
  • the dashboard should be hosted by a VM.

We've got two separate CephFS file systems to create = 2 use cases for MDSs. We've got 5 different RGW use cases in total. The dashboard doesn't count, since it's not located on a cluster node.

Because realm One's two zones should be supported with redundant instances, they need 4 instances, two per zone; realm Two's three zones need 4 instances each, or 12 in total. Each zone represents a use case for us. The calculation would be:

Sum_nodes_min = 2 * 2 + 0
                + max ( ROUNDUP( ( 5 + [2 + 2 + 3 * 4] + [2 + 3] + 0 ) / 2 ),
                        max ( 5, (2 + 2 + 12), (2 + 3) )
                      )
              = 4 + max ( ROUNDUP(26/2), max(5, 16, 5) )
              = 4 + 16
              = 20
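The Case B arithmetic can be double-checked with a few lines of plain Python; the numbers simply mirror the calculation above.

```python
import math

# divas: two redundant iSCSI gateway pairs, no rbd-mirror
sum_d = 2 * 2 + 0
# scale-out instances: MONs + RGWs (all 5 use cases) + MDSs + dashboard (on a VM)
scale_out = 5 + (2 + 2 + 3 * 4) + (2 + 3) + 0
sum_s = max(math.ceil(scale_out / 2),        # pairing limit: 2 per node
            max(5, 2 + 2 + 3 * 4, 2 + 3))    # never 2 of a kind per node
print(sum_d + sum_s)  # 20
```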

A viable deployment could look like:

  • 5 nodes with MON/MGR + RGW,
  • 5 nodes with MDS + RGW,
  • 6 nodes with RGW only for the remainder of the needed RGW instances (the dashboard lives on a VM outside the cluster),
  • and 4 nodes with iSCSI gateways.

The example drawing below uses only 2 OSDs per node, but there could be more:

Case C: A small but sophisticated cluster with 3 MONs, 2 MDS, 2 iSCSI gateways, 2 different zones for RGW, and a dashboard server. In addition, the CephFS should be available through NFS. Because the two RGW zones should be supported with redundant instances, we need 4 instances, two per zone. Each zone represents a use case for us.

This case is very similar to Case A, with the deviation that we add another service to one of the roles. The calculation is also similar, since the additional services don't count toward the number of role instances for collocation:

Sum_nodes_min = 2 + 0
                + max ( ROUNDUP( ( 3 + [2 + 2] + [2] + 1 ) / 2 ),
                        max ( 3, (2 + 2), (2) )
                      )
              = 2 + max ( ROUNDUP(10/2), max(3, 4, 2) )
              = 2 + 5
              = 7

Although the additional services don't increase the number of nodes needed, resources must be available on the nodes for the services that run in addition to the role instances. Here too, the example drawing uses only 2 OSDs per node, but there could be more, and a viable deployment could look like:

Case D: If we would like to add NFS gateways to all the RGW instances for up-/download between NFS and the object world, we would add one NFS service instance for each and every RGW role instance. As in example C with access to CephFS via the NFS gateways, this would not grow the number of needed nodes. However, the resources for those additional services must be available on those nodes. The following drawing shows this example configuration.

Preliminary conclusion

This article showed the ground rules of what collocation requires for supported configurations of Red Hat Ceph Storage and the calculation of the minimum number of nodes required. However, since a Ceph cluster might be structured across different failure domains, the calculation shown here applies only to clusters with a flat failure domain structure, where only the hosts are distinguished. It also assumes each node sits in its own chassis, not sharing a chassis with other nodes.


The additional things to check and plan for with regard to failure domains will be the theme of the next article in this series. We'll also carry this forward with a reflection on the individual redundancy schemes of the different pools that might be involved. So stay tuned for the next article, coming soon.

