An ODF cluster for 2 main DC – preparation



OpenShift Data Foundation is, in a standard installation, deployed across three different failure domains or within a single one. Many companies, however, operate only two main data centers. Such legacy architectures were built for stretched deployments with shared storage providing access and redundancy at both sites. OpenShift Data Foundation can provide both scalability and independence from the underlying storage while delivering out-of-the-box redundancy across two main data centers, or failure domains, for the OpenShift platform.

In the article “Installing the first OpenShift cluster for ODF“, I described the installation of an ordinary OpenShift cluster as a starting point, and in “Installing the first ODF cluster“, the installation of an ordinary ODF cluster. The difference between those and the actual cluster here is that instead of 3 equal failure domains, I’ve got only 2 main, equal failure domains or locations. The 3rd location is not a main one: typically it’s a site built to provide something like a quorum, or a relic from earlier times when a smaller location was sufficient.

The ODF high availability stretch cluster

Previously, ODF was able to bridge availability zones or similar constructs in cloud environments using 3 failure domains. OpenShift workloads could run in any of the available failure domains, eliminating the risk that a local failure within one zone affects all workloads. A workload could simply continue to run on any other node in the OpenShift cluster without losing access to its data.

With the feature of a stretched deployment across only 2 failure domains, called the high availability stretch cluster, or sometimes simply the stretch cluster, this capability is now also available in other environments, such as classic on-premises data centers with only 2 failure domains. This resilience to data center outages or network separation provides application continuity while maintaining access to the data even in those classic environments.

Note that this feature doesn’t cover the challenge of distributing the OpenShift control plane nodes properly to withstand the outage of a location.

Infrastructure requirements and OpenShift structure

ODF is a loosely coupled system. Managing the data redundancy across any infrastructure boundaries is done via the network available on both sides, so there is always the risk of a “split brain” scenario: the cluster needs to sort out how to operate if one half of it is unavailable for some reason. The approach to overcome this challenge in ODF is an arbiter instance, a monitor (aka MON) in Ceph terms, that is placed in a 3rd location. In the drawing, this is represented by the zone called “arbiter”. The arbiter provides the failure fencing for this distributed infrastructure to prevent split-brain sets of nodes in the network, and it helps to ensure a proper quorum for the monitors of the ODF cluster. It’s a proven approach from the underlying Ceph technology.

OpenShift would have the challenge of finding a place for the 3rd control plane node. In my lab environment, I use the 3rd zone to locate both the 3rd control plane node and the arbiter. In such a configuration, I can use this control plane node to run the arbiter pod, so I don’t need an additional separate node for it; this is illustrated by the light green color in the drawing. In general, the arbiter pod can also be deployed independently of the 3rd control plane node, for instance on a worker node or an infra node that belongs to the OpenShift cluster. While the control plane node needs a far lower RTT for etcd to work properly, the ODF arbiter pod could be placed somewhere with a higher latency of up to 100 ms, such as a public cloud. One should consult Red Hat Customer Support when considering this to ensure proper functionality.

The OpenShift cluster itself has no preference for the topology by default, nor is it aware of the layout of the nodes unless it’s deployed into a structured environment, such as availability zones within a cloud. Because the OpenShift cluster will also be stretched across two zones, one needs to apply further topology information to the nodes.

The infrastructure must provide a minimum of 2 nodes providing media to ODF in each of the main data centers. An even distribution of the available capacity is expected, and an even distribution of performance as well. Ideally, the same number and size of media is provided in each of the data centers.

Since a stretched environment always introduces additional latency, the media should be flash media with enough performance. While using enterprise storage (SAN) is supported, the performance and latency figures should be reviewed if it’s not an all-flash arrangement.

OpenShift cluster for 2 main DC – simulated

The installation of the OCP cluster for a 2-main-DC setup differs a bit from the default configuration with either one or three failure domains. The installation will still have 3 control plane nodes, since exactly 3 are supported. The worker nodes should be spread across both main DC zones, and the ODF cluster will need at least 4 nodes instead of the 3 of the default configuration. For this, I need to change the configuration of the IPI.

For my limited environment, I choose to place the ODF components on the worker nodes. This could also be done in a production environment; a valid alternative, however, is the use of infra nodes. That would allow scaling the number of worker nodes and the number of ODF nodes independently. Important for planning is the requirement of equal capacity and node distribution across both DCs. It might be easier to scale only the ODF part if the storage layer needs an upgrade, or only the worker nodes if the applications need more resources. In addition, special nodes, such as some with GPUs, might be required, and mixing the roles within a cluster to save on ODF nodes might not be the best approach.

In my installation, I’ll be using the worker nodes for demonstration. During the installation with the openshift-install tool, I need to set the number of workers to 4 and make some adjustments to the resources of the nodes.

I start with creating the install config files:

[root@ocp21-jump ~]# ./openshift-install create install-config --dir=ocp411
? Platform alibabacloud
? Alibaba Cloud Access Key ID [? for help] FATAL failed to fetch Install Config: failed to fetch dependency of "Install Config": failed to fetch dependency of "Base Domain": failed to generate asset "Platform": interrupt 
[root@ocp21-jump ~]# ./openshift-install create install-config --dir=ocp411
? Platform ovirt
? Cluster Default
? Storage domain Data2
? Network net_210
? Internal API virtual IP 172.21.50.6
? Ingress virtual IP 172.21.50.7
? Base Domain dslab.local
? Cluster Name ocp21
? Pull Secret [? for help] ***
INFO Install-Config created in: ocp411 
[root@ocp21-jump ~]# 

Once the files have been created in the previous step, I need to change two parameters: one to relax the placement of nodes across different hypervisors, which I don’t have in my lab, and one to change the number of worker nodes:

[root@ocp21-jump ~]# cat ocp411/install-config.yaml 
apiVersion: v1
baseDomain: dslab.local
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    ovirt:
      affinityGroupsNames:
      - compute
  replicas: 4
...
...
platform:
  ovirt:
    affinityGroups:
    - description: AffinityGroup for spreading each compute machine to a different
        host
      enforcing: false
      name: compute
      priority: 3
    - description: AffinityGroup for spreading each control plane machine to a different
        host
      enforcing: false
      name: controlplane
      priority: 5
...
[root@ocp21-jump ~]#

With this change, the manifests can now be created, picking up the changes made. 

[root@ocp21-jump ~]# ./openshift-install create manifests --dir=ocp411
INFO Consuming Install Config from target directory 
INFO Manifests created in: ocp411/manifests and ocp411/openshift 
[root@ocp21-jump ~]#

Initially, as described in “Installing the first ODF cluster“, ODF would require ~10 vCPU and 24 GB of memory on every worker node for the desired configuration. I decided on higher resources to leave room for adding further devices to ODF later on and, of course, to have some headroom for running services on the worker nodes as well. I’ll go with 16 vCPU and 40 GiB of memory for the worker nodes.

Also, the 3rd control plane node residing in the 3rd zone, which will host the arbiter monitor pod, will need more resources. I’d like to apply 6 vCPU and 18 GiB of memory, providing 2 vCPU and 2 GiB of memory on top of the default settings, in the *master_machines-1.yaml file:

[root@ocp21-jump ~]# cat ocp411/openshift/99_openshift-cluster-api_master-machines-1.yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  ...
  name: ocp21-kvm2z-master-1
  namespace: openshift-machine-api
spec:
  ...
  providerSpec:
    value:
      ...
      cpu:
        cores: 6
        sockets: 1
        threads: 1
      memory_mb: 18432
      ...
...
[root@ocp21-jump ~]#

The changes to the worker node configuration must be provided in the machineset definition file, here *worker-machineset-0.yaml. I changed the worker nodes to provide 16 vCPU and 40 GiB of memory:

[root@ocp21-jump ~]# cat ocp411/openshift/99_openshift-cluster-api_worker-machineset-0.yaml 
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  ...
  name: ocp21-kvm2z-worker
  namespace: openshift-machine-api
spec:
  replicas: 4
  ...
  template:
    ...
    spec:
      ...
      providerSpec:
        value:
          ...
          cpu:
            cores: 16
            sockets: 1
            threads: 1
          memory_mb: 40960
          ...
...
[root@ocp21-jump ~]# 

With these changes, I can start the installation. It will take a while and should finish with a working OCP cluster with 3 control plane nodes and 4 worker nodes.

[root@ocp21-jump ~]# ./openshift-install create cluster --dir=ocp411 --log-level=debug
...
...
INFO Install complete!                            
INFO To access the cluster as the system:admin user when using 'oc', run 
INFO     export KUBECONFIG=/root/ocp411/auth/kubeconfig 
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.ocp21.dslab.local 
INFO Login to the console with user: "kubeadmin", and password: "XXXXX-XXXXX-XXXXX-XXXXX" 
DEBUG Time elapsed per stage:                      
DEBUG              image: 39s                      
DEBUG            cluster: 3m47s                    
DEBUG          bootstrap: 1m39s                    
DEBUG Bootstrap Complete: 16m18s                   
DEBUG                API: 1m19s                    
DEBUG  Bootstrap Destroy: 1m12s                    
DEBUG  Cluster Operators: 7m7s                     
INFO Time elapsed: 30m51s                         
[root@ocp21-jump ~]# 

Preparing for use with ODF – the local storage configuration

The newly provisioned environment looks like this:

[root@ocp21-jump ~]# oc get nodes
NAME                       STATUS   ROLES    AGE   VERSION
ocp21-kvm2z-master-0       Ready    master   70m   v1.24.6+5157800
ocp21-kvm2z-master-1       Ready    master   70m   v1.24.6+5157800
ocp21-kvm2z-master-2       Ready    master   70m   v1.24.6+5157800
ocp21-kvm2z-worker-bcjpx   Ready    worker   59m   v1.24.6+5157800
ocp21-kvm2z-worker-swzwh   Ready    worker   56m   v1.24.6+5157800
ocp21-kvm2z-worker-w9gz2   Ready    worker   59m   v1.24.6+5157800
ocp21-kvm2z-worker-xlth6   Ready    worker   58m   v1.24.6+5157800
[root@ocp21-jump ~]# 

I skip completing the configuration of the OCP cluster and carry on with the ODF installation.

Some storage needs to be provided to the worker nodes to give ODF any capacity to work with. Since the ODF stretch cluster is only supported with local storage, I need to provision the storage devices to the nodes manually. In any environment other than a bare-metal deployment with internal media, where the hardware storage devices are already visible to the nodes as block devices, the devices must be provisioned before moving on to the ODF deployment. Note also the remarks about recognizing the devices as flash devices (yes, they should be flash in any deployment) in the cluster deployment section of “Installing the first OpenShift cluster for ODF”.
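To double-check how the devices are recognized, one can look at the rotational flag the kernel reports per disk. The following is just a sketch: the node name is from my lab, and the helper merely parses lsblk-style output so the check could be scripted; on a live cluster the real output would come from a debug pod.

```shell
# Sketch: verify that disks intended for ODF are non-rotational (flash).
# On a live cluster, the lsblk output would come from a debug pod, e.g.:
#   oc debug node/ocp21-kvm2z-worker-bcjpx -- chroot /host lsblk -d -o NAME,SIZE,ROTA
# ROTA=0 means the kernel sees the device as non-rotational (flash).

# Check one "NAME SIZE ROTA" line as printed by lsblk.
is_flash() {
  [ "$(echo "$1" | awk '{print $3}')" = "0" ]
}

# Example lines in lsblk's output format:
is_flash "sdb 100G 0" && echo "sdb: flash"
is_flash "sda 500G 1" || echo "sda: rotational - review before use"
```

If a flash device is wrongly reported as rotational (common with some virtualized disks), this is the point to notice and correct it before the ODF deployment.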

ODF stretch clusters can only be deployed in internal mode using the Local Storage Operator (LSO). For a proper distribution of the capacity, the devices must be deployed across the nodes in a distinct manner: we need to provide equal capacity within both failure domains as well as across the nodes. For our minimum installation with only 4 ODF nodes, this is even more essential, because the 4 replicas of all data must be reliably distributed across a sufficient number of independent nodes. We always need at least 2 nodes in each failure domain, providing 2 independent replicas per site.

Providing the same capacity across all nodes is currently the only supported layout. In real production environments, this is a requirement for various reasons, especially for smaller clusters.

For long-running clusters, especially in bare-metal environments, using the same capacity across all nodes may not be possible for the whole lifetime: the capacity of available flash devices might change. However, as in an ordinary Ceph cluster, the capacity must be provided properly in the involved failure domains, or at some point one won’t be able to allocate enough capacity for new data. With more nodes per failure domain, the capacity distribution may vary over time. Even then it is beneficial to maintain a mostly equal capacity distribution, unless dramatically faster and more powerful new nodes are introduced: the data distribution of active data determines the utilization

  • of the involved nodes,
  • of the involved media, and
  • of the network connections.

Even if one provided enough resources in terms of CPU and memory to run the required OSD pods, there might be additional limitations, which is why one should stick with the recommendation of equally sized nodes.
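As a preview of how the LSO will later consume the per-node devices described above, a LocalVolume resource along these lines could expose two devices on each selected node. This is only a sketch: the device paths, resource name, and storage class name are assumptions from my lab, not a finished configuration.

```yaml
# Sketch of a LocalVolume CR for the LSO (names and paths are assumptions).
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: local-block
  namespace: openshift-local-storage
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: cluster.ocs.openshift.io/openshift-storage
        operator: In
        values:
        - ""
  storageClassDevices:
  - storageClassName: localblock
    volumeMode: Block
    devicePaths:          # the two 100 GB devices per node in my lab
    - /dev/sdb
    - /dev/sdc
```

The actual LSO configuration will be part of the ODF installation in the next article; the point here is that the equal per-node device layout prepared now maps directly onto such a definition.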

In my test cluster, all nodes get two devices of 100 GB each. Note that one of the nodes, here the 3rd one, already has an additional device deployed, baked in by the IPI used within RHV, which holds the PV for the internal registry:

[root@ocp21-jump ~]# oc get -A pvc
NAMESPACE                  NAME                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
openshift-image-registry   image-registry-storage   Bound    pvc-7cfc099e-8aba-48b3-9bc6-e7fad00b003f   100Gi      RWO            ovirt-csi-sc   72m
[root@ocp21-jump ~]#

Preparing the ODF installation for “HA stretch cluster”

We are now at the first step of the actual ODF installation. Until now, everything except the labeling of the nodes has been more or less specific to provisioning any OCP environment, and numerous ways of automating it are possible. Different from the default installation, the HA stretch cluster deployment is meant to cover only situations with exactly 2 main DCs. It should not be used for a single DC or for more than 2 main DCs.

Zones in the cluster

Since most environments don’t really advertise which nodes belong to which DC, a kind of workaround is used to provide this information to the ODF cluster so that it can craft a proper failure domain layout.

The “metro stretch” deployment uses different nodes in different failure domains. According to the documentation, each node must be properly assigned to a “topology” zone. This is also required for the control plane nodes if one wants to deploy the arbiter monitor pod to the 3rd “neutral” zone and this very node.

The structure of the topology zones should reflect the layout of the infrastructure the cluster runs in. In my environment, there is no difference since everything runs on the same virtualisation host. In real virtualized environments with only two main DCs, the nodes should be placed into the DCs with affinity set. One should have the same number of ODF nodes in both DCs; at the very least, the number of devices used for OSDs must match, and the minimum number of nodes per DC is 2 in order to host the 4 replicas properly on different nodes.

It’s important to understand that the failure domains are fixed for now: only one additional failure domain definition is possible, on top of the node itself as the basic failure domain for the OSDs residing on it. It’s not possible to subdivide a DC’s node distribution into additional domains for different racks and so on. When designing the ODF cluster node distribution, all nodes of a DC should therefore either sit in the same rack, or each in a different rack, so that a rack failure is no different from a node failure and represents the same failure scenario.

It’s important to understand that a DC is considered down not only when its OSDs are down: if the 2 monitor roles in the same DC fail, the whole site, including the OSDs in that site, becomes unavailable, no matter how many OSDs are still running there on nodes other than those of the monitor pods. Losing both monitors in the same DC is equivalent to a complete DC outage for the internal Ceph cluster. No reads or writes are served in this zone until at least one of the local monitor pods in the affected DC comes back up or is replaced by the rook operator.

In on-premises environments, the failure domain information for the ODF cluster is usually not available by default and must be introduced for a proper setup. Applying a topology zone label carrying this information is the way to go. The names of the topology zones are free-form and can reflect the environment’s names or whatever makes the location of the nodes easy to understand. The labels must be added before the installation of the ODF cluster.

I’ll use the following scheme of node distribution for my case:
zone DC1 => *master-0, *worker-bcjpx, *worker-swzwh
zone arbiter => (neutral zone) *master-1
zone DC2 => *master-2, *worker-w9gz2, *worker-xlth6
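Instead of typing each label command by hand, a small loop can generate them from this scheme. The helper below is just a sketch that prints the commands (using my lab’s node names), so the mapping can be reviewed before applying it:

```shell
# Sketch: generate the "oc label" commands from the zone mapping above.
# It only prints the commands; pipe the output to "sh" to apply them.
print_label_cmds() {
  # arguments: zone name followed by the node names for that zone
  zone="$1"; shift
  for node in "$@"; do
    echo "oc label node $node topology.kubernetes.io/zone=$zone"
  done
}

print_label_cmds DC1     ocp21-kvm2z-master-0 ocp21-kvm2z-worker-bcjpx ocp21-kvm2z-worker-swzwh
print_label_cmds arbiter ocp21-kvm2z-master-1
print_label_cmds DC2     ocp21-kvm2z-master-2 ocp21-kvm2z-worker-w9gz2 ocp21-kvm2z-worker-xlth6
```

In my small lab, I simply ran the commands individually, as shown below.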

[root@ocp21-jump ~]# oc label node ocp21-kvm2z-master-0 topology.kubernetes.io/zone=DC1
node/ocp21-kvm2z-master-0 labeled
[root@ocp21-jump ~]# oc label node ocp21-kvm2z-worker-bcjpx topology.kubernetes.io/zone=DC1
node/ocp21-kvm2z-worker-bcjpx labeled
[root@ocp21-jump ~]# oc label node ocp21-kvm2z-worker-swzwh topology.kubernetes.io/zone=DC1
node/ocp21-kvm2z-worker-swzwh labeled
[root@ocp21-jump ~]# oc label node ocp21-kvm2z-master-1 topology.kubernetes.io/zone=arbiter
node/ocp21-kvm2z-master-1 labeled
[root@ocp21-jump ~]# oc label node ocp21-kvm2z-master-2 topology.kubernetes.io/zone=DC2
node/ocp21-kvm2z-master-2 labeled
[root@ocp21-jump ~]# oc label node ocp21-kvm2z-worker-w9gz2 topology.kubernetes.io/zone=DC2
node/ocp21-kvm2z-worker-w9gz2 labeled
[root@ocp21-jump ~]# oc label node ocp21-kvm2z-worker-xlth6 topology.kubernetes.io/zone=DC2
node/ocp21-kvm2z-worker-xlth6 labeled
[root@ocp21-jump ~]# 

One should check the node labels to verify that the intended setup is really in place. A proper setup of the zone labels determines the functional correctness of the ODF cluster.

[root@ocp21-jump ~]# oc get nodes -l topology.kubernetes.io/zone=arbiter -o name
node/ocp21-kvm2z-master-1
[root@ocp21-jump ~]# oc get nodes -l topology.kubernetes.io/zone=DC1 -o name
node/ocp21-kvm2z-master-0
node/ocp21-kvm2z-worker-bcjpx
node/ocp21-kvm2z-worker-swzwh
[root@ocp21-jump ~]# oc get nodes -l topology.kubernetes.io/zone=DC2 -o name
node/ocp21-kvm2z-master-2
node/ocp21-kvm2z-worker-w9gz2
node/ocp21-kvm2z-worker-xlth6
[root@ocp21-jump ~]#

The preparation is finished here and the next article will continue with the installation. Thanks for reading so far this time!

