Managing a cluster with multi-architecture compute machines
Managing a cluster that has nodes with multiple architectures requires you to consider node architecture as you monitor the cluster and manage your workloads. You must take additional factors into account when you configure cluster resource requirements and behavior, or when you schedule workloads in a multi-architecture cluster.
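Because a pod can run only on a node whose architecture matches its images, it is often useful to first see which architectures are present in the cluster. The following commands are a quick, optional sketch of that check; they use only the standard kubernetes.io/arch node label and the .status.nodeInfo.architecture field that the kubelet reports.
$ oc get nodes -L kubernetes.io/arch
$ oc get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture
Either command shows each node name together with its CPU architecture, for example amd64 or arm64.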
Scheduling workloads on clusters with multi-architecture compute machines
When you deploy workloads on a cluster with compute nodes that use different architectures, you must align pod architecture with the architecture of the underlying node. Depending on the underlying node architecture, your workload might also require additional configuration for particular resources.
You can use the Multiarch Tuning Operator to enable architecture-aware scheduling of workloads on clusters with multi-architecture compute machines. At pod creation time, the Multiarch Tuning Operator adds scheduler predicates to the pod specification based on the architectures that the pod can support.
For information about the Multiarch Tuning Operator, see Managing workloads on multi-architecture clusters by using the Multiarch Tuning Operator.
Sample multi-architecture node workload deployments
Scheduling a workload to an appropriate node based on architecture works in the same way as scheduling based on any other node characteristic. Consider the following options when determining how to schedule your workloads.
- Using nodeAffinity to schedule nodes with specific architectures
To allow a workload to be scheduled on only a set of nodes with architectures supported by its images, set the spec.affinity.nodeAffinity field in your pod's template specification.
Example deployment with node affinity set:
apiVersion: apps/v1
kind: Deployment
metadata:
# ...
spec:
# ...
  template:
# ...
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - amd64
                - arm64
In the values list, specify the supported architectures. Valid values include amd64, arm64, or both values.
- Tainting each node for a specific architecture
You can taint a node to prevent it from scheduling workloads that are incompatible with its architecture. When your cluster uses a MachineSet object, you can add parameters to the .spec.template.spec.taints field to avoid workloads being scheduled on nodes with non-supported architectures.
Before you add a taint to a node, you must scale down the MachineSet object or remove existing available machines. For more information, see Modifying a compute machine set. You can verify applied taints and architecture labels by using the commands in the sketch that follows this list.
Example machine set with taint set:
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
# ...
spec:
# ...
  template:
# ...
    spec:
# ...
      taints:
      - effect: NoSchedule
        key: multiarch.openshift.io/arch
        value: arm64
You can also set a taint on a specific node by running the following command:
$ oc adm taint nodes <node-name> multiarch.openshift.io/arch=arm64:NoSchedule
- Creating a default toleration in a namespace
When a node or machine set has a taint, only workloads that tolerate that taint can be scheduled. You can annotate a namespace so that all of the workloads in it get the same default toleration by running the following command:
Example default toleration set on a namespace:
$ oc annotate namespace my-namespace \
    'scheduler.alpha.kubernetes.io/defaultTolerations'='[{"operator": "Exists", "effect": "NoSchedule", "key": "multiarch.openshift.io/arch"}]'
- Tolerating architecture taints in workloads
When a node or machine set has a taint, only workloads that tolerate that taint can be scheduled. You can configure your workload with a toleration so that it is scheduled on nodes with specific architecture taints.
Example deployment with toleration set:
apiVersion: apps/v1
kind: Deployment
metadata:
# ...
spec:
# ...
  template:
# ...
    spec:
      tolerations:
      - key: "multiarch.openshift.io/arch"
        value: "arm64"
        operator: "Equal"
        effect: "NoSchedule"
This example deployment can be scheduled on nodes and machine sets that have the multiarch.openshift.io/arch=arm64 taint specified.
- Using node affinity with taints and tolerations
When a scheduler computes the set of nodes to schedule a pod, tolerations can broaden the set while node affinity restricts the set. If you set a taint on nodes that have a specific architecture, you must also add a toleration to the workloads that you want to be scheduled there.
Example deployment with node affinity and toleration set:
apiVersion: apps/v1
kind: Deployment
metadata:
# ...
spec:
# ...
  template:
# ...
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - amd64
                - arm64
      tolerations:
      - key: "multiarch.openshift.io/arch"
        value: "arm64"
        operator: "Equal"
        effect: "NoSchedule"
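After you apply architecture labels, taints, and namespace annotations as described in the preceding options, you can spot-check the results from the command line. The following commands are a minimal verification sketch rather than part of a documented procedure; they rely only on standard node fields, the <node-name> placeholder used above, and the my-namespace example namespace.
$ oc get nodes -l kubernetes.io/arch=arm64
$ oc get node <node-name> -o jsonpath='{.spec.taints}'
$ oc get namespace my-namespace -o jsonpath='{.metadata.annotations}'
The first command lists the arm64 nodes, the second prints any taints set on a node, and the third shows the default toleration annotation on the namespace.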
Enabling 64k pages on the Red Hat Enterprise Linux CoreOS (RHCOS) kernel
You can enable the 64k memory page size in the Red Hat Enterprise Linux CoreOS (RHCOS) kernel on the 64-bit ARM compute machines in your cluster. The 64k page size kernel specification can be used for large GPU or high memory workloads. You enable it by using the Machine Config Operator (MCO), which uses a machine config pool to update the kernel. To enable 64k page sizes, you must dedicate a machine config pool to your 64-bit ARM compute nodes and enable the 64k-pages kernel on that pool.
Important
Using 64k pages is exclusive to 64-bit ARM architecture compute nodes or clusters installed on 64-bit ARM machines. If you configure the 64k pages kernel on a machine config pool using 64-bit x86 machines, the machine config pool and MCO will degrade.
- You installed the OpenShift CLI (oc).
- You created a cluster with compute nodes of different architecture on one of the supported platforms.
- Label the nodes where you want to run the 64k page size kernel:
$ oc label node <node_name> <label>
Example command:
$ oc label node worker-arm64-01 node-role.kubernetes.io/worker-64k-pages=
- Create a machine config pool that contains the worker role that uses the ARM64 architecture and the worker-64k-pages role:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-64k-pages
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values:
        - worker
        - worker-64k-pages
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-64k-pages: ""
      kubernetes.io/arch: arm64
- Create a machine config on your compute node to enable 64k-pages with the 64k-pages parameter.
$ oc create -f <filename>.yaml
Example MachineConfig:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "worker-64k-pages"
  name: 99-worker-64kpages
spec:
  kernelType: 64k-pages
In the machineconfiguration.openshift.io/role label, specify the value of the label in the custom machine config pool. The example MachineConfig uses the worker-64k-pages label to enable 64k pages in the worker-64k-pages pool.
In kernelType, specify your desired kernel type. Valid values are 64k-pages and default.
Note: The 64k-pages type is supported on only 64-bit ARM architecture based compute nodes. The realtime type is supported on only 64-bit x86 architecture based compute nodes.
- To view your new worker-64k-pages machine config pool, run the following command:
$ oc get mcp
Example output:
NAME               CONFIG                                                        UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master             rendered-master-9d55ac9a91127c36314e1efe7d77fbf8              True      False      False      3              3                   3                     0                      361d
worker             rendered-worker-e7b61751c4a5b7ff995d64b967c421ff              True      False      False      7              7                   7                     0                      361d
worker-64k-pages   rendered-worker-64k-pages-e7b61751c4a5b7ff995d64b967c421ff    True      False      False      2              2                   2                     0                      35m
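After the worker-64k-pages pool reports as updated, you can optionally confirm the page size directly on one of the labeled nodes. This is a quick sketch rather than a documented requirement; it assumes that you can start a debug pod on the node and that 65536 is the value expected for a 64k page size.
$ oc debug node/<node_name> -- chroot /host getconf PAGESIZE
If the 64k-pages kernel is active, the command prints 65536 instead of the default 4096.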
Importing manifest lists in image streams on your multi-architecture compute machines
On an OpenShift Container Platform 4.19 cluster with multi-architecture compute machines, the image streams in the cluster do not import manifest lists automatically. You must manually change the default importMode option to the PreserveOriginal option in order to import the manifest list.
- You installed the OpenShift Container Platform CLI (oc).
- The following example command shows how to patch the ImageStream cli-artifacts so that the cli-artifacts:latest image stream tag is imported as a manifest list.
$ oc patch is/cli-artifacts -n openshift -p '{"spec":{"tags":[{"name":"latest","importPolicy":{"importMode":"PreserveOriginal"}}]}}'
- You can check that the manifest lists imported properly by inspecting the image stream tag. The following command will list the individual architecture manifests for a particular tag.
$ oc get istag cli-artifacts:latest -n openshift -oyaml
If the dockerImageManifests object is present, then the manifest list import was successful.
Example output of the dockerImageManifests object:
dockerImageManifests:
  - architecture: amd64
    digest: sha256:16d4c96c52923a9968fbfa69425ec703aff711f1db822e4e9788bf5d2bee5d77
    manifestSize: 1252
    mediaType: application/vnd.docker.distribution.manifest.v2+json
    os: linux
  - architecture: arm64
    digest: sha256:6ec8ad0d897bcdf727531f7d0b716931728999492709d19d8b09f0d90d57f626
    manifestSize: 1252
    mediaType: application/vnd.docker.distribution.manifest.v2+json
    os: linux
  - architecture: ppc64le
    digest: sha256:65949e3a80349cdc42acd8c5b34cde6ebc3241eae8daaeea458498fedb359a6a
    manifestSize: 1252
    mediaType: application/vnd.docker.distribution.manifest.v2+json
    os: linux
  - architecture: s390x
    digest: sha256:75f4fa21224b5d5d511bea8f92dfa8e1c00231e5c81ab95e83c3013d245d1719
    manifestSize: 1252
    mediaType: application/vnd.docker.distribution.manifest.v2+json
    os: linux
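If you only need the architectures rather than the full YAML, a jsonpath query can extract them. This is a convenience sketch, not part of the documented procedure, and it assumes that the dockerImageManifests list is nested under the image field of the ImageStreamTag object, as it is in the full output of the previous command.
$ oc get istag cli-artifacts:latest -n openshift -o jsonpath='{.image.dockerImageManifests[*].architecture}'
A successful manifest list import prints one entry per architecture, for example amd64 arm64 ppc64le s390x.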