Controlling pod placement on nodes using node affinity rules
Affinity is a property of pods that controls the nodes on which they prefer to be scheduled.
In OpenShift Container Platform, node affinity is a set of rules used by the scheduler to determine where a pod can be placed. The rules are defined using custom labels on the nodes and label selectors specified in pods.
Understanding node affinity
Node affinity allows a pod to specify an affinity towards a group of nodes it can be placed on. The node does not have control over the placement.
For example, you could configure a pod to only run on a node with a specific CPU or in a specific availability zone.
There are two types of node affinity rules: required and preferred.
Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that the scheduler tries to place the pod on a node that meets the rule, but does not guarantee placement on such a node.
Note
If labels on a node change at runtime such that a node affinity rule on a pod is no longer met, the pod continues to run on the node.
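For example, removing a label that a required rule matches does not evict pods that are already running on the node. A minimal illustration, using the node and label names from the procedures later in this section (the trailing hyphen removes the label):

    $ oc label node node1 e2e-az-name-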
You configure node affinity through the Pod spec file. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, and then the scheduler attempts to meet the preferred rule; a sketch combining both rule types follows the two examples below.
The following example is a Pod spec with a rule that requires the pod be placed on a node with a label whose key is e2e-az-NorthSouth and whose value is either e2e-az-North or e2e-az-South:
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: e2e-az-NorthSouth
            operator: In
            values:
            - e2e-az-North
            - e2e-az-South
  containers:
  - name: with-node-affinity
    image: docker.io/ocpqe/hello-pod
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: [ALL]
#...
- The stanza to configure node affinity.
- Defines a required rule.
- The key/value pair (label) that must be matched to apply the rule.
- The operator represents the relationship between the label on the node and the set of values in the matchExpression parameters in the Pod spec. This value can be In, NotIn, Exists, DoesNotExist, Lt, or Gt.
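The Lt and Gt operators compare a node label’s value as an integer, so they can express thresholds rather than set membership. A minimal sketch, assuming a hypothetical cpu-count label maintained by the cluster administrator, that requires a node reporting more than 4 CPUs:

apiVersion: v1
kind: Pod
metadata:
  name: with-gt-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cpu-count        # hypothetical numeric label
            operator: Gt
            values:
            - "4"                 # Gt and Lt take a single integer value, written as a string
  containers:
  - name: with-gt-node-affinity
    image: docker.io/ocpqe/hello-pod
#...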
The following example is a Pod spec with a preferred rule that a node with a label whose key is e2e-az-EastWest and whose value is either e2e-az-East or e2e-az-West is preferred for the pod:
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: e2e-az-EastWest
            operator: In
            values:
            - e2e-az-East
            - e2e-az-West
  containers:
  - name: with-node-affinity
    image: docker.io/ocpqe/hello-pod
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: [ALL]
#...
- The stanza to configure node affinity.
- Defines a preferred rule.
- Specifies a weight for a preferred rule. The node with the highest weight is preferred.
- The key/value pair (label) that must be matched to apply the rule.
- The operator represents the relationship between the label on the node and the set of values in the matchExpression parameters in the Pod spec. This value can be In, NotIn, Exists, DoesNotExist, Lt, or Gt.
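As noted earlier, you can combine both rule types in a single Pod spec; the scheduler first filters nodes by the required rule and then uses the preferred rule to rank the remaining candidates. A minimal sketch that merges the two examples above:

apiVersion: v1
kind: Pod
metadata:
  name: with-combined-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:     # hard requirement
        nodeSelectorTerms:
        - matchExpressions:
          - key: e2e-az-NorthSouth
            operator: In
            values:
            - e2e-az-North
            - e2e-az-South
      preferredDuringSchedulingIgnoredDuringExecution:    # soft preference among matching nodes
      - weight: 1
        preference:
          matchExpressions:
          - key: e2e-az-EastWest
            operator: In
            values:
            - e2e-az-East
  containers:
  - name: with-combined-node-affinity
    image: docker.io/ocpqe/hello-pod
#...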
There is no explicit node anti-affinity concept, but using the NotIn or DoesNotExist operator replicates that behavior.
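For example, the following minimal sketch keeps the pod off any node whose e2e-az-NorthSouth label is set to e2e-az-North by using the NotIn operator (the label key is reused from the earlier example; NotIn also matches nodes that do not carry the label at all):

apiVersion: v1
kind: Pod
metadata:
  name: with-node-anti-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: e2e-az-NorthSouth
            operator: NotIn          # the node must not carry this label value
            values:
            - e2e-az-North
  containers:
  - name: with-node-anti-affinity
    image: docker.io/ocpqe/hello-pod
#...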
Note
If you are using node affinity and node selectors in the same pod configuration, note the following:
- If you configure both nodeSelector and nodeAffinity, both conditions must be satisfied for the pod to be scheduled onto a candidate node.
- If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms is satisfied.
- If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions are satisfied. The sketch after this list illustrates these combinations.
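A minimal sketch of these combinations, using the zone and e2e-az-name labels from this section plus an illustrative disktype label for the node selector: the nodeSelector must match in addition to the affinity, the two nodeSelectorTerms are ORed, and the matchExpressions within a term are ANDed:

apiVersion: v1
kind: Pod
metadata:
  name: selector-and-affinity
spec:
  nodeSelector:
    disktype: ssd                    # illustrative label; must match in addition to the affinity below
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:          # term 1: zone=us AND e2e-az-name exists
          - key: zone
            operator: In
            values:
            - us
          - key: e2e-az-name
            operator: Exists
        - matchExpressions:          # term 2, ORed with term 1: zone=emea
          - key: zone
            operator: In
            values:
            - emea
  containers:
  - name: selector-and-affinity
    image: docker.io/ocpqe/hello-pod
#...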
Configuring a required node affinity rule
Required rules must be met before a pod can be scheduled on a node.
The following steps demonstrate a simple configuration that labels a node and creates a pod that the scheduler is required to place on that node.
- Add a label to a node using the oc label node command:

    $ oc label node node1 e2e-az-name=e2e-az1

  Tip
  You can alternatively apply the following YAML to add the label:

    kind: Node
    apiVersion: v1
    metadata:
      name: <node_name>
      labels:
        e2e-az-name: e2e-az1
    #...
- Create a pod with a specific label in the pod spec:

  - Create a YAML file with the following content:

    Note
    You cannot add an affinity directly to a scheduled pod.

      apiVersion: v1
      kind: Pod
      metadata:
        name: s1
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: e2e-az-name
                  values:
                  - e2e-az1
                  - e2e-az2
                  operator: In
      #...

    - Adds a node affinity.
    - Configures the requiredDuringSchedulingIgnoredDuringExecution parameter.
    - Specifies the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same key and values parameters as the label in the node.
    - Specifies an operator. The operator can be In, NotIn, Exists, or DoesNotExist. For example, use the operator In to require the label to be on the node.

  - Create the pod:

      $ oc create -f <file-name>.yaml
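To optionally confirm the placement, check the NODE column of the pod listing; with the required rule above, the s1 pod should be running on the labeled node:

    $ oc get pod s1 -o wide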
Configuring a preferred node affinity rule
Preferred rules specify that the scheduler tries to place the pod on a node that meets the rule, but does not guarantee placement on such a node.
The following steps demonstrate a simple configuration that labels a node and creates a pod that the scheduler tries to place on that node.
- Add a label to a node using the oc label node command:

    $ oc label node node1 e2e-az-name=e2e-az3
- Create a pod with a specific label:

  - Create a YAML file with the following content:

    Note
    You cannot add an affinity directly to a scheduled pod.

      apiVersion: v1
      kind: Pod
      metadata:
        name: s1
      spec:
        affinity:
          nodeAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                - key: e2e-az-name
                  values:
                  - e2e-az3
                  operator: In
      #...

    - Adds a node affinity.
    - Configures the preferredDuringSchedulingIgnoredDuringExecution parameter.
    - Specifies a weight for the node, as a number 1-100. The node with the highest weight is preferred.
    - Specifies the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same key and values parameters as the label in the node.
    - Specifies an operator. The operator can be In, NotIn, Exists, or DoesNotExist. For example, use the operator In to require the label to be on the node.

  - Create the pod:

      $ oc create -f <file-name>.yaml
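Because this is a preferred rule, scheduling is not blocked if no node carries the e2e-az-name=e2e-az3 label or the labeled node cannot accept the pod; the scheduler simply falls back to another node. You can optionally verify where the pod landed:

    $ oc get pod s1 -o wide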
Sample node affinity rules
The following examples demonstrate node affinity.
Node affinity with matching labels
The following example demonstrates node affinity for a node and pod with matching labels:
- The Node1 node has the label zone: us:

    $ oc label node node1 zone=us

  Tip
  You can alternatively apply the following YAML to add the label:

    kind: Node
    apiVersion: v1
    metadata:
      name: <node_name>
      labels:
        zone: us
    #...

- The pod-s1 pod has the zone and us key/value pair under a required node affinity rule:

    $ cat pod-s1.yaml

  Example output

    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s1
    spec:
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
        - image: "docker.io/ocpqe/hello-pod"
          name: hello-pod
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: [ALL]
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "zone"
                operator: In
                values:
                - us
    #...

- The pod-s1 pod can be scheduled on Node1:

    $ oc get pod -o wide

  Example output

    NAME     READY   STATUS    RESTARTS   AGE   IP    NODE
    pod-s1   1/1     Running   0          4m    IP1   node1
Node affinity with no matching labels
The following example demonstrates node affinity for a node and pod without matching labels:
- The Node1 node has the label zone: emea:

    $ oc label node node1 zone=emea

  Tip
  You can alternatively apply the following YAML to add the label:

    kind: Node
    apiVersion: v1
    metadata:
      name: <node_name>
      labels:
        zone: emea
    #...

- The pod-s1 pod has the zone and us key/value pair under a required node affinity rule:

    $ cat pod-s1.yaml

  Example output

    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s1
    spec:
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
        - image: "docker.io/ocpqe/hello-pod"
          name: hello-pod
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: [ALL]
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "zone"
                operator: In
                values:
                - us
    #...

- The pod-s1 pod cannot be scheduled on Node1:

    $ oc describe pod pod-s1

  Example output

    ...
    Events:
     FirstSeen LastSeen Count From              SubObjectPath  Type     Reason
     --------- -------- ----- ----              -------------  -------- ------
     1m        33s      8     default-scheduler                Warning  FailedScheduling No nodes are available that match all of the following predicates:: MatchNodeSelector (1).
Using node affinity to control where an Operator is installed
By default, when you install an Operator, OpenShift Container Platform installs the Operator pod to one of your worker nodes randomly. However, there might be situations where you want that pod scheduled on a specific node or set of nodes.
The following examples describe situations where you might want to schedule an Operator pod to a specific node or set of nodes:
- If an Operator requires a particular platform, such as amd64 or arm64
- If an Operator requires a particular operating system, such as Linux or Windows
- If you want Operators that work together scheduled on the same host or on hosts located on the same rack
- If you want Operators dispersed throughout the infrastructure to avoid downtime due to network or hardware issues
You can control where an Operator pod is installed by adding node affinity constraints to the Operator’s Subscription object.
The following examples show how to use node affinity to install an instance of the Custom Metrics Autoscaler Operator to a specific node in the cluster:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-custom-metrics-autoscaler-operator
  namespace: openshift-keda
spec:
  name: my-package
  source: my-operators
  sourceNamespace: operator-registries
  config:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - ip-10-0-163-94.us-west-2.compute.internal
#...
- A node affinity that requires the Operator’s pod to be scheduled on a node named
ip-10-0-163-94.us-west-2.compute.internal.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-custom-metrics-autoscaler-operator
  namespace: openshift-keda
spec:
  name: my-package
  source: my-operators
  sourceNamespace: operator-registries
  config:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/arch
              operator: In
              values:
              - arm64
            - key: kubernetes.io/os
              operator: In
              values:
              - linux
#...
- A node affinity that requires the Operator’s pod to be scheduled on a node with the kubernetes.io/arch=arm64 and kubernetes.io/os=linux labels.
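The config.affinity stanza accepts the standard Kubernetes affinity structure, so a preferred rule should also work when node placement is a preference rather than a hard requirement. A minimal sketch, reusing the hypothetical package from the examples above and the well-known topology.kubernetes.io/zone label (the zone value is illustrative):

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-custom-metrics-autoscaler-operator
  namespace: openshift-keda
spec:
  name: my-package
  source: my-operators
  sourceNamespace: operator-registries
  config:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:   # soft preference; the install proceeds even if no node matches
        - weight: 1
          preference:
            matchExpressions:
            - key: topology.kubernetes.io/zone             # well-known zone label; value below is illustrative
              operator: In
              values:
              - us-west-2a
#...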
To control the placement of an Operator pod, complete the following steps:
- Install the Operator as usual.

- If needed, ensure that your nodes are labeled to properly respond to the affinity.

- Edit the Operator Subscription object to add an affinity:

    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: openshift-custom-metrics-autoscaler-operator
      namespace: openshift-keda
    spec:
      name: my-package
      source: my-operators
      sourceNamespace: operator-registries
      config:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - ip-10-0-185-229.ec2.internal
    #...

  - Add a nodeAffinity.

- To ensure that the pod is deployed on the specific node, run the following command:

    $ oc get pods -o wide

  Example output

    NAME                                                  READY   STATUS    RESTARTS   AGE   IP            NODE                           NOMINATED NODE   READINESS GATES
    custom-metrics-autoscaler-operator-5dcc45d656-bhshg   1/1     Running   0          50s   10.131.0.20   ip-10-0-185-229.ec2.internal   <none>           <none>