Configuring an OpenShift Container Platform cluster for pods
As an administrator, you can create and maintain an efficient cluster for pods.
By keeping your cluster efficient, you can provide a better environment for your developers using such tools as what a pod does when it exits, ensuring that the required number of pods is always running, when to restart pods designed to run only once, limit the bandwidth available to pods, and how to keep pods running during disruptions.
Configuring how pods behave after restart
A pod restart policy determines how OpenShift Container Platform responds when Containers in that pod exit. The policy applies to all Containers in that pod.
The possible values are:
-
Always- Tries restarting a successfully exited Container on the pod continuously, with an exponential back-off delay (10s, 20s, 40s) capped at 5 minutes. The default isAlways. -
OnFailure- Tries restarting a failed Container on the pod with an exponential back-off delay (10s, 20s, 40s) capped at 5 minutes. -
Never- Does not try to restart exited or failed Containers on the pod. Pods immediately fail and exit.
After the pod is bound to a node, the pod will never be bound to another node. This means that a controller is necessary in order for a pod to survive node failure:
| Condition | Controller Type | Restart Policy |
|---|---|---|
Pods that are expected to terminate (such as batch computations) |
Job |
|
Pods that are expected to not terminate (such as web servers) |
Replication controller |
|
Pods that must run one-per-machine |
Daemon set |
Any |
If a Container on a pod fails and the restart policy is set to OnFailure, the pod stays on the node and the Container is restarted. If you do not want the Container to
restart, use a restart policy of Never.
If an entire pod fails, OpenShift Container Platform starts a new pod. Developers must address the possibility that applications might be restarted in a new pod. In particular, applications must handle temporary files, locks, incomplete output, and so forth caused by previous runs.
Note
Kubernetes architecture expects reliable endpoints from cloud providers. When a cloud provider is down, the kubelet prevents OpenShift Container Platform from restarting.
If the underlying cloud provider endpoints are not reliable, do not install a cluster using cloud provider integration. Install the cluster as if it was in a no-cloud environment. It is not recommended to toggle cloud provider integration on or off in an installed cluster.
For details on how OpenShift Container Platform uses restart policy with failed Containers, see the Example States in the Kubernetes documentation.
Limiting the bandwidth available to pods
You can apply quality-of-service traffic shaping to a pod and effectively limit its available bandwidth. Egress traffic (from the pod) is handled by policing, which simply drops packets in excess of the configured rate. Ingress traffic (to the pod) is handled by shaping queued packets to effectively handle data. The limits you place on a pod do not affect the bandwidth of other pods.
To limit the bandwidth on a pod:
-
Write an object definition JSON file, and specify the data traffic speed using
kubernetes.io/ingress-bandwidthandkubernetes.io/egress-bandwidthannotations. For example, to limit both pod egress and ingress bandwidth to 10M/s:LimitedPodobject definition{ "kind": "Pod", "spec": { "containers": [ { "image": "openshift/hello-openshift", "name": "hello-openshift" } ] }, "apiVersion": "v1", "metadata": { "name": "iperf-slow", "annotations": { "kubernetes.io/ingress-bandwidth": "10M", "kubernetes.io/egress-bandwidth": "10M" } } } -
Create the pod using the object definition:
$ oc create -f <file_or_dir_path>
Understanding how to use pod disruption budgets to specify the number of pods that must be up
A pod disruption budget allows the specification of safety constraints on pods during operations, such as draining a node for maintenance.
PodDisruptionBudget is an API object that specifies the minimum number or
percentage of replicas that must be up at a time. Setting these in projects can
be helpful during node maintenance (such as scaling a cluster down or a cluster
upgrade) and is only honored on voluntary evictions (not on node failures).
A PodDisruptionBudget object’s configuration consists of the following key
parts:
-
A label selector, which is a label query over a set of pods.
-
An availability level, which specifies the minimum number of pods that must be available simultaneously, either:
-
minAvailableis the number of pods must always be available, even during a disruption. -
maxUnavailableis the number of pods can be unavailable during a disruption.
-
Note
Available refers to the number of pods that has condition Ready=True.
Ready=True refers to the pod that is able to serve requests and should be added to the load balancing pools of all matching services.
A maxUnavailable of 0% or 0 or a minAvailable of 100% or equal to the number of replicas
is permitted but can block nodes from being drained.
Warning
The default setting for maxUnavailable is 1 for all the machine config pools in OpenShift Container Platform. It is recommended to not change this value and update one control plane node at a time. Do not change this value to 3 for the control plane pool.
You can check for pod disruption budgets across all projects with the following:
$ oc get poddisruptionbudget --all-namespaces
Note
The following example contains some values that are specific to OpenShift Container Platform on AWS.
NAMESPACE NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
openshift-apiserver openshift-apiserver-pdb N/A 1 1 121m
openshift-cloud-controller-manager aws-cloud-controller-manager 1 N/A 1 125m
openshift-cloud-credential-operator pod-identity-webhook 1 N/A 1 117m
openshift-cluster-csi-drivers aws-ebs-csi-driver-controller-pdb N/A 1 1 121m
openshift-cluster-storage-operator csi-snapshot-controller-pdb N/A 1 1 122m
openshift-cluster-storage-operator csi-snapshot-webhook-pdb N/A 1 1 122m
openshift-console console N/A 1 1 116m
#...
The PodDisruptionBudget is considered healthy when there are at least
minAvailable pods running in the system. Every pod above that limit can be evicted.
Note
Depending on your pod priority and preemption settings, lower-priority pods might be removed despite their pod disruption budget requirements.
Specifying the number of pods that must be up with pod disruption budgets
You can use a PodDisruptionBudget object to specify the minimum number or
percentage of replicas that must be up at a time.
To configure a pod disruption budget:
-
Create a YAML file with the an object definition similar to the following:
apiVersion: policy/v1 kind: PodDisruptionBudget metadata: name: my-pdb spec: minAvailable: 2 selector: matchLabels: name: my-podPodDisruptionBudgetis part of thepolicy/v1API group.- The minimum number of pods that must be available simultaneously. This can
be either an integer or a string specifying a percentage, for example,
20%. - A label query over a set of resources. The result of
matchLabelsandmatchExpressionsare logically conjoined. Leave this parameter blank, for exampleselector {}, to select all pods in the project.Or:
apiVersion: policy/v1 kind: PodDisruptionBudget metadata: name: my-pdb spec: maxUnavailable: 25% selector: matchLabels: name: my-pod PodDisruptionBudgetis part of thepolicy/v1API group.- The maximum number of pods that can be unavailable simultaneously. This can
be either an integer or a string specifying a percentage, for example,
20%. - A label query over a set of resources. The result of
matchLabelsandmatchExpressionsare logically conjoined. Leave this parameter blank, for exampleselector {}, to select all pods in the project.
-
Run the following command to add the object to project:
$ oc create -f </path/to/file> -n <project_name>
Specifying the eviction policy for unhealthy pods
When you use pod disruption budgets (PDBs) to specify how many pods must be available simultaneously, you can also define the criteria for how unhealthy pods are considered for eviction.
You can choose one of the following policies:
- IfHealthyBudget
-
Running pods that are not yet healthy can be evicted only if the guarded application is not disrupted.
- AlwaysAllow
-
Running pods that are not yet healthy can be evicted regardless of whether the criteria in the pod disruption budget is met. This policy can help evict malfunctioning applications, such as ones with pods stuck in the
CrashLoopBackOffstate or failing to report theReadystatus.Note
It is recommended to set the
unhealthyPodEvictionPolicyfield toAlwaysAllowin thePodDisruptionBudgetobject to support the eviction of misbehaving applications during a node drain. The default behavior is to wait for the application pods to become healthy before the drain can proceed.
-
Create a YAML file that defines a
PodDisruptionBudgetobject and specify the unhealthy pod eviction policy:Examplepod-disruption-budget.yamlfileapiVersion: policy/v1 kind: PodDisruptionBudget metadata: name: my-pdb spec: minAvailable: 2 selector: matchLabels: name: my-pod unhealthyPodEvictionPolicy: AlwaysAllow- Choose either
IfHealthyBudgetorAlwaysAllowas the unhealthy pod eviction policy. The default isIfHealthyBudgetwhen theunhealthyPodEvictionPolicyfield is empty.
- Choose either
-
Create the
PodDisruptionBudgetobject by running the following command:$ oc create -f pod-disruption-budget.yaml
With a PDB that has the AlwaysAllow unhealthy pod eviction policy set, you can now drain nodes and evict the pods for a malfunctioning application guarded by this PDB.
Preventing pod removal using critical pods
There are a number of core components that are critical to a fully functional cluster, but, run on a regular cluster node rather than the master. A cluster might stop working properly if a critical add-on is evicted.
Pods marked as critical are not allowed to be evicted.
To make a pod critical:
-
Create a
Podspec or edit existing pods to include thesystem-cluster-criticalpriority class:apiVersion: v1 kind: Pod metadata: name: my-pdb spec: template: metadata: name: critical-pod priorityClassName: system-cluster-critical # ...- Default priority class for pods that should never be evicted from a node.
Alternatively, you can specify
system-node-criticalfor pods that are important to the cluster but can be removed if necessary.
- Default priority class for pods that should never be evicted from a node.
-
Create the pod:
$ oc create -f <file-name>.yaml
Reducing pod timeouts when using persistent volumes with high file counts
If a storage volume contains many files (~1,000,000 or greater), you might experience pod timeouts.
This can occur because, when volumes are mounted, OpenShift Container Platform recursively changes the ownership and permissions of the contents of each volume in order to match the fsGroup specified in a pod’s securityContext. For large volumes, checking and changing the ownership and permissions can be time consuming, resulting in a very slow pod startup.
You can reduce this delay by applying one of the following workarounds:
-
Use a security context constraint (SCC) to skip the SELinux relabeling for a volume.
-
Use the
fsGroupChangePolicyfield inside an SCC to control the way that OpenShift Container Platform checks and manages ownership and permissions for a volume. -
Use the Cluster Resource Override Operator to automatically apply an SCC to skip the SELinux relabeling.
-
Use a runtime class to skip the SELinux relabeling for a volume.