Descheduler overview
While the scheduler is used to determine the most suitable node to host a new pod, the descheduler can be used to evict a running pod so that the pod can be rescheduled onto a more suitable node.
About the descheduler
You can use the descheduler to evict pods based on specific strategies so that the pods are rescheduled onto more appropriate nodes.
You can benefit from descheduling running pods in situations such as the following:
-
Nodes are underutilized or overutilized.
-
Pod and node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes.
-
Node failure requires pods to be moved.
-
New nodes are added to clusters.
-
Pods have been restarted too many times.
Important
The descheduler does not schedule replacements for evicted pods. The scheduler automatically performs this task for the evicted pods.
When the descheduler decides to evict pods from a node, it employs the following general mechanism:
-
Pods in the openshift-* and kube-system namespaces are never evicted.
-
Critical pods with priorityClassName set to system-cluster-critical or system-node-critical are never evicted.
-
Static, mirrored, or standalone pods that are not part of a replication controller, replica set, deployment, stateful set, or job are never evicted because these pods are not recreated.
-
Pods associated with daemon sets are never evicted.
-
Pods with local storage are never evicted.
-
Best effort pods are evicted before burstable and guaranteed pods.
-
All types of pods with the descheduler.alpha.kubernetes.io/evict annotation are eligible for eviction. This annotation overrides the checks that prevent eviction, so the user can select which pod is evicted. Users should know how and if the pod will be recreated.
-
Pods subject to a pod disruption budget (PDB) are not evicted if descheduling would violate that PDB. Pods are evicted by using the eviction subresource so that PDBs are honored.
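As an illustration of the annotation-based override in the list above, the following sketch marks a pod as eligible for eviction. The pod name, namespace, image, and the annotation value "true" are assumptions made for this example; the text above names only the annotation key.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod              # hypothetical name
  namespace: example-namespace   # hypothetical namespace
  annotations:
    # Marks this pod as eligible for eviction even if another check
    # would normally protect it. The value "true" is an assumption;
    # the source names only the annotation key.
    descheduler.alpha.kubernetes.io/evict: "true"
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
```

A standalone pod such as this one is not managed by a controller, so it is not recreated after eviction. This is why anyone who sets the annotation needs to know how and if the pod will be recreated.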
Descheduler profiles
Use descheduler profiles to enable specific eviction strategies that rebalance your cluster based on criteria such as pod lifecycle or node utilization.
The following descheduler profiles are available:
AffinityAndTaints
This profile evicts pods that violate inter-pod anti-affinity, node affinity, and node taints.
It enables the following strategies:
-
RemovePodsViolatingInterPodAntiAffinity: removes pods that are violating inter-pod anti-affinity.
-
RemovePodsViolatingNodeAffinity: removes pods that are violating node affinity.
-
RemovePodsViolatingNodeTaints: removes pods that are violating NoSchedule taints on nodes.
Pods with a node affinity type of requiredDuringSchedulingIgnoredDuringExecution are removed.
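For context, a workload whose pods the AffinityAndTaints profile might evict could declare a required node affinity like the sketch below. If a node's labels later change so that the node no longer matches, the pods running there violate their node affinity. The deployment name, the disktype=ssd label, and the image are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: affinity-example          # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: affinity-example
  template:
    metadata:
      labels:
        app: affinity-example
    spec:
      affinity:
        nodeAffinity:
          # Enforced at scheduling time only; the scheduler does not
          # re-check this rule after the pod is running.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: disktype   # hypothetical node label
                    operator: In
                    values:
                      - ssd
      containers:
        - name: app
          image: registry.example.com/app:latest   # placeholder image
```

A deployment is used here rather than a bare pod because standalone pods are never evicted by default.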
TopologyAndDuplicates
This profile evicts pods in an effort to evenly spread similar pods, or pods of the same topology domain, among nodes.
It enables the following strategies:
-
RemovePodsViolatingTopologySpreadConstraint: finds unbalanced topology domains and tries to evict pods from larger ones when DoNotSchedule constraints are violated.
-
RemoveDuplicates: ensures that there is only one pod associated with a replica set, replication controller, deployment, or job running on the same node. If there are more, those duplicate pods are evicted for better pod distribution in a cluster.
Warning
Do not enable TopologyAndDuplicates with any of the following profiles: SoftTopologyAndDuplicates or CompactAndScale. Enabling these profiles together results in a conflict.
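To make the DoNotSchedule case concrete, the following sketch shows a deployment with a hard topology spread constraint of the kind that RemovePodsViolatingTopologySpreadConstraint evaluates. The deployment name, replica count, zone topology key, and image are assumptions chosen for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-example            # hypothetical name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: spread-example
  template:
    metadata:
      labels:
        app: spread-example
    spec:
      topologySpreadConstraints:
        # Hard constraint: the pod counts of any two zones may
        # differ by at most one.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: spread-example
      containers:
        - name: app
          image: registry.example.com/app:latest   # placeholder image
```

Setting whenUnsatisfiable to ScheduleAnyway instead would make this a soft constraint, which the TopologyAndDuplicates profile does not consider.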
LifecycleAndUtilization
This profile evicts long-running pods and balances resource usage between nodes.
It enables the following strategies:
-
RemovePodsHavingTooManyRestarts: removes pods whose containers have been restarted too many times. Pods are removed when the sum of restarts over all containers (including init containers) is more than 100.
-
LowNodeUtilization: finds underutilized nodes and, if possible, evicts pods from overutilized nodes in the hope that the scheduler places the recreated pods on the underutilized nodes.
  -
  A node is considered underutilized if its usage is below 20% for all thresholds (CPU, memory, and number of pods).
  -
  A node is considered overutilized if its usage is above 50% for any of the thresholds (CPU, memory, and number of pods).
Optionally, you can adjust these underutilized/overutilized threshold percentages by setting the Technology Preview field devLowNodeUtilizationThresholds to one of the following values: Low for 10%/30%, Medium for 20%/50%, or High for 40%/70%. The default value is Medium.
-
PodLifeTime: evicts pods that are too old. By default, pods that are older than 24 hours are removed. You can customize the pod lifetime value.
Warning
Do not enable LifecycleAndUtilization with any of the following profiles: LongLifecycle or CompactAndScale. Enabling these profiles together results in a conflict.
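A minimal sketch of a KubeDescheduler resource that enables this profile with adjusted thresholds might look like the following. The interval value is illustrative, and podLifetime is an assumed field name for the pod lifetime customization, which the text above does not name; verify both against the operator's API before use.

```yaml
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  managementState: Managed
  deschedulingIntervalSeconds: 3600   # illustrative interval
  profiles:
    - LifecycleAndUtilization
  profileCustomizations:
    # Technology Preview: Low means underutilized below 10%,
    # overutilized above 30%.
    devLowNodeUtilizationThresholds: Low
    # Assumed field name: raises the pod lifetime from the
    # 24-hour default to 48 hours.
    podLifetime: 48h
```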
SoftTopologyAndDuplicates
This profile is the same as TopologyAndDuplicates, except that pods with soft topology constraints, such as whenUnsatisfiable: ScheduleAnyway, are also considered for eviction.
Warning
Do not enable both SoftTopologyAndDuplicates and TopologyAndDuplicates. Enabling both results in a conflict.
EvictPodsWithLocalStorage
This profile allows pods with local storage to be eligible for eviction.
EvictPodsWithPVC
This profile allows pods with persistent volume claims to be eligible for eviction. If you are using Kubernetes NFS Subdir External Provisioner, you must add an excluded namespace for the namespace where the provisioner is installed.
CompactAndScale
This profile enables the HighNodeUtilization strategy, which attempts to evict pods from underutilized nodes to allow a workload to run on a smaller set of nodes. A node is considered underutilized if its usage is below 20% for all thresholds (CPU, memory, and number of pods).
Optionally, you can adjust the underutilized percentage by setting the Technology Preview field devHighNodeUtilizationThresholds to one of the following values: Minimal for 10%, Modest for 20%, or Moderate for 30%. The default value is Modest.
Warning
Do not enable CompactAndScale with any of the following profiles: LifecycleAndUtilization, LongLifecycle, or TopologyAndDuplicates. Enabling these profiles together results in a conflict.
KubeVirtRelieveAndMigrate
This profile is an enhanced version of the LongLifecycle profile.
The KubeVirtRelieveAndMigrate profile evicts pods from high-cost nodes to reduce overall resource expenses and enable workload migration. It also periodically rebalances workloads to help maintain similar spare capacity across nodes, which supports better handling of sudden workload spikes. Nodes can experience the following costs:
-
Resource utilization: Increased resource pressure raises the overhead for running applications.
-
Node maintenance: A higher number of containers on a node increases resource consumption and maintenance costs.
The profile enables the LowNodeUtilization strategy with the EvictionsInBackground alpha feature. The profile also exposes the following customization fields:
-
devActualUtilizationProfile: Enables load-aware descheduling.
-
devLowNodeUtilizationThresholds: Sets experimental thresholds for the LowNodeUtilization strategy. Do not use this field with devDeviationThresholds.
-
devDeviationThresholds: Treats nodes with below-average resource usage as underutilized to help redistribute workloads from overutilized nodes. Do not use this field with devLowNodeUtilizationThresholds. Supported values are: Low (10%:10%), Medium (20%:20%), High (30%:30%), AsymmetricLow (0%:10%), AsymmetricMedium (0%:20%), and AsymmetricHigh (0%:30%).
-
devEnableSoftTainter: Enables the soft-tainting component to dynamically apply or remove soft taints as scheduling hints.
Example configuration
    apiVersion: operator.openshift.io/v1
    kind: KubeDescheduler
    metadata:
      name: cluster
      namespace: openshift-kube-descheduler-operator
    spec:
      managementState: Managed
      deschedulingIntervalSeconds: 30
      mode: "Automatic"
      profiles:
        - KubeVirtRelieveAndMigrate
      profileCustomizations:
        devEnableSoftTainter: true
        devDeviationThresholds: AsymmetricLow
        devActualUtilizationProfile: PrometheusCPUCombined
The KubeVirtRelieveAndMigrate profile requires PSI metrics to be enabled on all worker nodes. You can enable this by applying the following MachineConfig custom resource (CR):
Example MachineConfig CR
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: worker
      name: 99-openshift-machineconfig-worker-psi-karg
    spec:
      kernelArguments:
        - psi=1
Note
The name of the MachineConfig object is significant because machine configs are processed in lexicographical order. By default, a config that starts with 98- disables PSI. To ensure that PSI is enabled, name your config with a higher prefix, such as 99-openshift-machineconfig-worker-psi-karg.
You can use this profile with the SoftTopologyAndDuplicates profile to also rebalance pods based on soft topology constraints, which can be useful in hosted control plane environments.
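Combining the two profiles could be sketched as follows. This reuses the values from the example configuration for this profile and only adds SoftTopologyAndDuplicates to the profiles list; treat it as an illustration under those assumptions rather than a recommended setup.

```yaml
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  managementState: Managed
  deschedulingIntervalSeconds: 30
  mode: "Automatic"
  profiles:
    - KubeVirtRelieveAndMigrate
    # Also considers pods with soft topology constraints
    # (whenUnsatisfiable: ScheduleAnyway) for eviction.
    - SoftTopologyAndDuplicates
  profileCustomizations:
    devEnableSoftTainter: true
    devDeviationThresholds: AsymmetricLow
    devActualUtilizationProfile: PrometheusCPUCombined
```

Because SoftTopologyAndDuplicates conflicts with TopologyAndDuplicates, do not list both in the same resource.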
LongLifecycle
This profile balances resource usage between nodes and enables the following strategies:
-
RemovePodsHavingTooManyRestarts: removes pods whose containers have been restarted too many times and pods where the sum of restarts over all containers (including init containers) is more than 100. Restarting the VM guest operating system does not increase this count.
-
LowNodeUtilization: evicts pods from overutilized nodes when there are any underutilized nodes. The scheduler determines the destination node for the evicted pod.
  -
  A node is considered underutilized if its usage is below 20% for all thresholds (CPU, memory, and number of pods).
  -
  A node is considered overutilized if its usage is above 50% for any of the thresholds (CPU, memory, and number of pods).
Warning
Do not enable LongLifecycle with any of the following profiles: LifecycleAndUtilization or CompactAndScale. Enabling these profiles together results in a conflict.
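As a rough sketch, enabling LongLifecycle on its own might look like the following. The interval is an illustrative value, and no customization fields are set because the text above does not state which ones apply to this profile.

```yaml
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  managementState: Managed
  deschedulingIntervalSeconds: 3600   # illustrative interval
  profiles:
    # Do not add LifecycleAndUtilization or CompactAndScale here;
    # both conflict with LongLifecycle.
    - LongLifecycle
```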