Enabling descheduler evictions on virtual machines
You can use the descheduler to evict pods so that the pods can be rescheduled onto more appropriate nodes. If the pod is a virtual machine, the pod eviction causes the virtual machine to be live migrated to another node.
Descheduler profiles
Use descheduler profiles to enable specific eviction strategies that rebalance your cluster based on criteria such as pod lifecycle or node utilization.
Use the KubeVirtRelieveAndMigrate or LongLifecycle profile to enable the descheduler on a virtual machine.
Important
You cannot have both KubeVirtRelieveAndMigrate and LongLifecycle enabled at the same time.
KubeVirtRelieveAndMigrate
This profile is an enhanced version of the LongLifecycle profile. The KubeVirtRelieveAndMigrate profile evicts pods from high-cost nodes to reduce overall resource expenses and enable workload migration. It also periodically rebalances workloads to help maintain similar spare capacity across nodes, which supports better handling of sudden workload spikes. Nodes can experience the following costs:
- Resource utilization: Increased resource pressure raises the overhead for running applications.
- Node maintenance: A higher number of containers on a node increases resource consumption and maintenance costs.
The profile enables the LowNodeUtilization strategy with the EvictionsInBackground alpha feature. The profile also exposes the following customization fields:
- devActualUtilizationProfile: Enables load-aware descheduling.
- devLowNodeUtilizationThresholds: Sets experimental thresholds for the LowNodeUtilization strategy. Do not use this field with devDeviationThresholds.
- devDeviationThresholds: Treats nodes with below-average resource usage as underutilized to help redistribute workloads from overutilized nodes. Do not use this field with devLowNodeUtilizationThresholds. Supported values are Low (10%:10%), Medium (20%:20%), High (30%:30%), AsymmetricLow (0%:10%), AsymmetricMedium (0%:20%), and AsymmetricHigh (0%:30%).
- devEnableSoftTainter: Enables the soft-tainting component to dynamically apply or remove soft taints as scheduling hints.
Example configuration

```yaml
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  managementState: Managed
  deschedulingIntervalSeconds: 30
  mode: "Automatic"
  profiles:
  - KubeVirtRelieveAndMigrate
  profileCustomizations:
    devEnableSoftTainter: true
    devDeviationThresholds: AsymmetricLow
    devActualUtilizationProfile: PrometheusCPUCombined
```

The KubeVirtRelieveAndMigrate profile requires PSI metrics to be enabled on all worker nodes. You can enable PSI metrics by applying the following MachineConfig custom resource (CR):

Example MachineConfig CR

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-openshift-machineconfig-worker-psi-karg
spec:
  kernelArguments:
  - psi=1
```

Note
The name of the MachineConfig object is significant because machine configs are processed in lexicographical order. By default, a config that starts with 98- disables PSI. To ensure that PSI is enabled, name your config with a higher prefix, such as 99-openshift-machineconfig-worker-psi-karg.

You can use this profile with the SoftTopologyAndDuplicates profile to also rebalance pods based on soft topology constraints, which can be useful in hosted control plane environments.
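To combine the two profiles, you list both under spec.profiles. The following is a minimal sketch based on the configuration pattern shown above; the interval value is illustrative, not a recommendation:

```yaml
# Illustrative sketch: enabling KubeVirtRelieveAndMigrate together with
# SoftTopologyAndDuplicates. The deschedulingIntervalSeconds value is an
# example only; tune it for your cluster.
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  managementState: Managed
  deschedulingIntervalSeconds: 30
  mode: Automatic
  profiles:
  - KubeVirtRelieveAndMigrate
  - SoftTopologyAndDuplicates
```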
LongLifecycle
This profile balances resource usage between nodes and enables the following strategies:
- RemovePodsHavingTooManyRestarts: Removes pods where the sum of restarts over all containers, including init containers, is more than 100. Restarting the VM guest operating system does not increase this count.
- LowNodeUtilization: Evicts pods from overutilized nodes when there are any underutilized nodes. The scheduler determines the destination node for the evicted pod.
  - A node is considered underutilized if its usage is below 20% for all thresholds (CPU, memory, and number of pods).
  - A node is considered overutilized if its usage is above 50% for any of the thresholds (CPU, memory, and number of pods).
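Following the configuration pattern shown earlier for the KubeVirtRelieveAndMigrate profile, a minimal KubeDescheduler object that enables the LongLifecycle profile might look like the following sketch; the interval value is illustrative:

```yaml
# Sketch: enabling the LongLifecycle profile. The deschedulingIntervalSeconds
# value is an example only; choose an interval appropriate for your workloads.
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  managementState: Managed
  deschedulingIntervalSeconds: 3600
  mode: Automatic
  profiles:
  - LongLifecycle
```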
Installing the descheduler
The descheduler is not available by default. To enable the descheduler, you must install the Kube Descheduler Operator from the software catalog and enable one or more descheduler profiles.
By default, the descheduler runs in predictive mode, which means that it only simulates pod evictions. You must change the mode to automatic for the descheduler to perform the pod evictions.
Important
If you have enabled hosted control planes in your cluster, set a custom priority threshold to lower the chance that pods in the hosted control plane namespaces are evicted. Set the priority threshold class name to hypershift-control-plane, because it has the lowest priority value (100000000) of the hosted control plane priority classes.
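Assuming the Kube Descheduler Operator exposes a thresholdPriorityClassName customization field for setting the priority threshold (an assumption; verify against your operator version), the setting described above might be expressed as follows:

```yaml
# Sketch: lower the chance that hosted control plane pods are evicted by
# setting the priority threshold to the hypershift-control-plane class.
# The thresholdPriorityClassName field name is assumed, not confirmed by
# this document.
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  profileCustomizations:
    thresholdPriorityClassName: hypershift-control-plane
```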
Prerequisites
- You are logged in to OpenShift Container Platform as a user with the cluster-admin role.
- You have access to the OpenShift Container Platform web console.
Procedure
1. Log in to the OpenShift Container Platform web console.
2. Create the required namespace for the Kube Descheduler Operator:
   a. Navigate to Administration → Namespaces and click Create Namespace.
   b. Enter openshift-kube-descheduler-operator in the Name field, enter openshift.io/cluster-monitoring=true in the Labels field to enable descheduler metrics, and click Create.
3. Install the Kube Descheduler Operator:
   a. Navigate to Ecosystem → Software Catalog.
   b. Type Kube Descheduler Operator into the filter box.
   c. Select the Kube Descheduler Operator and click Install.
   d. On the Install Operator page, select A specific namespace on the cluster, and then select openshift-kube-descheduler-operator from the drop-down menu.
   e. Adjust the values for the Update Channel and Approval Strategy as needed.
   f. Click Install.
4. Create a descheduler instance:
   a. From the Ecosystem → Installed Operators page, click the Kube Descheduler Operator.
   b. Select the Kube Descheduler tab and click Create KubeDescheduler.
   c. Edit the settings as necessary:
      - To evict pods instead of simulating the evictions, change the Mode field to Automatic.
      - Expand the Profiles section and select LongLifecycle. The AffinityAndTaints profile is enabled by default.

Important
The only profile currently available for OpenShift Virtualization is LongLifecycle.

You can also configure the profiles and settings for the descheduler later by using the OpenShift CLI (oc).
Configuring descheduler evictions for virtual machines
After the descheduler is installed and configured, all migratable virtual machines (VMs) are eligible for eviction by default. You can configure the descheduler to manage VM evictions across the cluster and optionally exclude specific VMs from eviction.
Prerequisites
- You have installed the descheduler by using the OpenShift Container Platform web console or OpenShift CLI (oc).
Procedure
1. Stop the VM.
2. Configure the KubeDescheduler object with the KubeVirtRelieveAndMigrate profile and enable background evictions for improved VM eviction stability during live migration:

```yaml
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 60
  profiles:
  - KubeVirtRelieveAndMigrate
  mode: Automatic
```
3. Optional: To evict pods, set the mode field value to Automatic. By default, the descheduler does not evict pods.
4. Optional: Configure limits for the number of parallel evictions to improve stability in large clusters.
   The descheduler can limit the number of concurrent evictions per node and across the cluster by using the evictionLimits field:

```yaml
spec:
  evictionLimits:
    node: 2
    total: 5
```

   Set values that correspond to the migration limits configured in the HyperConverged custom resource (CR):

```yaml
spec:
  liveMigrationConfig:
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 2
```
5. Optional: To exclude the VM from eviction, add the descheduler.alpha.kubernetes.io/prefer-no-eviction annotation to the spec.template.metadata.annotations field. The change is applied dynamically and is propagated to the VirtualMachineInstance (VMI) object and the virt-launcher pod.
   Only the presence of the annotation is checked. The value is not evaluated, so "true" and "false" have the same effect.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
spec:
  template:
    metadata:
      annotations:
        descheduler.alpha.kubernetes.io/prefer-no-eviction: "true"
```
6. Start the VM.

The VM is now configured according to the descheduler settings.