Virtual machine health checks
You can configure virtual machine (VM) health checks by defining readiness and liveness probes in the VirtualMachine resource.
About readiness and liveness probes
Use readiness and liveness probes to detect and handle unhealthy virtual machines (VMs). You can include one or more probes in the specification of the VM to ensure that traffic does not reach a VM that is not ready for it and that a new VM is created when a VM becomes unresponsive.
A readiness probe determines whether a VM is ready to accept service requests. If the probe fails, the VM is removed from the list of available endpoints until the VM is ready.
A liveness probe determines whether a VM is responsive. If the probe fails, the VM is deleted and a new VM is created to restore responsiveness.
You can configure readiness and liveness probes by setting the spec.readinessProbe and the spec.livenessProbe fields of the VirtualMachine object. These fields support the following tests:
- HTTP GET
-
The probe determines the health of the VM by using a web hook. The test is successful if the HTTP response code is between 200 and 399. You can use an HTTP GET test with applications that return HTTP status codes when they are completely initialized.
- TCP socket
-
The probe attempts to open a socket to the VM. The VM is only considered healthy if the probe can establish a connection. You can use a TCP socket test with applications that do not start listening until initialization is complete.
- Guest agent ping
-
The probe uses the
guest-pingcommand to determine if the QEMU guest agent is running on the virtual machine.
Defining an HTTP readiness probe
You can define an HTTP readiness probe by setting the spec.readinessProbe.httpGet field of the virtual machine (VM) configuration.
-
You have installed the OpenShift CLI (
oc).
-
Include details of the readiness probe in the VM configuration file.
Sample readiness probe with an HTTP GET test:
apiVersion: kubevirt.io/v1 kind: VirtualMachine metadata: annotations: name: fedora-vm namespace: example-namespace # ... spec: template: spec: readinessProbe: httpGet: port: 1500 path: /healthz httpHeaders: - name: Custom-Header value: Awesome initialDelaySeconds: 120 periodSeconds: 20 timeoutSeconds: 10 failureThreshold: 3 successThreshold: 3 # ...- The HTTP GET request to perform to connect to the VM.
- The port of the VM that the probe queries. In the above example, the probe queries port 1500.
- The path to access on the HTTP server. In the above example, if the handler for the server’s /healthz path returns a success code, the VM is considered to be healthy. If the handler returns a failure code, the VM is removed from the list of available endpoints.
- The time, in seconds, after the VM starts before the readiness probe is initiated.
- The delay, in seconds, between performing probes. The default delay is 10 seconds. This value must be greater than
timeoutSeconds. - The number of seconds of inactivity after which the probe times out and the VM is assumed to have failed. The default value is 1. This value must be lower than
periodSeconds. - The number of times that the probe is allowed to fail. The default is 3. After the specified number of attempts, the pod is marked
Unready. - The number of times that the probe must report success, after a failure, to be considered successful. The default is 1.
-
Create the VM by running the following command:
$ oc create -f <file_name>.yaml
Defining a TCP readiness probe
You can define a TCP readiness probe by setting the spec.readinessProbe.tcpSocket field of the virtual machine (VM) configuration.
-
You have installed the OpenShift CLI (
oc).
-
Include details of the TCP readiness probe in the VM configuration file.
Sample readiness probe with a TCP socket test:
apiVersion: kubevirt.io/v1 kind: VirtualMachine metadata: annotations: name: fedora-vm namespace: example-namespace # ... spec: template: spec: readinessProbe: initialDelaySeconds: 120 periodSeconds: 20 tcpSocket: port: 1500 timeoutSeconds: 10 # ...- The time, in seconds, after the VM starts before the readiness probe is initiated.
- The delay, in seconds, between performing probes. The default delay is 10 seconds. This value must be greater than
timeoutSeconds. - The TCP action to perform.
- The port of the VM that the probe queries.
- The number of seconds of inactivity after which the probe times out and the VM is assumed to have failed. The default value is 1. This value must be lower than
periodSeconds.
-
Create the VM by running the following command:
$ oc create -f <file_name>.yaml
Defining an HTTP liveness probe
Define an HTTP liveness probe by setting the spec.livenessProbe.httpGet field of the virtual machine (VM) configuration. You can define both HTTP and TCP tests for liveness probes in the same way as readiness probes. This procedure configures a sample liveness probe with an HTTP GET test.
-
You have installed the OpenShift CLI (
oc).
-
Include details of the HTTP liveness probe in the VM configuration file.
Sample liveness probe with an HTTP GET test:
apiVersion: kubevirt.io/v1 kind: VirtualMachine metadata: annotations: name: fedora-vm namespace: example-namespace # ... spec: template: spec: livenessProbe: initialDelaySeconds: 120 periodSeconds: 20 httpGet: port: 1500 path: /healthz httpHeaders: - name: Custom-Header value: Awesome timeoutSeconds: 10 # ...- The time, in seconds, after the VM starts before the liveness probe is initiated.
- The delay, in seconds, between performing probes. The default delay is 10 seconds. This value must be greater than
timeoutSeconds. - The HTTP GET request to perform to connect to the VM.
- The port of the VM that the probe queries. In the above example, the probe queries port 1500. The VM installs and runs a minimal HTTP server on port 1500 via cloud-init.
- The path to access on the HTTP server. In the above example, if the handler for the server’s
/healthzpath returns a success code, the VM is considered to be healthy. If the handler returns a failure code, the VM is deleted and a new VM is created. - The number of seconds of inactivity after which the probe times out and the VM is assumed to have failed. The default value is 1. This value must be lower than
periodSeconds.
-
Create the VM by running the following command:
$ oc create -f <file_name>.yaml
Defining a watchdog
You can define a watchdog to monitor the health of the guest operating system by performing the following steps:
-
Configure a watchdog device for the virtual machine (VM).
-
Install the watchdog agent on the guest.
The watchdog device monitors the agent and performs one of the following actions if the guest operating system is unresponsive:
-
poweroff: The VM powers down immediately. Ifspec.runStrategyis not set tomanual, the VM reboots. -
reset: The VM reboots in place and the guest operating system cannot react.Note
The reboot time might cause liveness probes to time out. If cluster-level protections detect a failed liveness probe, the VM might be forcibly rescheduled, increasing the reboot time.
-
shutdown: The VM gracefully powers down by stopping all services.
Note
Watchdog is not available for Windows VMs.
Configuring a watchdog device for the virtual machine
You configure a watchdog device for the virtual machine (VM).
-
For
x86systems, the VM must use a kernel that works with thei6300esbwatchdog device. If you uses390xarchitecture, the kernel must be enabled fordiag288. Red Hat Enterprise Linux (RHEL) images supporti6300esbanddiag288. -
You have installed the OpenShift CLI (
oc).
-
Create a
YAMLfile with the following contents:apiVersion: kubevirt.io/v1 kind: VirtualMachine metadata: labels: kubevirt.io/vm: <vm-label> name: <vm-name> spec: runStrategy: Halted template: metadata: labels: kubevirt.io/vm: <vm-label> spec: domain: devices: watchdog: name: <watchdog> <watchdog-device-model>: action: "poweroff" # ...- The watchdog device model to use. For
x86specifyi6300esb. Fors390xspecifydiag288. - Specify
poweroff,reset, orshutdown. Theshutdownaction requires that the guest virtual machine is responsive to ACPI signals. Therefore, usingshutdownis not recommended.The example above configures the watchdog device on a VM with the
poweroffaction and exposes the device as/dev/watchdog.This device can now be used by the watchdog binary.
- The watchdog device model to use. For
-
Apply the YAML file to your cluster by running the following command:
$ oc apply -f <file_name>.yaml
Important
This procedure is provided for testing watchdog functionality only and must not be run on production machines.
-
Run the following command to verify that the VM is connected to the watchdog device:
$ lspci | grep watchdog -i -
Run one of the following commands to confirm the watchdog is active:
-
Trigger a kernel panic:
# echo c > /proc/sysrq-trigger -
Stop the watchdog service:
# pkill -9 watchdog
-
Installing the watchdog agent on the guest
You can install the watchdog agent on the guest and start the watchdog service.
-
Log in to the virtual machine as root user.
-
This step is only required when installing on IBM Z® (
s390x). Enablewatchdogby running the following command:# modprobe diag288_wdt -
Verify that the
/dev/watchdogfile path is present in the VM by running the following command:# ls /dev/watchdog -
Install the
watchdogpackage and its dependencies:# yum install watchdog -
Uncomment the following line in the
/etc/watchdog.conffile and save the changes:#watchdog-device = /dev/watchdog -
Enable the
watchdogservice to start on boot:# systemctl enable --now watchdog.service
Defining a guest agent ping probe
You can define a guest agent ping probe by setting the spec.readinessProbe.guestAgentPing field of the virtual machine (VM) configuration.
-
The QEMU guest agent must be installed and enabled on the virtual machine.
-
You have installed the OpenShift CLI (
oc).
-
Include details of the guest agent ping probe in the VM configuration file. For example:
apiVersion: kubevirt.io/v1 kind: VirtualMachine metadata: annotations: name: fedora-vm namespace: example-namespace # ... spec: template: spec: readinessProbe: guestAgentPing: {} initialDelaySeconds: 120 periodSeconds: 20 timeoutSeconds: 10 failureThreshold: 3 successThreshold: 3 # ...- The guest agent ping probe to connect to the VM.
- Optional: The time, in seconds, after the VM starts before the guest agent probe is initiated.
- Optional: The delay, in seconds, between performing probes. The default delay is 10 seconds. This value must be greater than
timeoutSeconds. - Optional: The number of seconds of inactivity after which the probe times out and the VM is assumed to have failed. The default value is 1. This value must be lower than
periodSeconds. - Optional: The number of times that the probe is allowed to fail. The default is 3. After the specified number of attempts, the pod is marked
Unready. - Optional: The number of times that the probe must report success, after a failure, to be considered successful. The default is 1.
-
Create the VM by running the following command:
$ oc create -f <file_name>.yaml