Troubleshooting operating system issues
OpenShift Container Platform runs on Red Hat Enterprise Linux CoreOS (RHCOS). You can follow these procedures to troubleshoot problems related to the operating system.
Investigating kernel crashes
The kdump service, included in the kexec-tools package, provides a crash-dumping mechanism. You can use this service to save the contents of a system’s memory for later analysis.
Enabling kdump
RHCOS ships with the kexec-tools package, but manual configuration is required to enable the kdump service.
- To reserve memory for the crash kernel during the first kernel booting, provide kernel arguments by entering the following command:

  # rpm-ostree kargs --append='crashkernel=256M'

  Note
  For the ppc64le platform, the recommended value for crashkernel is crashkernel=2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G.

- Optional: To write the crash dump over the network or to some other location, rather than to the default local /var/crash location, edit the /etc/kdump.conf configuration file.

  Note
  If your node uses LUKS-encrypted devices, you must use network dumps because kdump does not support saving crash dumps to LUKS-encrypted devices.

  For details on configuring the kdump service, see the comments in /etc/sysconfig/kdump, /etc/kdump.conf, and the kdump.conf manual page.

  Important
  If you have multipathing enabled on your primary disk, the dump target must be either an NFS or SSH server, and you must exclude the multipath module from your /etc/kdump.conf configuration file.

- Enable the kdump systemd service:

  # systemctl enable kdump.service

- Reboot your system:

  # systemctl reboot

- Ensure that kdump has loaded a crash kernel by checking that the kdump.service systemd service has started and exited successfully and that the command cat /sys/kernel/kexec_crash_loaded prints the value 1.
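The crashkernel range syntax recommended for ppc64le above can be read as a lookup table: the kernel reserves the size whose range contains the machine's total RAM, and an open-ended range such as 128G- matches everything from 128G up. The following sketch mimics that documented lookup; resolve_crashkernel is a hypothetical helper for illustration, not part of kexec-tools.

```shell
# Hypothetical helper: resolve a crashkernel range spec for a given total
# RAM size in GiB, mimicking the kernel's documented range matching
# (RAM >= start and RAM < end; an empty end means unbounded).
resolve_crashkernel() {
  spec=$1      # e.g. 2G-4G:384M,4G-16G:512M,...
  ram_gib=$2   # total system RAM in GiB
  echo "$spec" | tr ',' '\n' | while IFS=: read -r range size; do
    low=${range%%-*}; high=${range#*-}
    low=${low%G}; high=${high%G}
    [ -z "$high" ] && high=1048576   # open-ended upper bound
    if [ "$ram_gib" -ge "$low" ] && [ "$ram_gib" -lt "$high" ]; then
      echo "$size"
      break
    fi
  done
}

resolve_crashkernel '2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G' 32   # → 1G
```

A 32 GiB node falls in the 16G-64G range, so 1G is reserved for the crash kernel.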
Enabling kdump on day-1
The kdump service is intended to be enabled per node to debug kernel problems. Because enabling kdump has costs, and these costs accumulate with each additional kdump-enabled node, enable the kdump service only on the nodes where you need it. Potential costs of enabling the kdump service on each node include:
-
Less available RAM due to memory being reserved for the crash kernel.
-
Node unavailability while the kernel is dumping the core.
-
Additional storage space being used to store the crash dumps.
If you are aware of the downsides and trade-offs of having the kdump service enabled, it is possible to enable kdump in a cluster-wide fashion. Although machine-specific machine configs are not yet supported, you can use a systemd unit in a MachineConfig object as a day-1 customization and have kdump enabled on all nodes in the cluster. You can create a MachineConfig object and inject that object into the set of manifest files used by Ignition during cluster setup.
Note
See "Customizing nodes" in the Installing → Installation configuration section for more information and examples on how to use Ignition configs.
- Create a Butane config file, 99-worker-kdump.bu, that configures and enables kdump. This creates a MachineConfig object for cluster-wide configuration.

  Note
  The Butane version you specify in the config file should match the OpenShift Container Platform version and always end in 0. For example, 4.19.0. See "Creating machine configs with Butane" for information about Butane.

  variant: openshift
  version: 4.19.0
  metadata:
    name: 99-worker-kdump
    labels:
      machineconfiguration.openshift.io/role: worker
  openshift:
    kernel_arguments:
      - crashkernel=256M
  storage:
    files:
      - path: /etc/kdump.conf
        mode: 0644
        overwrite: true
        contents:
          inline: |
            path /var/crash
            core_collector makedumpfile -l --message-level 7 -d 31
      - path: /etc/sysconfig/kdump
        mode: 0644
        overwrite: true
        contents:
          inline: |
            KDUMP_COMMANDLINE_REMOVE="hugepages hugepagesz slub_debug quiet log_buf_len swiotlb"
            KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable"
            KEXEC_ARGS="-s"
            KDUMP_IMG="vmlinuz"
  systemd:
    units:
      - name: kdump.service
        enabled: true

  where:

  - Replace worker with master in both locations when creating a MachineConfig object for control plane nodes.
  - Provide kernel arguments to reserve memory for the crash kernel. You can add other kernel arguments if necessary. For the ppc64le platform, the recommended value for crashkernel is crashkernel=2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G.
  - If you want to change the contents of /etc/kdump.conf from the default, include this section and modify the inline subsection accordingly.
  - If you want to change the contents of /etc/sysconfig/kdump from the default, include this section and modify the inline subsection accordingly.
  - For the ppc64le platform, replace nr_cpus=1 with maxcpus=1, because nr_cpus=1 is not supported on this platform.
  Example /etc/kdump.conf file

  To export the dumps to NFS targets, some kernel modules must be explicitly added to the configuration file:

  nfs server.example.com:/export/cores
  core_collector makedumpfile -l --message-level 7 -d 31
  extra_bins /sbin/mount.nfs
  extra_modules nfs nfsv3 nfs_layout_nfsv41_files blocklayoutdriver nfs_layout_flexfiles
- Use Butane to generate a machine config YAML file, 99-worker-kdump.yaml, containing the configuration to be delivered to the nodes:

  $ butane 99-worker-kdump.bu -o 99-worker-kdump.yaml

- Put the YAML file into the <installation_directory>/manifests/ directory during cluster setup. You can also create this MachineConfig object after cluster setup with the YAML file:

  $ oc create -f 99-worker-kdump.yaml
Testing the kdump configuration
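A common way to test a kdump configuration is to force a kernel panic through the magic SysRq interface, which crashes the node so that kdump captures a vmcore. Because this deliberately takes the node down, the sketch below refuses to run unless an explicit guard variable is set; the guard variable name is an invention of this example, not a kernel or kexec-tools setting.

```shell
# WARNING: when armed, this deliberately panics the kernel so that kdump
# can capture a vmcore. Run only on a node you are prepared to crash.
# I_REALLY_WANT_TO_CRASH is a hypothetical safety guard for this sketch.
trigger_test_crash() {
  if [ "${I_REALLY_WANT_TO_CRASH:-no}" != "yes" ]; then
    echo "refusing: set I_REALLY_WANT_TO_CRASH=yes to proceed"
    return 0
  fi
  echo 1 > /proc/sys/kernel/sysrq    # enable all SysRq functions
  echo c > /proc/sysrq-trigger       # force a crash; kdump saves the vmcore
}
```

After the node reboots, the dump should appear under /var/crash, or under whatever target you configured in /etc/kdump.conf.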
Analyzing a core dump
Note
It is recommended to perform vmcore analysis on a separate RHEL system.
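On the analysis host, the crash utility is typically run against the vmcore together with the matching debug vmlinux. The sketch below assumes the usual RHEL debuginfo layout under /usr/lib/debug and a vmcore copied off the node; analyze_vmcore is a hypothetical wrapper, and the paths are illustrative.

```shell
# Hypothetical wrapper for starting a crash(8) session on a separate RHEL
# host. Assumes kernel-debuginfo for the crashed node's kernel is installed.
analyze_vmcore() {
  kver=$1      # kernel version of the crashed node
  vmcore=$2    # path to the copied vmcore file
  vmlinux="/usr/lib/debug/lib/modules/${kver}/vmlinux"
  if [ ! -f "$vmlinux" ]; then
    echo "missing debuginfo: $vmlinux (install kernel-debuginfo-${kver})"
    return 1
  fi
  # Inside crash, typical commands are: bt (backtrace), log (kernel ring
  # buffer), ps (processes at crash time), sys (system summary).
  crash "$vmlinux" "$vmcore"
}
```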
Debugging Ignition failures
If a machine cannot be provisioned, Ignition fails and RHCOS boots into the emergency shell. Use the following procedure to get debugging information.

- Run the following command to show which service units failed:

  $ systemctl --failed

- Optional: Run the following command on an individual service unit to find out more information:

  $ journalctl -u <unit>.service