Post-installation troubleshooting and recovery
Important
Two-node OpenShift cluster with fencing is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Use the following sections to help you recover from issues in a two-node OpenShift cluster with fencing.
Manually recovering from a disruption event when automated recovery is unavailable
You might need to perform manual recovery steps if a disruption event prevents fencing from functioning correctly. In this case, you can run commands directly on the control plane nodes to recover the cluster. There are four main recovery scenarios, which should be attempted in the following order:
-
Update fencing secrets: Refresh the Baseboard Management Controller (BMC) credentials if they are incorrect or outdated.
-
Recover from a single-node failure: Restore functionality when only one control plane node is down.
-
Recover from a complete node failure: Restore functionality when both control plane nodes are down.
-
Replace a control plane node that cannot be recovered: Replace the node to restore cluster functionality.
Prerequisites
-
You have administrative access to the control plane nodes.
-
You can connect to the nodes by using SSH.
Note
Do an etcd backup before proceeding to ensure that you can restore the cluster if any issues occur.
Procedure
-
Update the fencing secrets:
-
If the Cluster API is unavailable, update the fencing secret by running the following command on one of the cluster nodes:
$ sudo pcs stonith update <node_name>_redfish username=<user_name> password=<password>
After the Cluster API recovers, or if the Cluster API is already available, update the fencing secret in the cluster to ensure that it stays in sync, as described in the following step.
-
Edit the username and password for the existing fencing secret for the control plane node by running the following commands:
$ oc project openshift-etcd
$ oc edit secret <node_name>-fencing
If the cluster recovers after updating the fencing secrets, no further action is required. If the issue persists, proceed to the next step.
-
Recover from a single-node failure:
-
Gather initial diagnostics by running the following command:
$ sudo pcs status --full
This command provides a detailed view of the current cluster and resource states. You can use the output to identify issues with fencing or etcd startup.
-
Run the following additional diagnostic commands, if necessary:
Reset the resources on your cluster and instruct Pacemaker to attempt to start them fresh by running the following command:
$ sudo pcs resource cleanup
Review all Pacemaker activity on the node by running the following command:
$ sudo journalctl -u pacemaker
Diagnose etcd resource startup issues by running the following command:
$ sudo journalctl -u pacemaker | grep podman-etcd
-
View the fencing configuration for the node by running the following command:
$ sudo pcs stonith config <node_name>_redfish
If fencing is required but is not functioning, ensure that the Redfish fencing endpoint is accessible and verify that the credentials are correct.
-
If etcd is not starting despite fencing being operational, restore etcd from a backup by running the following commands:
$ sudo cp -r /var/lib/etcd-backup/* /var/lib/etcd/
$ sudo chown -R etcd:etcd /var/lib/etcd
If the recovery is successful, no further action is required. If the issue persists, proceed to the next step.
-
Recover from a complete node failure:
-
Power on both control plane nodes.
Pacemaker starts automatically and begins the recovery operation when it detects both nodes are online. If the recovery does not start as expected, use the diagnostic commands described in the previous step to investigate the issue.
-
Reset the resources on your cluster and instruct Pacemaker to attempt to start them fresh by running the following command:
$ sudo pcs resource cleanup
-
Check resource start order by running the following command:
$ sudo pcs status --full
-
If kubelet fails to start, inspect the Pacemaker and kubelet service journals by running the following commands:
$ sudo journalctl -u pacemaker
$ sudo journalctl -u kubelet
-
Handle out-of-sync etcd.
If one node has a more up-to-date etcd, Pacemaker attempts to fence the lagging node and start it as a learner. If this process stalls, verify the Redfish fencing endpoint and credentials by running the following command:
$ sudo pcs stonith config
If the recovery is successful, no further action is required. If the issue persists, perform manual recovery as described in the next step.
-
If you need to manually recover from an event in which one of the nodes is not recoverable, follow the procedure in "Replacing control plane nodes in a two-node OpenShift cluster with fencing".
When a cluster loses a single node, it enters degraded mode. In this state, Pacemaker automatically unblocks quorum and allows the cluster to temporarily operate on the remaining node.
If both nodes fail, you must restart both nodes to reestablish quorum so that Pacemaker can resume normal cluster operations.
If only one of the two nodes can be restarted, follow the node replacement procedure to manually reestablish quorum on the surviving node.
If manual recovery fails, collect must-gather data and an sos report, and then file a bug.
For information about verifying that both control plane nodes and etcd are operating correctly, see "Verifying etcd health in a two-node OpenShift cluster with fencing".
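The quorum situations described above can be told apart from the output of `sudo pcs quorum status`, the command used in the node replacement procedure. The following is a minimal sketch, not part of the product tooling: `is_quorate` is a hypothetical helper name, and the sketch assumes that the `Quorate:` field appears as it does in the example output in this document.

```shell
# Hypothetical helper: reads `pcs quorum status` output on stdin.
# Returns 0 only if a "Quorate:" field is present and its value is "Yes".
# The "Flags:" line also contains the word "Quorate", so the match is
# restricted to lines whose first field is exactly "Quorate:".
is_quorate() {
  awk '
    $1 == "Quorate:" { found = 1; ok = ($2 == "Yes") }
    END { if (found && ok) exit 0; exit 1 }
  '
}

# Example use on a control plane node (assumes pcs is installed there):
#   if sudo pcs quorum status | is_quorate; then
#     echo "quorum is established"
#   else
#     echo "quorum is lost; review the recovery scenarios"
#   fi
```

The helper only reports state. Whether to unblock quorum or replace a node remains a decision for the procedures in this document.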
Replacing control plane nodes in a two-node OpenShift cluster with fencing
You can replace a failed control plane node in a two-node OpenShift cluster. The replacement node must use the same host name and IP address as the failed node.
Prerequisites
-
You have a functioning survivor control plane node.
-
You have verified that either the machine is not running or the node is not ready.
-
You have access to the cluster as a user with the cluster-admin role.
-
You know the host name and IP address of the failed node.
Note
Do an etcd backup before proceeding to ensure that you can restore the cluster if any issues occur.
Procedure
-
Check the quorum state by running the following command:
$ sudo pcs quorum status
Example output
Quorum information
------------------
Date:             Fri Oct  3 14:15:31 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1.16
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1
Flags:            2Node Quorate WaitForAll

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR master-0 (local)
         2          1         NR master-1
-
If quorum is lost and one control plane node is still running, restore quorum manually on the survivor node by running the following command:
$ sudo pcs quorum unblock
-
If only one node failed, verify that etcd is running on the survivor node by running the following command:
$ sudo pcs resource status etcd
-
If etcd is not running, restart etcd by running the following command:
$ sudo pcs resource cleanup etcd
If etcd still does not start, force it to start manually on the survivor node, skipping fencing, by running the following commands:
Important
Before running these commands, ensure that the node being replaced is inaccessible. Otherwise, you risk etcd corruption.
$ sudo pcs resource debug-stop etcd
$ sudo OCF_RESKEY_CRM_meta_notify_start_resource='etcd' pcs resource debug-start etcd
After recovery, etcd must be running successfully on the survivor node.
-
Delete etcd secrets for the failed node by running the following commands:
$ oc project openshift-etcd
$ oc delete secret etcd-peer-<node_name>
$ oc delete secret etcd-serving-<node_name>
$ oc delete secret etcd-serving-metrics-<node_name>
Note
To replace the failed node, you must delete its etcd secrets first. When etcd is running, it might take some time for the API server to respond to these commands.
-
Delete resources for the failed node:
-
If you have BareMetalHost (BMH) objects, list them to identify the host that you are replacing by running the following command:
$ oc get bmh -n openshift-machine-api
-
Delete the BMH object for the failed node by running the following command:
$ oc delete bmh/<bmh_name> -n openshift-machine-api
-
List the Machine objects to identify the object that maps to the node that you are replacing by running the following command:
$ oc get machines.machine.openshift.io -n openshift-machine-api
-
Get the label with the machine hash value from the Machine object by running the following command:
$ oc get machines.machine.openshift.io/<machine_name> -n openshift-machine-api \
  -o jsonpath='Machine hash label: {.metadata.labels.machine\.openshift\.io/cluster-api-cluster}{"\n"}'
Replace <machine_name> with the name of a Machine object in your cluster. For example, ostest-bfs7w-ctrlplane-0.
You need this label to provision a new Machine object.
-
Delete the Machine object for the failed node by running the following command:
$ oc delete machines.machine.openshift.io/<machine_name>-<failed nodename> -n openshift-machine-api
Note
The node object is deleted automatically after deleting the Machine object.
-
Recreate the failed host by using the same name and IP address:
Important
You must perform this step only if you are using installer-provisioned infrastructure or the Machine API to create the original node. For information about replacing a failed bare-metal control plane node, see "Replacing an unhealthy etcd member on bare metal".
-
Remove the BMH and Machine objects. The machine controller automatically deletes the node object.
-
Provision a new machine by using the following sample configuration:
Example Machine object configuration
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    metal3.io/BareMetalHost: openshift-machine-api/{bmh_name}
  finalizers:
  - machine.machine.openshift.io
  labels:
    machine.openshift.io/cluster-api-cluster: {machine_hash_label}
    machine.openshift.io/cluster-api-machine-role: master
    machine.openshift.io/cluster-api-machine-type: master
  name: {machine_name}
  namespace: openshift-machine-api
spec:
  authoritativeAPI: MachineAPI
  metadata: {}
  providerSpec:
    value:
      apiVersion: baremetal.cluster.k8s.io/v1alpha1
      customDeploy:
        method: install_coreos
      hostSelector: {}
      image:
        checksum: ""
        url: ""
      kind: BareMetalMachineProviderSpec
      metadata:
        creationTimestamp: null
      userData:
        name: master-user-data-managed
-
metadata.annotations.metal3.io/BareMetalHost: Replace {bmh_name} with the name of the BMH object that is associated with the host that you are replacing.
-
labels.machine.openshift.io/cluster-api-cluster: Replace {machine_hash_label} with the label that you fetched from the machine you deleted.
-
metadata.name: Replace {machine_name} with the name of the machine you deleted.
-
Create the new BMH object and the secret to store the BMC credentials by running the following command:
$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: <secret_name>
  namespace: openshift-machine-api
data:
  password: <password>
  username: <username>
type: Opaque
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: {bmh_name}
  namespace: openshift-machine-api
spec:
  automatedCleaningMode: disabled
  bmc:
    address: <redfish_url>/{uuid}
    credentialsName: <name>
    disableCertificateVerification: true
  bootMACAddress: {boot_mac_address}
  bootMode: UEFI
  externallyProvisioned: false
  online: true
  rootDeviceHints:
    deviceName: /dev/disk/by-id/scsi-<serial_number>
  userData:
    name: master-user-data-managed
    namespace: openshift-machine-api
EOF
-
metadata.name: Specify the name of the secret.
-
metadata.name: Replace {bmh_name} with the name of the BMH object that you deleted.
-
bmc.address: Replace {uuid} with the UUID of the node that you created.
-
bmc.credentialsName: Replace <name> with the name of the secret that you created.
-
bootMACAddress: Specify the MAC address of the provisioning network interface. This is the MAC address the node uses to identify itself when communicating with Ironic during provisioning.
-
Verify that the new node has reached the Provisioned state by running the following command:
$ oc get bmh -n openshift-machine-api -o wide
The value of the STATUS column in the output of this command must be Provisioned.
Note
The provisioning process can take 10 to 20 minutes to complete.
-
Verify that both control plane nodes are in the Ready state by running the following command:
$ oc get nodes
The value of the STATUS column in the output of this command must be Ready for both nodes.
-
Apply the detached annotation to the BMH object to prevent the Machine API from managing it by running the following command:
$ oc annotate bmh <bmh_name> -n openshift-machine-api baremetalhost.metal3.io/detached='' --overwrite
-
Rejoin the replacement node to the Pacemaker cluster by running the following commands:
Note
Run the following commands on the survivor control plane node, not the node being replaced.
$ sudo pcs cluster node remove <node_name>
$ sudo pcs cluster node add <node_name> addr=<node_ip> --start --enable
-
Delete stale jobs for the failed node by running the following commands:
$ oc project openshift-etcd
$ oc delete job tnf-auth-job-<node_name>
$ oc delete job tnf-after-setup-job-<node_name>
For information about verifying that both control plane nodes and etcd are operating correctly, see "Verifying etcd health in a two-node OpenShift cluster with fencing".
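The Secret manifest in the procedure above places the BMC user name and password under `data`, and Kubernetes expects values in that field to be base64-encoded. The following is a minimal sketch of one way to produce those values. The function name `b64` is hypothetical, and `admin` is only a placeholder value.

```shell
# Hypothetical helper: base64-encode a single value for a Secret `data` field.
# printf '%s' avoids encoding a trailing newline into the credential, and
# tr strips any line wrapping that base64 might add to long values.
b64() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Example: encode a placeholder user name.
b64 'admin'   # prints YWRtaW4=
echo
```

Alternatively, Kubernetes accepts unencoded values under the `stringData` field of a Secret, which avoids the manual encoding step.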
Verifying etcd health in a two-node OpenShift cluster with fencing
After completing node recovery or maintenance procedures, verify that both control plane nodes and etcd are operating correctly.
Prerequisites
-
You have access to the cluster as a user with cluster-admin privileges.
-
You can access at least one control plane node through SSH.
Procedure
-
Check the overall node status by running the following command:
$ oc get nodes
This command verifies that both control plane nodes are in the Ready state, indicating that they can receive workloads for scheduling.
-
Verify the status of the cluster-etcd-operator by running the following command:
$ oc describe co/etcd
The cluster-etcd-operator manages and reports on the health of your etcd setup. Reviewing its status helps you identify any ongoing issues or degraded conditions.
-
Review the etcd member list by running the following command:
$ oc rsh -n openshift-etcd <etcd_pod> etcdctl member list -w table
This command shows the current etcd members and their roles. Look for any nodes marked as learner, which indicates that they are in the process of becoming voting members.
-
Review the Pacemaker resource status by running the following command on either control plane node:
$ sudo pcs status --full
This command provides a detailed overview of all resources managed by Pacemaker. Ensure that the following conditions are met:
-
Both nodes are online.
-
The kubelet and etcd resources are running.
-
Fencing is correctly configured for both nodes.
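Two of the checks above can be scripted. The following is a minimal sketch under stated assumptions, not a supported tool: the helper names `all_nodes_ready` and `learner_members` are hypothetical, the first assumes the default `oc get nodes --no-headers` layout with STATUS in the second column, and the second assumes the `etcdctl member list -w table` layout with NAME in the third column and IS LEARNER in the sixth.

```shell
# Hypothetical helper: reads `oc get nodes --no-headers` output on stdin.
# Returns 0 only if at least one node is listed and every STATUS value is
# exactly "Ready". Any other status is printed and causes a non-zero return.
all_nodes_ready() {
  awk '
    NF >= 2 {
      total++
      if ($2 != "Ready") { bad++; print "not ready: " $1 " (" $2 ")" }
    }
    END { if (total >= 1 && bad == 0) exit 0; exit 1 }
  '
}

# Hypothetical helper: reads `etcdctl member list -w table` output on stdin
# and prints the NAME of each member whose IS LEARNER column is "true".
# With "|" as the separator, field 1 is empty, so NAME is field 4 and
# IS LEARNER is field 7. Border lines contain no "|" and are skipped.
learner_members() {
  awk -F'|' '
    NF >= 7 {
      name = $4; learner = $7
      gsub(/ /, "", name); gsub(/ /, "", learner)
      if (learner == "true") print name
    }
  '
}

# Example use (assumes oc is logged in with cluster-admin privileges):
#   oc get nodes --no-headers | all_nodes_ready && echo "all nodes Ready"
#   oc rsh -n openshift-etcd <etcd_pod> etcdctl member list -w table | learner_members
```

Matching STATUS exactly means that a node reporting `Ready,SchedulingDisabled` is flagged, which is a deliberately strict choice for a post-recovery check. An empty result from `learner_members` suggests that no member is still a learner.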