About disaster recovery

The disaster recovery documentation provides information for administrators on how to recover from several disaster situations that might occur with their OpenShift Container Platform cluster. As an administrator, you might need to follow one or more of the following procedures to return your cluster to a working state.

Important

Disaster recovery requires you to have at least one healthy control plane host.

Quorum restoration: This solution handles situations where you have lost the majority of your control plane hosts, leading to etcd quorum loss and the cluster going offline. This solution does not require an etcd backup.

Note

If you have a majority of your control plane nodes still available and have an etcd quorum, then replace a single unhealthy etcd member.
Restoring to a previous cluster state: This solution handles situations where you want to restore your cluster to a previous state, for example, if an administrator deletes something critical. If you have taken an etcd backup, you can restore your cluster to a previous state.

If applicable, you might also need to recover from expired control plane certificates.

Warning

Restoring to a previous cluster state is a destructive and destablizing action to take on a running cluster. This procedure should only be used as a last resort.

Prior to performing a restore, see About restoring cluster state for more information on the impact to the cluster.
Recovering from expired control plane certificates: This solution handles situations where your control plane certificates have expired. For example, if you shut down your cluster before the first certificate rotation, which occurs 24 hours after installation, your certificates will not be rotated and will expire. You can follow this procedure to recover from expired control plane certificates.

Testing restore procedures

Testing the restore procedure is important to ensure that your automation and workload handle the new cluster state gracefully. Due to the complex nature of etcd quorum and the etcd Operator attempting to mend automatically, it is often difficult to correctly bring your cluster into a broken enough state that it can be restored.

Warning

You must have SSH access to the cluster. Your cluster might be entirely lost without SSH access.

Prerequisites

You have SSH access to control plane hosts.
You have installed the OpenShift CLI (oc).

Procedure

Use SSH to connect to each of your nonrecovery nodes and run the following commands to disable etcd and the kubelet service:
1. Disable etcd by running the following command:
  $ sudo /usr/local/bin/disable-etcd.sh
2. Delete variable data for etcd by running the following command:
  $ sudo rm -rf /var/lib/etcd
3. Disable the kubelet service by running the following command:
  $ sudo systemctl disable kubelet.service
Exit every SSH session.
Run the following command to ensure that your nonrecovery nodes are in a NOT READY state:
```
$ oc get nodes
```
Follow the steps in "Restoring to a previous cluster state" to restore your cluster.
After you restore the cluster and the API responds, use SSH to connect to each nonrecovery node and enable the kubelet service:
```
$ sudo systemctl enable kubelet.service
```
Exit every SSH session.
Run the following command to observe your nodes coming back into the READY state:
```
$ oc get nodes
```
Run the following command to verify that etcd is available:
```
$ oc get pods -n openshift-etcd
```

Additional resources

Restoring to a previous cluster state