Performing an image-based upgrade for single-node OpenShift clusters with the Lifecycle Agent
You can use the Lifecycle Agent to do a manual image-based upgrade of a single-node OpenShift cluster.
When you deploy the Lifecycle Agent on a cluster, an ImageBasedUpgrade CR is automatically created.
You update this CR to specify the image repository of the seed image and to move through the different stages.
Moving to the Prep stage of the image-based upgrade with Lifecycle Agent
When you deploy the Lifecycle Agent on a cluster, an ImageBasedUpgrade custom resource (CR) is automatically created.
After you have created all the resources that you need during the upgrade, you can move on to the Prep stage.
For more information, see the "Creating ConfigMap objects for the image-based upgrade with Lifecycle Agent" section.
Note
In a disconnected environment, if the seed cluster’s release image registry is different from the target cluster’s release image registry, you must create an ImageDigestMirrorSet (IDMS) resource to configure alternative mirrored repository locations. For more information, see "Configuring image registry repository mirroring".
You can retrieve the release registry used in the seed image by running the following command:
$ skopeo inspect docker://<imagename> | jq -r '.Labels."com.openshift.lifecycle-agent.seed_cluster_info" | fromjson | .release_registry'
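For example, the following is a minimal ImageDigestMirrorSet sketch. The resource name, the registry placeholders, and the release repository path are illustrative only; substitute the registry returned by the command above and a mirror that the target cluster can reach:

apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: seed-release-mirror
spec:
  imageDigestMirrors:
  - source: <seed_release_registry>/ocp/release
    mirrors:
    - <target_mirror_registry>/ocp/release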
- You have created resources to back up and restore your clusters.
- Check that you have patched your ImageBasedUpgrade CR:

  apiVersion: lca.openshift.io/v1
  kind: ImageBasedUpgrade
  metadata:
    name: upgrade
  spec:
    stage: Idle
    seedImageRef:
      version: 4.15.2
      image: <seed_container_image>
      pullSecretRef: <seed_pull_secret>
    autoRollbackOnFailure: {}
    # initMonitorTimeoutSeconds: 1800
    extraManifests:
    - name: example-extra-manifests-cm
      namespace: openshift-lifecycle-agent
    - name: example-catalogsources-cm
      namespace: openshift-lifecycle-agent
    oadpContent:
    - name: oadp-cm-example
      namespace: openshift-adp

  - version: Target platform version. The value must match the version of the seed image.
  - image: Repository where the target cluster can pull the seed image from.
  - pullSecretRef: Reference to a secret with credentials to pull container images if the images are in a private registry.
  - initMonitorTimeoutSeconds: Optional. Time frame in seconds to roll back if the upgrade does not complete within that time frame after the first reboot. If not defined or set to 0, the default value of 1800 seconds (30 minutes) is used.
  - extraManifests: Optional. List of ConfigMap resources that contain your custom catalog sources to retain after the upgrade and your extra manifests to apply to the target cluster that are not part of the seed image.
  - oadpContent: List of ConfigMap resources that contain the OADP Backup and Restore CRs.
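Before moving on, you can optionally confirm that the referenced ConfigMap objects exist. The following sketch uses the example names and namespaces from the CR above; adjust them to your own resources:

  $ oc get configmap -n openshift-lifecycle-agent example-extra-manifests-cm example-catalogsources-cm
  $ oc get configmap -n openshift-adp oadp-cm-example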
- To start the Prep stage, change the value of the stage field to Prep in the ImageBasedUpgrade CR by running the following command:

  $ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"stage": "Prep"}}' --type=merge -n openshift-lifecycle-agent

  If you provide ConfigMap objects for OADP resources and extra manifests, Lifecycle Agent validates the specified ConfigMap objects during the Prep stage. You might encounter the following issues:

  - Validation warnings or errors if the Lifecycle Agent detects any issues with the extraManifests parameters.
  - Validation errors if the Lifecycle Agent detects any issues with the oadpContent parameters.

  Validation warnings do not block the Upgrade stage, but you must decide if it is safe to proceed with the upgrade. These warnings, for example missing CRDs, namespaces, or dry run failures, update the status.conditions for the Prep stage and annotation fields in the ImageBasedUpgrade CR with details about the warning.

  Example validation warning

  # ...
  metadata:
    annotations:
      extra-manifest.lca.openshift.io/validation-warning: '...'
  # ...

  However, validation errors, such as adding MachineConfig or Operator manifests to extra manifests, cause the Prep stage to fail and block the Upgrade stage.

  When the validations pass, the cluster creates a new ostree stateroot, which involves pulling and unpacking the seed image, and running host-level commands. Finally, all the required images are precached on the target cluster.
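  If the Prep stage reports validation warnings, you can read the recorded details from the CR annotations. The following is a minimal sketch, assuming the default CR name upgrade used in the earlier examples:

  $ oc get ibu upgrade -o jsonpath='{.metadata.annotations}'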
- Check the status of the ImageBasedUpgrade CR by running the following command:

  $ oc get ibu -o yaml

  Example output

  conditions:
  - lastTransitionTime: "2024-01-01T09:00:00Z"
    message: In progress
    observedGeneration: 13
    reason: InProgress
    status: "False"
    type: Idle
  - lastTransitionTime: "2024-01-01T09:00:00Z"
    message: Prep completed
    observedGeneration: 13
    reason: Completed
    status: "False"
    type: PrepInProgress
  - lastTransitionTime: "2024-01-01T09:00:00Z"
    message: Prep stage completed successfully
    observedGeneration: 13
    reason: Completed
    status: "True"
    type: PrepCompleted
  observedGeneration: 13
  validNextStages:
  - Idle
  - Upgrade
Moving to the Upgrade stage of the image-based upgrade with Lifecycle Agent
After you generate the seed image and complete the Prep stage, you can upgrade the target cluster.
During the upgrade process, the OADP Operator creates a backup of the artifacts specified in the OADP custom resources (CRs), then the Lifecycle Agent upgrades the cluster.
If the upgrade fails or stops, an automatic rollback is initiated. If you have an issue after the upgrade, you can initiate a manual rollback. For more information about manual rollback, see "Moving to the Rollback stage of the image-based upgrade with Lifecycle Agent".
- You have completed the Prep stage.
- To move to the Upgrade stage, change the value of the stage field to Upgrade in the ImageBasedUpgrade CR by running the following command:

  $ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"stage": "Upgrade"}}' --type=merge

- Check the status of the ImageBasedUpgrade CR by running the following command:

  $ oc get ibu -o yaml

  Example output

  status:
    conditions:
    - lastTransitionTime: "2024-01-01T09:00:00Z"
      message: In progress
      observedGeneration: 5
      reason: InProgress
      status: "False"
      type: Idle
    - lastTransitionTime: "2024-01-01T09:00:00Z"
      message: Prep completed
      observedGeneration: 5
      reason: Completed
      status: "False"
      type: PrepInProgress
    - lastTransitionTime: "2024-01-01T09:00:00Z"
      message: Prep completed successfully
      observedGeneration: 5
      reason: Completed
      status: "True"
      type: PrepCompleted
    - lastTransitionTime: "2024-01-01T09:00:00Z"
      message: |-
        Waiting for system to stabilize: one or more health checks failed
          - one or more ClusterOperators not yet ready: authentication
          - one or more MachineConfigPools not yet ready: master
          - one or more ClusterServiceVersions not yet ready: sriov-fec.v2.8.0
      observedGeneration: 1
      reason: InProgress
      status: "True"
      type: UpgradeInProgress
    observedGeneration: 1
    rollbackAvailabilityExpiration: "2024-05-19T14:01:52Z"
    validNextStages:
    - Rollback

  The OADP Operator creates a backup of the data specified in the OADP Backup and Restore CRs and the target cluster reboots.
- Monitor the status of the CR by running the following command:

  $ oc get ibu -o yaml
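  As an alternative to repeatedly inspecting the YAML output, the following sketch blocks until the upgrade reports completion. It assumes the default CR name upgrade and the UpgradeCompleted condition shown in the example output; because the API server is briefly unavailable while the cluster reboots, you might need to rerun the command:

  $ oc wait ibu upgrade --for=condition=UpgradeCompleted --timeout=60m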
- If you are satisfied with the upgrade, finalize the changes by patching the value of the stage field to Idle in the ImageBasedUpgrade CR by running the following command:

  $ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"stage": "Idle"}}' --type=merge

  Important

  You cannot roll back the changes once you move to the Idle stage after an upgrade.

  The Lifecycle Agent deletes all resources created during the upgrade process.
- You can remove the OADP Operator and its configuration files after a successful upgrade. For more information, see "Deleting Operators from a cluster".
- Check the status of the ImageBasedUpgrade CR by running the following command:

  $ oc get ibu -o yaml

  Example output

  status:
    conditions:
    - lastTransitionTime: "2024-01-01T09:00:00Z"
      message: In progress
      observedGeneration: 5
      reason: InProgress
      status: "False"
      type: Idle
    - lastTransitionTime: "2024-01-01T09:00:00Z"
      message: Prep completed
      observedGeneration: 5
      reason: Completed
      status: "False"
      type: PrepInProgress
    - lastTransitionTime: "2024-01-01T09:00:00Z"
      message: Prep completed successfully
      observedGeneration: 5
      reason: Completed
      status: "True"
      type: PrepCompleted
    - lastTransitionTime: "2024-01-01T09:00:00Z"
      message: Upgrade completed
      observedGeneration: 1
      reason: Completed
      status: "False"
      type: UpgradeInProgress
    - lastTransitionTime: "2024-01-01T09:00:00Z"
      message: Upgrade completed
      observedGeneration: 1
      reason: Completed
      status: "True"
      type: UpgradeCompleted
    observedGeneration: 1
    rollbackAvailabilityExpiration: "2024-01-01T09:00:00Z"
    validNextStages:
    - Idle
    - Rollback

- Check the status of the cluster restoration by running the following command:

  $ oc get restores -n openshift-adp -o custom-columns=NAME:.metadata.name,Status:.status.phase,Reason:.status.failureReason

  Example output

  NAME             Status      Reason
  acm-klusterlet   Completed   <none>
  apache-app       Completed   <none>
  localvolume      Completed   <none>

  The acm-klusterlet restore is specific to RHACM environments only.
Moving to the Rollback stage of the image-based upgrade with Lifecycle Agent
An automatic rollback is initiated if the upgrade does not complete within the time frame specified in the initMonitorTimeoutSeconds field after rebooting.
apiVersion: lca.openshift.io/v1
kind: ImageBasedUpgrade
metadata:
name: upgrade
spec:
stage: Idle
seedImageRef:
version: 4.15.2
image: <seed_container_image>
autoRollbackOnFailure: {}
# initMonitorTimeoutSeconds: 1800
# ...
- initMonitorTimeoutSeconds: Optional. The time frame in seconds to roll back if the upgrade does not complete within that time frame after the first reboot. If not defined or set to 0, the default value of 1800 seconds (30 minutes) is used.
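If you want a different rollback window, you can set this field before starting the upgrade. The following is a minimal sketch, assuming the default CR name upgrade, that initMonitorTimeoutSeconds is nested under autoRollbackOnFailure as shown in the example above, and an illustrative 60-minute value:

$ oc patch imagebasedupgrades.lca.openshift.io upgrade --type=merge \
  -p='{"spec": {"autoRollbackOnFailure": {"initMonitorTimeoutSeconds": 3600}}}'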
You can manually roll back the changes if you encounter unresolvable issues after an upgrade.
- You have logged in to the hub cluster as a user with cluster-admin privileges.
- You have ensured that the control plane certificates on the original stateroot are valid. If the certificates expired, see "Recovering from expired control plane certificates".
Warning
If you choose to upgrade a recently installed single-node OpenShift cluster, for example for testing purposes, you have a limited rollback time frame of 24 hours or less. You can verify the rollback time by checking the rollbackAvailabilityExpiration field of the ImageBasedUpgrade custom resource.
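For example, you can read this field directly; the following is a minimal sketch, assuming the default CR name upgrade:

$ oc get ibu upgrade -o jsonpath='{.status.rollbackAvailabilityExpiration}'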
- To move to the Rollback stage, patch the value of the stage field to Rollback in the ImageBasedUpgrade CR by running the following command:

  $ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"stage": "Rollback"}}' --type=merge

  The Lifecycle Agent reboots the cluster with the previously installed version of OpenShift Container Platform and restores the applications.
- If you are satisfied with the changes, finalize the rollback by patching the value of the stage field to Idle in the ImageBasedUpgrade CR by running the following command:

  $ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"stage": "Idle"}}' --type=merge -n openshift-lifecycle-agent

  Warning

  If you move to the Idle stage after a rollback, the Lifecycle Agent cleans up resources that can be used to troubleshoot a failed upgrade.
Troubleshooting image-based upgrades with Lifecycle Agent
Perform troubleshooting steps on the managed clusters that are affected by an issue.
Important
If you are using the ImageBasedGroupUpgrade CR to upgrade your clusters, ensure that the lcm.openshift.io/ibgu-<stage>-completed or lcm.openshift.io/ibgu-<stage>-failed cluster labels are updated properly after performing troubleshooting or recovery steps on the managed clusters.
This ensures that TALM continues to manage the image-based upgrade for the cluster.
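For example, the following sketch marks the Prep stage as completed for one managed cluster. The ManagedCluster resource on the hub as the label target, the empty label value, and the cluster name placeholder are assumptions for illustration:

$ oc label managedcluster <managed_cluster_name> lcm.openshift.io/ibgu-prep-completed=""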
Collecting logs
You can use the oc adm must-gather CLI to collect information for debugging and troubleshooting.
- Collect data about the Operators by running the following command:

  $ oc adm must-gather \
    --dest-dir=must-gather/tmp \
    --image=$(oc -n openshift-lifecycle-agent get deployment.apps/lifecycle-agent-controller-manager -o jsonpath='{.spec.template.spec.containers[?(@.name == "manager")].image}') \
    --image=quay.io/konveyor/oadp-must-gather:latest \
    --image=quay.io/openshift/origin-must-gather:latest

  - Optional: Add this option if you need to gather more information from the OADP Operator.
  - Optional: Add this option if you need to gather more information from the SR-IOV Operator.
AbortFailed or FinalizeFailed error
- Issue
  - During the finalize stage or when you stop the process at the Prep stage, Lifecycle Agent cleans up the following resources:

    - Stateroot that is no longer required
    - Precaching resources
    - OADP CRs
    - ImageBasedUpgrade CR

    If the Lifecycle Agent fails to perform the above steps, it transitions to the AbortFailed or FinalizeFailed states. The condition message and log show which steps failed.

    Example error message

    message: failed to delete all the backup CRs. Perform cleanup manually then add 'lca.openshift.io/manual-cleanup-done' annotation to ibu CR to transition back to Idle
    observedGeneration: 5
    reason: AbortFailed
    status: "False"
    type: Idle
- Resolution
  - Inspect the logs to determine why the failure occurred.
  - To prompt Lifecycle Agent to retry the cleanup, add the lca.openshift.io/manual-cleanup-done annotation to the ImageBasedUpgrade CR.

    After observing this annotation, Lifecycle Agent retries the cleanup and, if it is successful, the ImageBasedUpgrade stage transitions to Idle.

    If the cleanup fails again, you can manually clean up the resources.
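    For example, the following is a minimal sketch that adds the annotation, assuming the default CR name upgrade; the empty annotation value is shown for illustration only:

    $ oc annotate ibu upgrade lca.openshift.io/manual-cleanup-done=""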
Cleaning up stateroot manually
- Issue
  - If you stop the process at the Prep stage, Lifecycle Agent cleans up the new stateroot. When finalizing after a successful upgrade or a rollback, Lifecycle Agent cleans up the old stateroot. If this step fails, inspect the logs to determine why the failure occurred.

- Resolution
  - Check if there are any existing deployments in the stateroot by running the following command:

    $ ostree admin status

  - If there are any, clean up the existing deployment by running the following command:

    $ ostree admin undeploy <index_of_deployment>

  - After cleaning up all the deployments of the stateroot, wipe the stateroot directory by running the following commands:

    Warning

    Ensure that the booted deployment is not in this stateroot.

    $ stateroot="<stateroot_to_delete>"
    $ unshare -m /bin/sh -c "mount -o remount,rw /sysroot && rm -rf /sysroot/ostree/deploy/${stateroot}"
Cleaning up OADP resources manually
- Issue
  - Automatic cleanup of OADP resources can fail due to connection issues between Lifecycle Agent and the S3 backend. By restoring the connection and adding the lca.openshift.io/manual-cleanup-done annotation, the Lifecycle Agent can successfully clean up backup resources.

- Resolution
  - Check the backend connectivity by running the following command:

    $ oc get backupstoragelocations.velero.io -n openshift-adp

    Example output

    NAME                          PHASE       LAST VALIDATED   AGE   DEFAULT
    dataprotectionapplication-1   Available   33s              8d    true

  - Remove all backup resources and then add the lca.openshift.io/manual-cleanup-done annotation to the ImageBasedUpgrade CR.
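    For example, the following is a minimal sketch of the backup cleanup, assuming the Backup CRs are in the default openshift-adp namespace; after the deletion completes, add the lca.openshift.io/manual-cleanup-done annotation as described in the previous section:

    $ oc delete backups.velero.io -n openshift-adp --all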
LVM Storage volume contents not restored
When LVM Storage is used to provide dynamic persistent volume storage, LVM Storage might not restore the persistent volume contents if it is configured incorrectly.
Missing LVM Storage-related fields in Backup CR
- Issue
  - Your Backup CRs might be missing fields that are needed to restore your persistent volumes. You can check for events in your application pod to determine if you have this issue by running the following command:

    $ oc describe pod <your_app_name>

    Example output showing missing LVM Storage-related fields in Backup CR

    Events:
      Type     Reason            Age                From               Message
      ----     ------            ----               ----               -------
      Warning  FailedScheduling  58s (x2 over 66s)  default-scheduler  0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
      Normal   Scheduled         56s                default-scheduler  Successfully assigned default/db-1234 to sno1.example.lab
      Warning  FailedMount       24s (x7 over 55s)  kubelet            MountVolume.SetUp failed for volume "pvc-1234" : rpc error: code = Unknown desc = VolumeID is not found

- Resolution
  - You must include logicalvolumes.topolvm.io in the application Backup CR. Without this resource, the application restores its persistent volume claims and persistent volume manifests correctly; however, the logicalvolume associated with this persistent volume is not restored properly after pivot.

    Example Backup CR

    apiVersion: velero.io/v1
    kind: Backup
    metadata:
      labels:
        velero.io/storage-location: default
      name: small-app
      namespace: openshift-adp
    spec:
      includedNamespaces:
      - test
      includedNamespaceScopedResources:
      - secrets
      - persistentvolumeclaims
      - deployments
      - statefulsets
      includedClusterScopedResources:
      - persistentVolumes
      - volumesnapshotcontents
      - logicalvolumes.topolvm.io

    To restore the persistent volumes for your application, you must configure the includedClusterScopedResources section as shown.
Missing LVM Storage-related fields in Restore CR
- Issue
  - The expected resources for the applications are restored but the persistent volume contents are not preserved after upgrading.

    - List the persistent volumes for your applications by running the following command before pivot:

      $ oc get pv,pvc,logicalvolumes.topolvm.io -A

      Example output before pivot

      NAME                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
      persistentvolume/pvc-1234   1Gi        RWO            Retain           Bound    default/pvc-db   lvms-vg1                4h45m

      NAMESPACE   NAME                           STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      default     persistentvolumeclaim/pvc-db   Bound    pvc-1234   1Gi        RWO            lvms-vg1       4h45m

      NAMESPACE   NAME                                AGE
                  logicalvolume.topolvm.io/pvc-1234   4h45m

    - List the persistent volumes for your applications by running the following command after pivot:

      $ oc get pv,pvc,logicalvolumes.topolvm.io -A

      Example output after pivot

      NAME                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
      persistentvolume/pvc-1234   1Gi        RWO            Delete           Bound    default/pvc-db   lvms-vg1                19s

      NAMESPACE   NAME                           STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      default     persistentvolumeclaim/pvc-db   Bound    pvc-1234   1Gi        RWO            lvms-vg1       19s

      NAMESPACE   NAME                                AGE
                  logicalvolume.topolvm.io/pvc-1234   18s
- Resolution
  - The reason for this issue is that the logicalvolume status is not preserved in the Restore CR. This status is important because it is required for Velero to reference the volumes that must be preserved after pivoting. You must include the following fields in the application Restore CR:

    Example Restore CR

    apiVersion: velero.io/v1
    kind: Restore
    metadata:
      name: sample-vote-app
      namespace: openshift-adp
      labels:
        velero.io/storage-location: default
      annotations:
        lca.openshift.io/apply-wave: "3"
    spec:
      backupName: sample-vote-app
      restorePVs: true
      restoreStatus:
        includedResources:
        - logicalvolumes

    To preserve the persistent volumes for your application, you must set restorePVs to true and configure the restoreStatus section as shown.
Debugging failed Backup and Restore CRs
- Issue
  - The backup or restoration of artifacts failed.
- Resolution
  - You can debug Backup and Restore CRs and retrieve logs with the Velero CLI tool. The Velero CLI tool provides more detailed information than the OpenShift CLI tool.

    - Describe the Backup CR that contains errors by running the following command:

      $ oc exec -n openshift-adp velero-7c87d58c7b-sw6fc -c velero -- ./velero describe backup -n openshift-adp backup-acm-klusterlet --details

    - Describe the Restore CR that contains errors by running the following command:

      $ oc exec -n openshift-adp velero-7c87d58c7b-sw6fc -c velero -- ./velero describe restore -n openshift-adp restore-acm-klusterlet --details

    - Download the backed up resources to a local directory by running the following command:

      $ oc exec -n openshift-adp velero-7c87d58c7b-sw6fc -c velero -- ./velero backup download -n openshift-adp backup-acm-klusterlet -o ~/backup-acm-klusterlet.tar.gz
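      To inspect the downloaded archive locally, the following is a minimal sketch that assumes the output path used in the previous command:

      $ mkdir -p ~/backup-acm-klusterlet
      $ tar -xzf ~/backup-acm-klusterlet.tar.gz -C ~/backup-acm-klusterlet
      $ ls -R ~/backup-acm-klusterlet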