OADP monitoring

By using the OpenShift Container Platform monitoring stack, users and administrators can effectively perform the following tasks:

Monitor and manage clusters
Analyze the workload performance of user applications
Monitor services running on the clusters
Receive alerts if an event occurs

Additional resources

About OpenShift Container Platform monitoring

OADP monitoring setup

The OADP Operator uses OpenShift user workload monitoring, which is provided by the OpenShift monitoring stack, to retrieve metrics from the Velero service endpoint. With the monitoring stack, you can create user-defined alerting rules or query metrics by using the OpenShift metrics query front end.

With user workload monitoring enabled, you can configure and use any Prometheus-compatible third-party UI, such as Grafana, to visualize Velero metrics.

Monitoring metrics requires enabling monitoring for user-defined projects and creating a ServiceMonitor resource to scrape the metrics from the OADP service endpoint, which is already enabled and resides in the openshift-adp namespace.

Note

The OADP support for Prometheus metrics is offered on a best-effort basis and is not fully supported.

For more information about setting up the monitoring stack, see Configuring user workload monitoring.

Prerequisites

You have access to an OpenShift Container Platform cluster using an account with cluster-admin permissions.
You have created a cluster monitoring config map.

Procedure

Edit the cluster-monitoring-config ConfigMap object in the openshift-monitoring namespace by using the following command:
```
$ oc edit configmap cluster-monitoring-config -n openshift-monitoring
```
Add or enable the enableUserWorkload option in the config.yaml field of the data section, as shown in the following example:
```
apiVersion: v1
kind: ConfigMap
data:
  config.yaml: |
    enableUserWorkload: true
metadata:
# ...
```
Set enableUserWorkload to true, adding the option if it is not already present.

After a short wait, verify the user workload monitoring setup by checking that the following components are up and running in the openshift-user-workload-monitoring namespace:

$ oc get pods -n openshift-user-workload-monitoring

Example output

NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-6844b4b99c-b57j9   2/2     Running   0          43s
prometheus-user-workload-0             5/5     Running   0          32s
prometheus-user-workload-1             5/5     Running   0          32s
thanos-ruler-user-workload-0           3/3     Running   0          32s
thanos-ruler-user-workload-1           3/3     Running   0          32s

Check whether the user-workload-monitoring-config ConfigMap exists in the openshift-user-workload-monitoring namespace. If it exists, skip the remaining steps in this procedure.
```
$ oc get configmap user-workload-monitoring-config -n openshift-user-workload-monitoring
```
Example output
```
Error from server (NotFound): configmaps "user-workload-monitoring-config" not found
```
Create a user-workload-monitoring-config ConfigMap object for user workload monitoring, and save it in a file named 2_configure_user_workload_monitoring.yaml:
Example ConfigMap object
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
```

Apply the 2_configure_user_workload_monitoring.yaml file by using the following command:

$ oc apply -f 2_configure_user_workload_monitoring.yaml

Example output

configmap/user-workload-monitoring-config created
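
The ConfigMap in this procedure leaves the config.yaml field empty, which keeps the default settings. As a minimal sketch, assuming you want to limit how long the user workload Prometheus instance keeps the scraped Velero metrics, a retention period could be added to the same object. The prometheus.retention key and the 24h value are assumptions chosen for illustration and are not part of the procedure; confirm them against the monitoring documentation for your cluster version before applying.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      # Assumed example value: keep user workload metrics for 24 hours.
      retention: 24h
```

Because the setting lives in the same ConfigMap, it would be applied with the same oc apply command used in the last step of the procedure.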

Creating OADP service monitor

OADP provides an openshift-adp-velero-metrics-svc service, which is created when the Data Protection Application (DPA) is configured. The user workload monitoring service monitor must point to the defined service. To get details about the service, complete the following steps.

Procedure

Ensure that the openshift-adp-velero-metrics-svc service exists. It should have the app.kubernetes.io/name=velero label, which is used as the selector for the ServiceMonitor object.

$ oc get svc -n openshift-adp -l app.kubernetes.io/name=velero

Example output

NAME                               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
openshift-adp-velero-metrics-svc   ClusterIP   172.30.38.244   <none>        8085/TCP   1h

Create a ServiceMonitor YAML file that matches the existing service label, and save the file as 3_create_oadp_service_monitor.yaml. The service monitor is created in the openshift-adp namespace where the openshift-adp-velero-metrics-svc service resides.

Example ServiceMonitor object

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: oadp-service-monitor
  name: oadp-service-monitor
  namespace: openshift-adp
spec:
  endpoints:
  - interval: 30s
    path: /metrics
    targetPort: 8085
    scheme: http
  selector:
    matchLabels:
      app.kubernetes.io/name: "velero"

Apply the 3_create_oadp_service_monitor.yaml file:

$ oc apply -f 3_create_oadp_service_monitor.yaml

Example output

servicemonitor.monitoring.coreos.com/oadp-service-monitor created

Verification

Confirm that the new service monitor is in an Up state by using the Administrator perspective of the OpenShift Container Platform web console. Wait a few minutes for the service monitor to reach the Up state.
1. Navigate to the Observe → Targets page.
2. Ensure that the Filter is unselected or that the User source is selected, and then type openshift-adp in the Text search field.
3. Verify that the Status for the service monitor is Up.
  
  Figure 1. OADP metrics targets

Creating an alerting rule

The OpenShift Container Platform monitoring stack raises alerts that are defined by alerting rules. To create an alerting rule for the OADP project, use one of the metrics that user workload monitoring scrapes.

Procedure

Create a PrometheusRule YAML file with the sample OADPBackupFailing alert and save it as 4_create_oadp_alert_rule.yaml:

Sample OADPBackupFailing alert

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-oadp-alert
  namespace: openshift-adp
spec:
  groups:
  - name: sample-oadp-backup-alert
    rules:
    - alert: OADPBackupFailing
      annotations:
        description: 'OADP had {{$value | humanize}} backup failures over the last 2 hours.'
        summary: OADP has issues creating backups
      expr: |
        increase(velero_backup_failure_total{job="openshift-adp-velero-metrics-svc"}[2h]) > 0
      for: 5m
      labels:
        severity: warning

In this sample, the alert behaves as follows:

The alert condition is met when the number of failed backups increased during the last 2 hours and that condition has persisted for at least 5 minutes.
While the condition has held for less than 5 minutes, the alert is in a Pending state. After 5 minutes, it moves to a Firing state.

Apply the 4_create_oadp_alert_rule.yaml file, which creates the PrometheusRule object in the openshift-adp namespace:
```
$ oc apply -f 4_create_oadp_alert_rule.yaml
```
Example output
```
prometheusrule.monitoring.coreos.com/sample-oadp-alert created
```

Verification

After the alert is triggered, you can view it in the following ways:
- In the Developer perspective, select the Observe menu.
- In the Administrator perspective, go to Observe → Alerting and select User in the Filter box. By default, only Platform alerts are displayed.
  
  Figure 2. OADP backup failing alert
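
The same PrometheusRule pattern can be reused with other Velero metrics. The following is a sketch only, not an OADP-provided rule set: the object name, group name, alert names, and the 24-hour threshold are hypothetical, and the job label is assumed to match the one used in the OADPBackupFailing sample. The second rule compares the current time with velero_backup_last_successful_timestamp; a series that has never recorded a successful backup might also match, depending on how the metric is initialized.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-oadp-alert-extra            # hypothetical name
  namespace: openshift-adp
spec:
  groups:
  - name: sample-oadp-backup-alert-extra   # hypothetical group name
    rules:
    # Condition: the partially failed backup counter increased in the last 2 hours.
    - alert: OADPBackupPartiallyFailing    # hypothetical alert name
      annotations:
        description: 'OADP had {{$value | humanize}} partially failed backups over the last 2 hours.'
        summary: OADP backups are completing with partial failures
      expr: |
        increase(velero_backup_partial_failure_total{job="openshift-adp-velero-metrics-svc"}[2h]) > 0
      for: 5m
      labels:
        severity: warning
    # Condition: the last successful backup of a series is older than
    # 24 hours (86400 seconds, an assumed threshold).
    - alert: OADPBackupStale               # hypothetical alert name
      annotations:
        summary: No successful OADP backup in the last 24 hours
      expr: |
        time() - velero_backup_last_successful_timestamp{job="openshift-adp-velero-metrics-svc"} > 86400
      for: 5m
      labels:
        severity: warning
```

As with the sample rule, each alert stays in a Pending state until its condition has held for the 5 minutes set in the for field.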

Additional resources

Managing alerts as an Administrator

List of available metrics

The following table lists the Velero metrics provided by OADP, together with their types:

Table 1. Velero metrics
Metric name	Description	Type
`velero_backup_tarball_size_bytes`	Size, in bytes, of a backup	Gauge
`velero_backup_total`	Current number of existing backups	Gauge
`velero_backup_attempt_total`	Total number of attempted backups	Counter
`velero_backup_success_total`	Total number of successful backups	Counter
`velero_backup_partial_failure_total`	Total number of partially failed backups	Counter
`velero_backup_failure_total`	Total number of failed backups	Counter
`velero_backup_validation_failure_total`	Total number of backups that failed validation	Counter
`velero_backup_duration_seconds`	Time taken to complete backup, in seconds	Histogram
`velero_backup_duration_seconds_bucket`	Total count of observations for a bucket in the histogram for the metric `velero_backup_duration_seconds`	Counter
`velero_backup_duration_seconds_count`	Total count of observations for the metric `velero_backup_duration_seconds`	Counter
`velero_backup_duration_seconds_sum`	Total sum of observations for the metric `velero_backup_duration_seconds`	Counter
`velero_backup_deletion_attempt_total`	Total number of attempted backup deletions	Counter
`velero_backup_deletion_success_total`	Total number of successful backup deletions	Counter
`velero_backup_deletion_failure_total`	Total number of failed backup deletions	Counter
`velero_backup_last_successful_timestamp`	Last time a backup ran successfully, Unix timestamp in seconds	Gauge
`velero_backup_items_total`	Total number of items backed up	Gauge
`velero_backup_items_errors`	Total number of errors encountered during backup	Gauge
`velero_backup_warning_total`	Total number of warned backups	Counter
`velero_backup_last_status`	Last status of the backup. A value of 1 is success, 0 is failure	Gauge
`velero_restore_total`	Current number of existing restores	Gauge
`velero_restore_attempt_total`	Total number of attempted restores	Counter
`velero_restore_validation_failed_total`	Total number of restores that failed validation	Counter
`velero_restore_success_total`	Total number of successful restores	Counter
`velero_restore_partial_failure_total`	Total number of partially failed restores	Counter
`velero_restore_failed_total`	Total number of failed restores	Counter
`velero_volume_snapshot_attempt_total`	Total number of attempted volume snapshots	Counter
`velero_volume_snapshot_success_total`	Total number of successful volume snapshots	Counter
`velero_volume_snapshot_failure_total`	Total number of failed volume snapshots	Counter
`velero_csi_snapshot_attempt_total`	Total number of CSI attempted volume snapshots	Counter
`velero_csi_snapshot_success_total`	Total number of CSI successful volume snapshots	Counter
`velero_csi_snapshot_failure_total`	Total number of CSI failed volume snapshots	Counter
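
The metric type determines how a metric is usually queried: counters are typically wrapped in increase() or rate(), and histogram buckets are passed to histogram_quantile(). The following sketch shows both patterns as recording rules. The object name, group name, recorded series names, and the 24-hour window are hypothetical choices for illustration, not something OADP ships.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-oadp-recording-rules        # hypothetical name
  namespace: openshift-adp
spec:
  groups:
  - name: sample-oadp-backup-recording     # hypothetical group name
    rules:
    # Counter pattern: share of attempted backups that succeeded
    # over the last 24 hours.
    - record: oadp:backup_success_ratio:24h
      expr: |
        sum(increase(velero_backup_success_total[24h]))
          /
        sum(increase(velero_backup_attempt_total[24h]))
    # Histogram pattern: estimated 95th percentile of backup duration,
    # in seconds, computed from the _bucket series.
    - record: oadp:backup_duration_seconds:p95_24h
      expr: |
        histogram_quantile(0.95, sum by (le) (rate(velero_backup_duration_seconds_bucket[24h])))
```

If no backups were attempted in the window, the ratio divides by zero and the recorded value is NaN, so a dashboard or alert built on it would need to account for that case.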

Viewing metrics using the Observe UI

You can view metrics in the OpenShift Container Platform web console from the Administrator or Developer perspective. Your user account must have access to the openshift-adp project.

Procedure

Navigate to the Observe → Metrics page:
- If you are using the Developer perspective, follow these steps:
  1. Select Custom query, or click the Show PromQL link.
  2. Type the query and press Enter.
- If you are using the Administrator perspective, type the expression in the text field and select Run Queries.
  
  Figure 3. OADP metrics query