Storing and recording data for core platform monitoring
Store and record your metrics and alerting data, configure logs to specify which activities are recorded, control how long Prometheus retains stored data, and set the maximum amount of disk space for the data. These actions help you protect your data and use it for troubleshooting.
Configuring persistent storage
Run cluster monitoring with persistent storage to gain the following benefits:
- Protect your metrics and alerting data from data loss by storing them in a persistent volume (PV). As a result, they can survive pods being restarted or recreated.
- Avoid getting duplicate notifications and losing silences for alerts when the Alertmanager pods are restarted.
Important
In multi-node clusters, you must configure persistent storage for Prometheus and Alertmanager to ensure high availability.
Note
For production environments, configuring persistent storage is strongly recommended.
Persistent storage prerequisites
- Dedicate sufficient persistent storage to ensure that the disk does not become full.
- Use Filesystem as the storage type value for the volumeMode parameter when you configure the persistent volume.

  Important

  - Do not use a raw block volume, which is described with volumeMode: Block in the PersistentVolume resource. Prometheus cannot use raw block volumes.
  - Prometheus does not support file systems that are not POSIX compliant. For example, some NFS file system implementations are not POSIX compliant. If you want to use an NFS file system for storage, verify with the vendor that their NFS implementation is fully POSIX compliant.
Configuring a persistent volume claim
To use a persistent volume (PV) for monitoring components, you must configure a persistent volume claim (PVC).
Prerequisites

- You have access to the cluster as a user with the cluster-admin cluster role.
- You have created the cluster-monitoring-config ConfigMap object.
- You have installed the OpenShift CLI (oc).
Procedure

- Edit the cluster-monitoring-config config map in the openshift-monitoring project:

  $ oc -n openshift-monitoring edit configmap cluster-monitoring-config

- Add your PVC configuration for the component under data/config.yaml:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      <component>:
        volumeClaimTemplate:
          spec:
            storageClassName: <storage_class>
            resources:
              requests:
                storage: <amount_of_storage>

  - <component>: Specify the monitoring component for which you want to configure the PVC.
  - <storage_class>: Specify an existing storage class. If a storage class is not specified, the default storage class is used.
  - <amount_of_storage>: Specify the amount of required storage.

  The following example configures a PVC that claims persistent storage for Prometheus:

  Example PVC configuration

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      prometheusK8s:
        volumeClaimTemplate:
          spec:
            storageClassName: my-storage-class
            resources:
              requests:
                storage: 40Gi

- Save the file to apply the changes. The pods affected by the new configuration are automatically redeployed and the new storage configuration is applied.
Warning
When you update the config map with a PVC configuration, the affected StatefulSet object is recreated, resulting in a temporary service outage.
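An optional way to confirm the result is to list the claims in the project; the exact PVC names are generated from the volumeClaimTemplate and vary by component and replica:

$ oc -n openshift-monitoring get pvc

Each claim should report a STATUS of Bound after a matching persistent volume is provisioned.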
Resizing a persistent volume
You can resize a persistent volume (PV) for monitoring components, such as Prometheus or Alertmanager. You need to manually expand a persistent volume claim (PVC), and then update the config map in which the component is configured.
Important
You can only expand the size of the PVC. Shrinking the storage size is not possible.
Prerequisites

- You have access to the cluster as a user with the cluster-admin cluster role.
- You have created the cluster-monitoring-config ConfigMap object.
- You have configured at least one PVC for core OpenShift Container Platform monitoring components.
- You have installed the OpenShift CLI (oc).
Procedure

- Manually expand a PVC with the updated storage request. For more information, see "Expanding persistent volume claims (PVCs) with a file system" in Expanding persistent volumes.
- Edit the cluster-monitoring-config config map in the openshift-monitoring project:

  $ oc -n openshift-monitoring edit configmap cluster-monitoring-config

- Add a new storage size for the PVC configuration for the component under data/config.yaml:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      <component>:
        volumeClaimTemplate:
          spec:
            resources:
              requests:
                storage: <amount_of_storage>

  - <component>: The component for which you want to change the storage size.
  - <amount_of_storage>: Specify the new size for the storage volume. It must be greater than the previous value.

  The following example sets the new PVC request to 100 gigabytes for the Prometheus instance:

  Example storage configuration for prometheusK8s

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      prometheusK8s:
        volumeClaimTemplate:
          spec:
            resources:
              requests:
                storage: 100Gi

- Save the file to apply the changes. The pods affected by the new configuration are automatically redeployed.
Warning
When you update the config map with a new storage size, the affected StatefulSet object is recreated, resulting in a temporary service outage.
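For the manual PVC expansion in the first step of this procedure, one approach is to patch the claim directly. The following is a sketch that assumes the generated claim name prometheus-k8s-db-prometheus-k8s-0 (check oc get pvc for the names in your cluster) and a storage class that supports volume expansion:

$ oc -n openshift-monitoring patch pvc prometheus-k8s-db-prometheus-k8s-0 \
  --patch '{"spec": {"resources": {"requests": {"storage": "100Gi"}}}}'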
Modifying retention time and size for Prometheus metrics data
By default, Prometheus retains metrics data for 15 days for core platform monitoring. You can modify the retention time for the Prometheus instance to change when the data is deleted. You can also set the maximum amount of disk space the retained metrics data uses.
Note
Data compaction occurs every two hours. Therefore, a persistent volume (PV) might fill up before compaction, potentially exceeding the retentionSize limit. In such cases, the KubePersistentVolumeFillingUp alert fires until the used space on the PV is lower than the retentionSize limit.
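To see how close a monitoring volume is to filling up before the alert fires, you can query the standard kubelet volume metrics. A minimal PromQL sketch that reports the percentage of used space per claim in the project:

100 * (1 - kubelet_volume_stats_available_bytes{namespace="openshift-monitoring"} / kubelet_volume_stats_capacity_bytes{namespace="openshift-monitoring"})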
Prerequisites

- You have access to the cluster as a user with the cluster-admin cluster role.
- You have created the cluster-monitoring-config ConfigMap object.
- You have installed the OpenShift CLI (oc).
Procedure

- Edit the cluster-monitoring-config config map in the openshift-monitoring project:

  $ oc -n openshift-monitoring edit configmap cluster-monitoring-config

- Add the retention time and size configuration under data/config.yaml:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      prometheusK8s:
        retention: <time_specification>
        retentionSize: <size_specification>

  - <time_specification>: The retention time: a number directly followed by ms (milliseconds), s (seconds), m (minutes), h (hours), d (days), w (weeks), or y (years). You can also combine time values for specific times, such as 1h30m15s.
  - <size_specification>: The retention size: a number directly followed by B (bytes), KB (kilobytes), MB (megabytes), GB (gigabytes), TB (terabytes), PB (petabytes), or EB (exabytes).

  The following example sets the retention time to 24 hours and the retention size to 10 gigabytes for the Prometheus instance:

  Example of setting retention time and size for Prometheus

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      prometheusK8s:
        retention: 24h
        retentionSize: 10GB

- Save the file to apply the changes. The pods affected by the new configuration are automatically redeployed.
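To confirm that the new values reached Prometheus, you can inspect the flags on the generated prometheus-k8s StatefulSet; the retention and retentionSize settings map to the upstream --storage.tsdb.retention.time and --storage.tsdb.retention.size arguments:

$ oc -n openshift-monitoring get statefulset prometheus-k8s -o yaml | grep retention

For the example above, the output should include arguments such as --storage.tsdb.retention.time=24h and --storage.tsdb.retention.size=10GB.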
Configuring audit logs for Metrics Server
You can configure audit logs for Metrics Server to help you troubleshoot issues with the server. Audit logs record the sequence of actions in a cluster. They can record user, application, or control plane activities.
You can configure audit log rules to record specific events and a subset of associated data. The following audit profiles define configuration rules:
- Metadata (default): This profile logs event metadata, including user, timestamps, resource, and verb. It does not record request and response bodies.
- Request: This profile logs event metadata and the request body, but it does not record the response body. This configuration does not apply to non-resource requests.
- RequestResponse: This profile logs event metadata, and request and response bodies. This configuration does not apply to non-resource requests.
- None: None of the previously described events are recorded.
Prerequisites

- You have access to the cluster as a user with the cluster-admin cluster role.
- You have created the cluster-monitoring-config ConfigMap object.
- You have installed the OpenShift CLI (oc).
Procedure

- Edit the cluster-monitoring-config config map in the openshift-monitoring project:

  $ oc -n openshift-monitoring edit configmap cluster-monitoring-config

- Add the audit log configuration for Metrics Server under data/config.yaml:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      metricsServer:
        audit:
          profile: <audit_log_profile>

  - <audit_log_profile>: Specify the audit profile for Metrics Server.

- Save the file to apply the changes. The pods affected by the new configuration are automatically redeployed.

Verification

- Verify that the audit profile is applied:

  $ oc -n openshift-monitoring get deploy metrics-server -o yaml | grep -- '--audit-policy-file='

  Example output

  - --audit-policy-file=/etc/audit/request-profile.yaml
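You can also read the configured profile back from the config map itself; a quick check using jsonpath (note that the dot in the config.yaml key must be escaped):

$ oc -n openshift-monitoring get configmap cluster-monitoring-config -o jsonpath='{.data.config\.yaml}'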
Setting log levels for monitoring components
You can configure the log level for Alertmanager, Prometheus Operator, Prometheus, and Thanos Querier and log verbosity for Metrics Server. You can use these settings for troubleshooting and to gain better insight into how the components are functioning.
The following log levels can be applied to the relevant component in the cluster-monitoring-config ConfigMap object:
- debug. Log debug, informational, warning, and error messages.
- info (default). Log informational, warning, and error messages.
- warn. Log warning and error messages only.
- error. Log error messages only.
Prerequisites

- You have access to the cluster as a user with the cluster-admin cluster role.
- You have created the cluster-monitoring-config ConfigMap object.
- You have installed the OpenShift CLI (oc).
Procedure

- Edit the cluster-monitoring-config config map in the openshift-monitoring project:

  $ oc -n openshift-monitoring edit configmap cluster-monitoring-config

- Add the log configuration for a component under data/config.yaml:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      <component>:
        logLevel: <log_level>
      metricsServer:
        verbosity: <value>
      # ...

  - <component>: Specify the monitoring stack component for which you are setting a log level. Available component values are prometheusK8s, alertmanagerMain, prometheusOperator, and thanosQuerier.
  - <log_level>: Specify the log level for the component. The available values are error, warn, info, and debug. The default value is info.
  - <value>: Specify the verbosity for Metrics Server. Valid values are positive integers. Increasing the number increases the number of logged events; values over 10 are usually unnecessary. The default value is 0.

- Save the file to apply the changes. The pods affected by the new configuration are automatically redeployed.
Verification

- Verify that the log configuration is applied by reviewing the deployment or pod configuration in the related project.

  - The following example checks the log level for the prometheus-operator deployment:

    $ oc -n openshift-monitoring get deploy prometheus-operator -o yaml | grep "log-level"

    Example output

    - --log-level=debug

  - The following example checks the log verbosity for the metrics-server deployment:

    $ oc -n openshift-monitoring get deploy metrics-server -o yaml | grep -- '--v='

    Example output

    - --v=3

- Verify that the pods for the component are running:

  $ oc -n openshift-monitoring get pods

  Note

  If an unrecognized logLevel value is included in the ConfigMap object, the pods for the component might not restart successfully.
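After the pods restart, reading a component's logs confirms that the extra detail is being emitted. A sketch using the pod and container names that appear elsewhere in this document:

$ oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus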
Enabling the query log file for Prometheus
You can configure Prometheus to write all queries that have been run by the engine to a log file.
Important
Because log rotation is not supported, only enable this feature temporarily when you need to troubleshoot an issue. After you finish troubleshooting, disable query logging by reverting the changes you made to the ConfigMap object to enable the feature.
Prerequisites

- You have access to the cluster as a user with the cluster-admin cluster role.
- You have created the cluster-monitoring-config ConfigMap object.
- You have installed the OpenShift CLI (oc).
Procedure

- Edit the cluster-monitoring-config config map in the openshift-monitoring project:

  $ oc -n openshift-monitoring edit configmap cluster-monitoring-config

- Add the queryLogFile parameter for Prometheus under data/config.yaml:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      prometheusK8s:
        queryLogFile: <path>

  - <path>: Add the full path to the file in which queries will be logged.

- Save the file to apply the changes. The pods affected by the new configuration are automatically redeployed.

Verification

- Verify that the pods for the component are running. The following sample command lists the status of pods:

  $ oc -n openshift-monitoring get pods

  Example output

  ...
  prometheus-operator-567c9bc75c-96wkj   2/2   Running   0   62m
  prometheus-k8s-0                       6/6   Running   1   57m
  prometheus-k8s-1                       6/6   Running   1   57m
  thanos-querier-56c76d7df4-2xkpc        6/6   Running   0   57m
  thanos-querier-56c76d7df4-j5p29        6/6   Running   0   57m
  ...

- Read the query log:

  $ oc -n openshift-monitoring exec prometheus-k8s-0 -- cat <path>

  Important

  Revert the setting in the config map after you have examined the logged query information.
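While you reproduce a problematic query, it can be more convenient to follow the log than to dump it once. A sketch that assumes tail is available in the Prometheus container image, as the cat invocation above suggests:

$ oc -n openshift-monitoring exec prometheus-k8s-0 -- tail -f <path>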
Enabling query logging for Thanos Querier
For default platform monitoring in the openshift-monitoring project, you can enable the Cluster Monitoring Operator (CMO) to log all queries run by Thanos Querier.
Important
Because log rotation is not supported, only enable this feature temporarily when you need to troubleshoot an issue. After you finish troubleshooting, disable query logging by reverting the changes you made to the ConfigMap object to enable the feature.
Prerequisites

- You have installed the OpenShift CLI (oc).
- You have access to the cluster as a user with the cluster-admin cluster role.
- You have created the cluster-monitoring-config ConfigMap object.
You can enable query logging for Thanos Querier in the openshift-monitoring project:
Procedure

- Edit the cluster-monitoring-config ConfigMap object in the openshift-monitoring project:

  $ oc -n openshift-monitoring edit configmap cluster-monitoring-config

- Add a thanosQuerier section under data/config.yaml and add values as shown in the following example:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      thanosQuerier:
        enableRequestLogging: <value>
        logLevel: <value>

  - enableRequestLogging: Set the value to true to enable logging and false to disable logging. The default value is false.
  - logLevel: Set the value to debug, info, warn, or error. If no value exists for logLevel, the log level defaults to error.

- Save the file to apply the changes. The pods affected by the new configuration are automatically redeployed.
Verification

- Verify that the Thanos Querier pods are running. The following sample command lists the status of pods in the openshift-monitoring project:

  $ oc -n openshift-monitoring get pods

- Run a test query using the following sample commands as a model:

  $ token=`oc create token prometheus-k8s -n openshift-monitoring`
  $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=cluster_version'

- Run the following command to read the query log:

  $ oc -n openshift-monitoring logs <thanos_querier_pod_name> -c thanos-query

  Note

  Because the thanos-querier pods are highly available (HA) pods, you might be able to see logs in only one pod.

- After you examine the logged query information, disable query logging by changing the enableRequestLogging value to false in the config map.
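For reference, a reverted configuration would look like the following sketch; you can also remove the enableRequestLogging line entirely, because false is the default:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    thanosQuerier:
      enableRequestLogging: false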