Troubleshooting Compliance Operator scans

This section describes how to troubleshoot the Compliance Operator. The information can be useful either to diagnose a problem or provide information in a bug report. Some general tips:

The Compliance Operator emits Kubernetes events when something important happens. You can either view all events in the cluster using the command:
```
 $ oc get events -n openshift-compliance
```
Or view events for an object like a scan using the command:
```
$ oc describe -n openshift-compliance compliancescan/cis-compliance
```
The Compliance Operator consists of several controllers, approximately one per API object. It could be useful to filter only those controllers that correspond to the API object having issues. If a ComplianceRemediation cannot be applied, view the messages from the remediationctrl controller. You can filter the messages from a single controller by parsing with jq:
```
$ oc -n openshift-compliance logs compliance-operator-775d7bddbd-gj58f \
    | jq -c 'select(.logger == "profilebundlectrl")'
```
The timestamps are logged as seconds since UNIX epoch in UTC. To convert them to a human-readable date, use date -d @timestamp --utc, for example:
```
$ date -d @1596184628.955853 --utc
```
Many custom resources, most importantly ComplianceSuite and ScanSetting, allow the debug option to be set. Enabling this option increases verbosity of the OpenSCAP scanner pods, as well as some other helper pods.
If a single rule is passing or failing unexpectedly, it could be helpful to run a single scan or a suite with only that rule to find the rule ID from the corresponding ComplianceCheckResult object and use it as the rule attribute value in a Scan CR. Then, together with the debug option enabled, the scanner container logs in the scanner pod would show the raw OpenSCAP logs.

Anatomy of a scan

The following sections outline the components and stages of Compliance Operator scans.

The compliance content is stored in Profile objects that are generated from a ProfileBundle object. The Compliance Operator creates a ProfileBundle object for the cluster and another for the cluster nodes.

$ oc get -n openshift-compliance profilebundle.compliance

$ oc get -n openshift-compliance profile.compliance

The ProfileBundle objects are processed by deployments labeled with the Bundle name. To troubleshoot an issue with the Bundle, you can find the deployment and view logs of the pods in a deployment:

$ oc logs -n openshift-compliance -lprofile-bundle=ocp4 -c profileparser

$ oc get -n openshift-compliance deployments,pods -lprofile-bundle=ocp4

$ oc logs -n openshift-compliance pods/<pod-name>

$ oc describe -n openshift-compliance pod/<pod-name> -c profileparser

The ScanSetting and ScanSettingBinding objects lifecycle and debugging

With valid compliance content sources, the high-level ScanSetting and ScanSettingBinding objects can be used to generate ComplianceSuite and ComplianceScan objects:

apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSetting
metadata:
  name: my-companys-constraints
debug: true
# For each role, a separate scan will be created pointing
# to a node-role specified in roles
roles:
  - worker
---
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSettingBinding
metadata:
  name: my-companys-compliance-requirements
profiles:
  # Node checks
  - name: rhcos4-e8
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
  # Cluster checks
  - name: ocp4-e8
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
settingsRef:
  name: my-companys-constraints
  kind: ScanSetting
  apiGroup: compliance.openshift.io/v1alpha1

Both ScanSetting and ScanSettingBinding objects are handled by the same controller tagged with logger=scansettingbindingctrl. These objects have no status. Any issues are communicated in form of events:

Events:
  Type     Reason        Age    From                    Message
  ----     ------        ----   ----                    -------
  Normal   SuiteCreated  9m52s  scansettingbindingctrl  ComplianceSuite openshift-compliance/my-companys-compliance-requirements created

Now a ComplianceSuite object is created. The flow continues to reconcile the newly created ComplianceSuite.

ComplianceSuite custom resource lifecycle and debugging

The ComplianceSuite CR is a wrapper around ComplianceScan CRs. The ComplianceSuite CR is handled by controller tagged with logger=suitectrl. This controller handles creating scans from a suite, reconciling and aggregating individual Scan statuses into a single Suite status. If a suite is set to execute periodically, the suitectrl also handles creating a CronJob CR that re-runs the scans in the suite after the initial run is done:

$ oc get cronjobs

Example output

NAME                                           SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
<cron_name>                                    0 1 * * *   False     0        <none>          151m

For the most important issues, events are emitted. View them with oc describe compliancesuites/<name>. The Suite objects also have a Status subresource that is updated when any of Scan objects that belong to this suite update their Status subresource. After all expected scans are created, control is passed to the scan controller.

ComplianceScan custom resource lifecycle and debugging

The ComplianceScan CRs are handled by the scanctrl controller. This is also where the actual scans happen and the scan results are created. Each scan goes through several phases:

Pending phase

The scan is validated for correctness in this phase. If some parameters like storage size are invalid, the scan transitions to DONE with ERROR result, otherwise proceeds to the Launching phase.

Launching phase

In this phase, several config maps that contain either environment for the scanner pods or directly the script that the scanner pods will be evaluating. List the config maps:

$ oc -n openshift-compliance get cm \
-l compliance.openshift.io/scan-name=rhcos4-e8-worker,complianceoperator.openshift.io/scan-script=

These config maps will be used by the scanner pods. If you ever needed to modify the scanner behavior, change the scanner debug level or print the raw results, modifying the config maps is the way to go. Afterwards, a persistent volume claim is created per scan to store the raw ARF results:

$ oc get pvc -n openshift-compliance -lcompliance.openshift.io/scan-name=rhcos4-e8-worker

The PVCs are mounted by a per-scan ResultServer deployment. A ResultServer is a simple HTTP server where the individual scanner pods upload the full ARF results to. Each server can run on a different node. The full ARF results might be very large and you cannot presume that it would be possible to create a volume that could be mounted from multiple nodes at the same time. After the scan is finished, the ResultServer deployment is scaled down. The PVC with the raw results can be mounted from another custom pod and the results can be fetched or inspected. The traffic between the scanner pods and the ResultServer is protected by mutual TLS protocols.

Finally, the scanner pods are launched in this phase; one scanner pod for a Platform scan instance and one scanner pod per matching node for a node scan instance. The per-node pods are labeled with the node name. Each pod is always labeled with the ComplianceScan name:

$ oc get pods -lcompliance.openshift.io/scan-name=rhcos4-e8-worker,workload=scanner --show-labels

Example output

NAME                                                              READY   STATUS      RESTARTS   AGE   LABELS
rhcos4-e8-worker-ip-10-0-169-90.eu-north-1.compute.internal-pod   0/2     Completed   0          39m   compliance.openshift.io/scan-name=rhcos4-e8-worker,targetNode=ip-10-0-169-90.eu-north-1.compute.internal,workload=scanner

+ The scan then proceeds to the Running phase.

Running phase

The running phase waits until the scanner pods finish. The following terms and processes are in use in the running phase:

init container: There is one init container called content-container. It runs the contentImage container and executes a single command that copies the contentFile to the /content directory shared with the other containers in this pod.
scanner: This container runs the scan. For node scans, the container mounts the node filesystem as /host and mounts the content delivered by the init container. The container also mounts the entrypoint ConfigMap created in the Launching phase and executes it. The default script in the entrypoint ConfigMap executes OpenSCAP and stores the result files in the /results directory shared between the pod’s containers. Logs from this pod can be viewed to determine what the OpenSCAP scanner checked. More verbose output can be viewed with the debug flag.

logcollector: The logcollector container waits until the scanner container finishes. Then, it uploads the full ARF results to the ResultServer and separately uploads the XCCDF results along with scan result and OpenSCAP result code as a ConfigMap. These result config maps are labeled with the scan name (compliance.openshift.io/scan-name=rhcos4-e8-worker):

$ oc describe cm/rhcos4-e8-worker-ip-10-0-169-90.eu-north-1.compute.internal-pod

Example output

      Name:         rhcos4-e8-worker-ip-10-0-169-90.eu-north-1.compute.internal-pod
      Namespace:    openshift-compliance
      Labels:       compliance.openshift.io/scan-name-scan=rhcos4-e8-worker
                    complianceoperator.openshift.io/scan-result=
      Annotations:  compliance-remediations/processed:
                    compliance.openshift.io/scan-error-msg:
                    compliance.openshift.io/scan-result: NON-COMPLIANT
                    OpenSCAP-scan-result/node: ip-10-0-169-90.eu-north-1.compute.internal

      Data
      ====
      exit-code:
      ----
      2
      results:
      ----
      <?xml version="1.0" encoding="UTF-8"?>
      ...

Scanner pods for Platform scans are similar, except:

There is one extra init container called api-resource-collector that reads the OpenSCAP content provided by the content-container init, container, figures out which API resources the content needs to examine and stores those API resources to a shared directory where the scanner container would read them from.
The scanner container does not need to mount the host file system.

When the scanner pods are done, the scans move on to the Aggregating phase.

Aggregating phase

In the aggregating phase, the scan controller spawns yet another pod called the aggregator pod. Its purpose it to take the result ConfigMap objects, read the results and for each check result create the corresponding Kubernetes object. If the check failure can be automatically remediated, a ComplianceRemediation object is created. To provide human-readable metadata for the checks and remediations, the aggregator pod also mounts the OpenSCAP content using an init container.

When a config map is processed by an aggregator pod, it is labeled the compliance-remediations/processed label. The result of this phase are ComplianceCheckResult objects:

$ oc get compliancecheckresults -lcompliance.openshift.io/scan-name=rhcos4-e8-worker

Example output

NAME                                                       STATUS   SEVERITY
rhcos4-e8-worker-accounts-no-uid-except-zero               PASS     high
rhcos4-e8-worker-audit-rules-dac-modification-chmod        FAIL     medium

and ComplianceRemediation objects:

$ oc get complianceremediations -lcompliance.openshift.io/scan-name=rhcos4-e8-worker

Example output

NAME                                                       STATE
rhcos4-e8-worker-audit-rules-dac-modification-chmod        NotApplied
rhcos4-e8-worker-audit-rules-dac-modification-chown        NotApplied
rhcos4-e8-worker-audit-rules-execution-chcon               NotApplied
rhcos4-e8-worker-audit-rules-execution-restorecon          NotApplied
rhcos4-e8-worker-audit-rules-execution-semanage            NotApplied
rhcos4-e8-worker-audit-rules-execution-setfiles            NotApplied

After these CRs are created, the aggregator pod exits and the scan moves on to the Done phase.

Done phase

In the final scan phase, the scan resources are cleaned up if needed and the ResultServer deployment is either scaled down (if the scan was one-time) or deleted if the scan is continuous; the next scan instance would then recreate the deployment again.

It is also possible to trigger a re-run of a scan in the Done phase by annotating it:

$ oc -n openshift-compliance \
annotate compliancescans/rhcos4-e8-worker compliance.openshift.io/rescan=

After the scan reaches the Done phase, nothing else happens on its own unless the remediations are set to be applied automatically with autoApplyRemediations: true. The OpenShift Container Platform administrator would now review the remediations and apply them as needed. If the remediations are set to be applied automatically, the ComplianceSuite controller takes over in the Done phase, pauses the machine config pool to which the scan maps to and applies all the remediations in one go. If a remediation is applied, the ComplianceRemediation controller takes over.

ComplianceRemediation controller lifecycle and debugging

The example scan has reported some findings. One of the remediations can be enabled by toggling its apply attribute to true:

$ oc patch complianceremediations/rhcos4-e8-worker-audit-rules-dac-modification-chmod --patch '{"spec":{"apply":true}}' --type=merge

The ComplianceRemediation controller (logger=remediationctrl) reconciles the modified object. The result of the reconciliation is change of status of the remediation object that is reconciled, but also a change of the rendered per-suite MachineConfig object that contains all the applied remediations.

The MachineConfig object always begins with 75- and is named after the scan and the suite:

$ oc get mc | grep 75-

Example output

75-rhcos4-e8-worker-my-companys-compliance-requirements                                                3.2.0             2m46s

The remediations the mc currently consists of are listed in the machine config’s annotations:

$ oc describe mc/75-rhcos4-e8-worker-my-companys-compliance-requirements

Example output

Name:         75-rhcos4-e8-worker-my-companys-compliance-requirements
Labels:       machineconfiguration.openshift.io/role=worker
Annotations:  remediation/rhcos4-e8-worker-audit-rules-dac-modification-chmod:

The ComplianceRemediation controller’s algorithm works like this:

All currently applied remediations are read into an initial remediation set.
If the reconciled remediation is supposed to be applied, it is added to the set.
A MachineConfig object is rendered from the set and annotated with names of remediations in the set. If the set is empty (the last remediation was unapplied), the rendered MachineConfig object is removed.
If and only if the rendered machine config is different from the one already applied in the cluster, the applied MC is updated (or created, or deleted).
Creating or modifying a MachineConfig object triggers a reboot of nodes that match the machineconfiguration.openshift.io/role label - see the Machine Config Operator documentation for more details.

The remediation loop ends once the rendered machine config is updated, if needed, and the reconciled remediation object status is updated. In our case, applying the remediation would trigger a reboot. After the reboot, annotate the scan to re-run it:

$ oc -n openshift-compliance \
annotate compliancescans/rhcos4-e8-worker compliance.openshift.io/rescan=

The scan will run and finish. Check for the remediation to pass:

$ oc -n openshift-compliance \
get compliancecheckresults/rhcos4-e8-worker-audit-rules-dac-modification-chmod

Example output

NAME                                                  STATUS   SEVERITY
rhcos4-e8-worker-audit-rules-dac-modification-chmod   PASS     medium

Useful labels

Each pod that is spawned by the Compliance Operator is labeled specifically with the scan it belongs to and the work it does. The scan identifier is labeled with the compliance.openshift.io/scan-name label. The workload identifier is labeled with the workload label.

The Compliance Operator schedules the following workloads:

scanner: Performs the compliance scan.
resultserver: Stores the raw results for the compliance scan.
aggregator: Aggregates the results, detects inconsistencies and outputs result objects (checkresults and remediations).
suitererunner: Will tag a suite to be re-run (when a schedule is set).
profileparser: Parses a datastream and creates the appropriate profiles, rules and variables.

When debugging and logs are required for a certain workload, run:

$ oc logs -l workload=<workload_name> -c <container_name>

Increasing Compliance Operator resource limits

In some cases, the Compliance Operator might require more memory than the default limits allow. The best way to mitigate this issue is to set custom resource limits.

To increase the default memory and CPU limits of scanner pods, see `ScanSetting` Custom resource.

Procedure

To increase the Operator’s memory limits to 500 Mi, create the following patch file named co-memlimit-patch.yaml:
```
spec:
  config:
    resources:
      limits:
        memory: 500Mi
```

Apply the patch file:

$ oc patch sub compliance-operator -nopenshift-compliance --patch-file co-memlimit-patch.yaml --type=merge

Configuring Operator resource constraints

The resources field defines Resource Constraints for all the containers in the Pod created by the Operator Lifecycle Manager (OLM).

Note

Resource Constraints applied in this process overwrites the existing resource constraints.

Procedure

Inject a request of 0.25 cpu and 64 Mi of memory, and a limit of 0.5 cpu and 128 Mi of memory in each container by editing the Subscription object:

kind: Subscription
metadata:
  name: compliance-operator
  namespace: openshift-compliance
spec:
  package: package-name
  channel: stable
  config:
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

Configuring ScanSetting resources

When using the Compliance Operator in a cluster that contains more than 500 MachineConfigs, the ocp4-pci-dss-api-checks-pod pod may pause in the init phase when performing a Platform scan.

Note

Resource constraints applied in this process overwrites the existing resource constraints.

Procedure

Confirm the ocp4-pci-dss-api-checks-pod pod is stuck in the Init:OOMKilled status:

$ oc get pod ocp4-pci-dss-api-checks-pod -w

Example output

NAME                          READY   STATUS     RESTARTS        AGE
ocp4-pci-dss-api-checks-pod   0/2     Init:1/2   8 (5m56s ago)   25m
ocp4-pci-dss-api-checks-pod   0/2     Init:OOMKilled   8 (6m19s ago)   26m

Edit the scanLimits attribute in the ScanSetting CR to increase the available memory for the ocp4-pci-dss-api-checks-pod pod:

timeout: 30m
strictNodeScan: true
metadata:
  name: default
  namespace: openshift-compliance
kind: ScanSetting
showNotApplicable: false
rawResultStorage:
  nodeSelector:
    node-role.kubernetes.io/master: ''
  pvAccessModes:
    - ReadWriteOnce
  rotation: 3
  size: 1Gi
  tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    - effect: NoSchedule
      key: node.kubernetes.io/memory-pressure
      operator: Exists
schedule: 0 1 * * *
roles:
  - master
  - worker
apiVersion: compliance.openshift.io/v1alpha1
maxRetryOnTimeout: 3
scanTolerations:
  - operator: Exists
scanLimits:
  memory: 1024Mi

The default setting is 500Mi.

Apply the ScanSetting CR to your cluster:
```
$ oc apply -f scansetting.yaml
```

Configuring ScanSetting timeout

The ScanSetting object has a timeout option that can be specified in the ComplianceScanSetting object as a duration string, such as 1h30m. If the scan does not finish within the specified timeout, the scan reattempts until the maxRetryOnTimeout limit is reached.

Procedure

To set a timeout and maxRetryOnTimeout in ScanSetting, modify an existing ScanSetting object:

apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSetting
metadata:
  name: default
  namespace: openshift-compliance
rawResultStorage:
  rotation: 3
  size: 1Gi
roles:
- worker
- master
scanTolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/master
  operator: Exists
schedule: '0 1 * * *'
timeout: '10m0s' 
maxRetryOnTimeout: 3

The timeout variable is defined as a duration string, such as 1h30m. The default value is 30m. To disable the timeout, set the value to 0s.
The maxRetryOnTimeout variable defines how many times a retry is attempted. The default value is 3.

Getting support

If you experience difficulty with a procedure described in this documentation, or with OpenShift Container Platform in general, visit the Red Hat Customer Portal.

From the Customer Portal, you can:

Search or browse through the Red Hat Knowledgebase of articles and solutions relating to Red Hat products.
Submit a support case to Red Hat Support.
Access other product documentation.

To identify issues with your cluster, you can use Red Hat Lightspeed in OpenShift Cluster Manager. Red Hat Lightspeed provides details about issues and, if available, information on how to solve a problem.

If you have a suggestion for improving this documentation or have found an error, submit a Jira issue for the most relevant documentation component. Please provide specific details, such as the section name and OpenShift Container Platform version.