Overview of responsibilities for OpenShift Container Platform
This documentation outlines Red Hat, Amazon Web Services (AWS), and customer responsibilities for the OpenShift Container Platform managed service.
Shared responsibilities for OpenShift Container Platform
While Red Hat and Amazon Web Services (AWS) manage the OpenShift Container Platform services, the customer shares certain responsibilities. The OpenShift Container Platform services are accessed remotely, hosted on public cloud resources, created in customer-owned AWS accounts, and have underlying platform and data security that is owned by Red Hat.
Important
If the cluster-admin role is added to a user, see the responsibilities and exclusion notes in the Red Hat Enterprise Agreement Appendix 4 (Online Subscription Services).
| Resource | Incident and operations management | Change management | Access and identity authorization | Security and regulation compliance | Disaster recovery |
|---|---|---|---|---|---|
| Customer data | Customer | Customer | Customer | Customer | Customer |
| Customer applications | Customer | Customer | Customer | Customer | Customer |
| Developer services | Customer | Customer | Customer | Customer | Customer |
| Platform monitoring | Red Hat | Red Hat | Red Hat | Red Hat | Red Hat |
| Logging | Red Hat | Red Hat and Customer | Red Hat and Customer | Red Hat and Customer | Red Hat |
| Application networking | Red Hat and Customer | Red Hat and Customer | Red Hat and Customer | Red Hat | Red Hat |
| Cluster networking | Red Hat [1] | Red Hat and Customer [2] | Red Hat and Customer | Red Hat [1] | Red Hat [1] |
| Virtual networking management | Red Hat and Customer | Red Hat and Customer | Red Hat and Customer | Red Hat and Customer | Red Hat and Customer |
| Virtual compute management (control plane, infrastructure and worker nodes) | Red Hat | Red Hat | Red Hat | Red Hat | Red Hat |
| Cluster version | Red Hat | Red Hat and Customer | Red Hat | Red Hat | Red Hat |
| Capacity management | Red Hat | Red Hat and Customer | Red Hat | Red Hat | Red Hat |
| Virtual storage management | Red Hat | Red Hat | Red Hat | Red Hat | Red Hat |
| AWS software (public AWS services) | AWS | AWS | AWS | AWS | AWS |
| Hardware/AWS global infrastructure | AWS | AWS | AWS | AWS | AWS |
[1] If the customer chooses to use their own CNI plugin, the responsibility shifts to the customer.
[2] The customer must configure their firewall to grant access to the required OpenShift and AWS domains and ports before the cluster is provisioned. For more information, see "AWS firewall prerequisites".
Tasks for shared responsibilities by area
Red Hat, AWS, and the customer all share responsibility for the monitoring, maintenance, and overall health of an OpenShift Container Platform (ROSA) cluster. The tables below delineate the responsibilities for each of the listed resources.
Review and action cluster notifications
Cluster notifications (sometimes referred to as service logs) are messages about the status, health, or performance of your cluster.
Cluster notifications are the primary way that Red Hat Site Reliability Engineering (SRE) communicates with you about the health of your managed cluster. Red Hat SRE may also use cluster notifications to prompt you to perform an action in order to resolve or prevent an issue with your cluster.
Cluster owners and administrators must regularly review and action cluster notifications to ensure clusters remain healthy and supported.
You can view cluster notifications in the Red Hat Hybrid Cloud Console, in the Cluster history tab for your cluster. By default, only the cluster owner receives cluster notifications as emails. If other users need to receive cluster notification emails, add each user as a notification contact for your cluster.
Cluster notification policy
Cluster notifications are designed to keep you informed about the health of your cluster and high impact events that affect it.
Most cluster notifications are generated and sent automatically to ensure that you are immediately informed of problems or important changes to the state of your cluster.
In certain situations, Red Hat Site Reliability Engineering (SRE) creates and sends cluster notifications to provide additional context and guidance for a complex issue.
Cluster notifications are not sent for low-impact events, low-risk security updates, routine operations and maintenance, or minor, transient issues that are quickly resolved by Red Hat SRE.
Red Hat services automatically send notifications when:
-
Remote health monitoring or environment verification checks detect an issue in your cluster, for example, when a worker node has low disk space.
-
Significant cluster life cycle events occur, for example, when scheduled maintenance or upgrades begin, or cluster operations are impacted by an event, but do not require customer intervention.
-
Significant cluster management changes occur, for example, when cluster ownership or administrative control is transferred from one user to another.
-
Your cluster subscription is changed or updated, for example, when Red Hat makes updates to subscription terms or features available to your cluster.
SRE creates and sends notifications when:
-
An incident results in a degradation or outage that impacts your cluster’s availability or performance, for example, your cloud provider has a regional outage. SRE sends subsequent notifications to inform you of incident resolution progress, and when the incident is resolved.
-
A security vulnerability, security breach, or unusual activity is detected on your cluster.
-
Red Hat detects that changes you have made are creating or may result in cluster instability.
-
Red Hat detects that your workloads are causing performance degradation or instability in your cluster.
Incident and operations management
Red Hat is responsible for overseeing the service components required for default platform networking. AWS is responsible for protecting the hardware infrastructure that runs all of the services offered in the AWS Cloud. The customer is responsible for incident and operations management of customer application data and any custom networking the customer has configured for the cluster network or virtual network.
| Resource | Service responsibilities | Customer responsibilities |
|---|---|---|
| Application networking | Red Hat | |
| Cluster networking | Red Hat | |
| Virtual networking management | Red Hat | |
| Virtual storage management | Red Hat | |
| Platform monitoring | Red Hat | |
| Incident management | Red Hat | |
| Infrastructure and data resiliency | Red Hat | |
| Cluster capacity | Red Hat | |
| AWS software (public AWS services) | AWS | |
| Hardware/AWS global infrastructure | AWS | |
Platform monitoring
Platform audit logs are securely forwarded to a centralized security information and event monitoring (SIEM) system, where they may trigger configured alerts to the Red Hat SRE team and are also subject to manual review. Audit logs are retained in the SIEM system for one year. Audit logs for a given cluster are not deleted at the time the cluster is deleted.
Incident management
An incident is an event that results in a degradation or outage of one or more Red Hat services.
An incident can be raised by a customer or a Customer Experience and Engagement (CEE) member through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team.
Depending on the impact on the service and customer, the incident is categorized in terms of severity.
When managing a new incident, Red Hat uses the following general workflow:
-
An SRE first responder is alerted to a new incident and begins an initial investigation.
-
After the initial investigation, the incident is assigned an incident lead, who coordinates the recovery efforts.
-
The incident lead manages all communication and coordination around recovery, including any relevant notifications and support case updates.
-
When the incident is resolved, a brief summary of the incident and its resolution is provided in the customer-initiated support ticket. This summary helps customers understand the incident and its resolution in more detail.
If customers require more information than what is provided in the support ticket, they can request it through the following workflow:
-
The customer must make a request for the additional information within 5 business days of the incident resolution.
-
Depending on the severity of the incident, Red Hat may provide customers with a root cause summary or a root cause analysis (RCA) in the support ticket. Measured from the incident resolution, a root cause summary is provided within 7 business days and a root cause analysis within 30 business days.
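The time limits in this workflow can be worked out with simple business-day arithmetic. The following is a minimal sketch, not a Red Hat tool: the function names are hypothetical, and it assumes that a business day is any Monday through Friday, with no holiday calendar, because this document does not define the term.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`.

    Assumption: business days are Monday through Friday; holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 correspond to Monday-Friday
            remaining -= 1
    return current


def rca_deadlines(resolved_on: date) -> dict[str, date]:
    """Compute the three deadlines described above, counted from incident resolution."""
    return {
        "request_by": add_business_days(resolved_on, 5),
        "root_cause_summary_by": add_business_days(resolved_on, 7),
        "root_cause_analysis_by": add_business_days(resolved_on, 30),
    }


if __name__ == "__main__":
    for name, deadline in rca_deadlines(date(2024, 1, 1)).items():
        print(f"{name}: {deadline.isoformat()}")
```

Under these assumptions, an incident resolved on a Monday has a request deadline on the following Monday, because the intervening weekend is skipped.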
Red Hat also assists with customer incidents raised through support cases. Red Hat can assist with activities including but not limited to:
-
Forensic gathering, including isolating virtual compute
-
Guiding compute image collection
-
Providing collected audit logs
Cluster capacity
The impact of a cluster upgrade on capacity is evaluated as part of the upgrade testing process to ensure that capacity is not negatively impacted by new additions to the cluster. During a cluster upgrade, additional worker nodes are added to make sure that total cluster capacity is maintained during the upgrade process.
Capacity evaluations by the Red Hat SRE staff also happen in response to alerts from the cluster, after usage thresholds are exceeded for a certain period of time. Such alerts can also result in a notification to the customer.
Change management
This section describes the policies about how cluster and configuration changes, patches, and releases are managed.
Red Hat is responsible for enabling changes to the cluster infrastructure and services that the customer will control, as well as maintaining versions for the control plane nodes, infrastructure nodes and services, and worker nodes. AWS is responsible for protecting the hardware infrastructure that runs all of the services offered in the AWS Cloud. The customer is responsible for initiating infrastructure change requests and installing and maintaining optional services and networking configurations on the cluster, as well as all changes to customer data and customer applications.
Customer-initiated changes
You can initiate changes using self-service capabilities such as cluster deployment, worker node scaling, or cluster deletion.
Change history is captured in the Cluster History section in the OpenShift Cluster Manager Overview tab, and is available for you to view. The change history includes, but is not limited to, logs from the following changes:
-
Adding or removing identity providers
-
Adding or removing users to or from the dedicated-admins group
-
Scaling the cluster compute nodes
-
Scaling the cluster load balancer
-
Scaling the cluster persistent storage
-
Upgrading the cluster
You can implement a maintenance exclusion by avoiding the following changes in OpenShift Cluster Manager:
-
Deleting a cluster
-
Adding, modifying, or removing identity providers
-
Adding, modifying, or removing a user from an elevated group
-
Installing or removing add-ons
-
Modifying cluster networking configurations
-
Adding, modifying, or removing machine pools
-
Enabling or disabling user workload monitoring
-
Initiating an upgrade
Important
To enforce the maintenance exclusion, ensure machine pool autoscaling or automatic upgrade policies have been disabled. After the maintenance exclusion has been lifted, proceed with enabling machine pool autoscaling or automatic upgrade policies as desired.
Red Hat-initiated changes
Red Hat site reliability engineering (SRE) manages the infrastructure, code, and configuration of OpenShift Container Platform using a GitOps workflow and fully automated CI/CD pipelines. This process ensures that Red Hat can safely introduce service improvements on a continuous basis without negatively impacting customers.
Every proposed change undergoes a series of automated verifications immediately upon check-in. Changes are then deployed to a staging environment where they undergo automated integration testing. Finally, changes are deployed to the production environment. Each step is fully automated.
An authorized Red Hat SRE reviewer must approve advancement to each step. The reviewer cannot be the same individual who proposed the change. All changes and approvals are fully auditable as part of the GitOps workflow.
Some changes are released to production incrementally, using feature flags to control availability of new features to specified clusters or customers, such as private or public previews.
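The approval rules described in this section (each step needs an approval, and the reviewer cannot be the person who proposed the change) can be illustrated with a small model. This is a sketch only and does not represent Red Hat's actual tooling: the `Change` class, the stage names, and the in-order approval check are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Stage names paraphrase the steps described above; they are illustrative.
STAGES = ("verification", "staging", "production")


@dataclass
class Change:
    author: str
    approvals: dict[str, str] = field(default_factory=dict)  # stage -> reviewer

    def approve(self, stage: str, reviewer: str) -> None:
        """Record an approval for advancing the change to `stage`."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        if reviewer == self.author:
            # The reviewer cannot be the individual who proposed the change.
            raise PermissionError("reviewer cannot be the change author")
        earlier = STAGES[: STAGES.index(stage)]
        missing = [s for s in earlier if s not in self.approvals]
        if missing:
            # Assumption: stages are approved strictly in order.
            raise RuntimeError(f"missing approval for: {', '.join(missing)}")
        self.approvals[stage] = reviewer

    def fully_approved(self) -> bool:
        """True when every stage has a recorded approval."""
        return all(stage in self.approvals for stage in STAGES)


if __name__ == "__main__":
    change = Change(author="alice")
    for stage in STAGES:
        change.approve(stage, reviewer="bob")
    print(change.fully_approved())
```

Keeping the approvals in a stage-to-reviewer mapping mirrors the auditability requirement: every advancement has a named approver that can be inspected later.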
Patch management
OpenShift Container Platform software and the underlying immutable Red Hat CoreOS (RHCOS) operating system image are patched for bugs and vulnerabilities in regular z-stream upgrades. Read more about RHCOS architecture in the OpenShift Container Platform documentation.
Release management
Red Hat does not automatically upgrade your clusters. You can schedule cluster upgrades at regular intervals (recurring upgrade) or just once (individual upgrade) using the OpenShift Cluster Manager web console. Red Hat might forcefully upgrade a cluster to a new z-stream version only if the cluster is affected by a critical impact CVE.
Note
Because the required permissions can change between y-stream releases, the AWS managed policies are automatically updated before an upgrade can be performed.
You can review the history of all cluster upgrade events in the OpenShift Cluster Manager web console.
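The terms z-stream and y-stream refer to the position of the version component that changes in an `x.y.z` version number: a z-stream upgrade changes only the last component, while a y-stream upgrade changes the middle one. The following sketch classifies an upgrade by comparing two version strings. The function names are hypothetical, and the sketch assumes plain three-part numeric versions without pre-release suffixes.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split an 'x.y.z' version string into integers.

    Assumption: exactly three numeric components, no suffixes.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def upgrade_kind(current: str, target: str) -> str:
    """Classify an upgrade as 'z-stream', 'y-stream', or 'major'."""
    cur = parse_version(current)
    tgt = parse_version(target)
    if tgt <= cur:
        raise ValueError("target version must be newer than current version")
    if tgt[0] != cur[0]:
        return "major"
    if tgt[1] != cur[1]:
        return "y-stream"
    return "z-stream"


if __name__ == "__main__":
    print(upgrade_kind("4.14.3", "4.14.9"))
    print(upgrade_kind("4.14.9", "4.15.0"))
```

A check like this can help decide whether an upgrade falls in the y-stream category, where the note above says that required permissions can change.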
Service and Customer resource responsibilities
The following table defines the responsibilities for cluster resources.
| Resource | Service responsibilities | Customer responsibilities |
|---|---|---|
| Logging | Red Hat | |
| Application networking | Red Hat | |
| Cluster networking | Red Hat | |
| Virtual networking management | Red Hat | |
| Virtual compute management | Red Hat | |
| Cluster version | Red Hat | |
| Capacity management | Red Hat | |
| Virtual storage management | Red Hat | |
| AWS software (public AWS services) | AWS Compute: Provide the Amazon EC2 service, used for ROSA relevant resources. Storage: Provide Amazon EBS, used by ROSA to provision local node storage and persistent volume storage for the cluster. Storage: Provide Amazon S3, used for the ROSA built-in image registry. Networking: Provide the following AWS Cloud services, used by ROSA to satisfy virtual networking infrastructure needs: Networking: Provide the following AWS services, which customers can optionally integrate with ROSA: | |
| Hardware/AWS global infrastructure | AWS | |
-
For more information on authentication flow for AWS STS, see Authentication flow for AWS STS.
-
For more information on pruning images, see Automatically pruning Images.
Security and regulation compliance
The following table outlines the responsibilities with regard to security and regulation compliance:
| Resource | Service responsibilities | Customer responsibilities |
|---|---|---|
| Logging | Red Hat | |
| Virtual networking management | Red Hat | |
| Virtual storage management | Red Hat | |
| Virtual compute management | Red Hat | |
| AWS software (public AWS services) | AWS Compute: Secure Amazon EC2, used for ROSA. For more information, see Infrastructure security in Amazon EC2 in the Amazon EC2 User Guide. Storage: Secure Amazon Elastic Block Store (EBS), used for ROSA as well as Kubernetes persistent volumes. For more information, see Data protection in Amazon EC2 in the Amazon EC2 User Guide. Storage: Provide AWS KMS, which ROSA uses to For more information, see Amazon EBS encryption in the Amazon EC2 User Guide. Storage: Secure Amazon S3, used for the ROSA service’s built-in container image registry. For more information, see Amazon S3 security in the S3 User Guide. Networking: Provide security capabilities and services to increase privacy and control network access on AWS global infrastructure, including network firewalls built into Amazon VPC, private or dedicated network connections, and automatic encryption of all traffic on the AWS global and regional networks between AWS secured facilities. For more information, see the AWS Shared Responsibility Model and Infrastructure security in the Introduction to AWS Security whitepaper. | |
| Hardware/AWS global infrastructure | AWS | |
Disaster recovery
Disaster recovery includes data and configuration backup, replicating data and configuration to the disaster recovery environment, and failover on disaster events.
OpenShift Container Platform (ROSA) provides disaster recovery for failures that occur at the pod, node, and availability zone levels.
All disaster recovery requires that the customer use best practices for deploying highly available applications, storage, and cluster architecture, such as multiple machine pools across multiple availability zones, to account for the level of desired availability.
One cluster with a single machine pool will not provide disaster avoidance or recovery in the event of an availability zone or region outage. Multiple clusters with single machine pools with customer-maintained failover can account for outages at the zone or at the regional level.
One cluster with multiple machine pools across multiple availability zones will not provide disaster avoidance or recovery in the event of a full region outage. Multiple clusters in several regions, each with multiple machine pools in more than one availability zone and with customer-maintained failover, can account for outages at the regional level.
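The topology rules in the preceding paragraphs can be summarized as a small decision function. This is an illustrative sketch only: the function and parameter names are hypothetical, and it assumes that clusters in a multi-cluster setup are placed in different availability zones (or regions) and that applications follow the high-availability practices mentioned above.

```python
def covered_outage_levels(
    cluster_count: int,
    region_count: int,
    multi_az_pools: bool,
    customer_failover: bool,
) -> set[str]:
    """Return the outage levels a topology can account for.

    cluster_count:     number of clusters the customer runs
    region_count:      number of distinct regions those clusters span
    multi_az_pools:    machine pools are spread across multiple availability zones
    customer_failover: the customer maintains failover between clusters
    """
    # Pod and node failures are handled within a single cluster.
    covered = {"pod", "node"}

    # Failover between clusters only helps if there is more than one cluster.
    cross_cluster = cluster_count > 1 and customer_failover

    # Zone outages: multi-AZ machine pools, or failover to a cluster that
    # is assumed to run in a different zone.
    if multi_az_pools or cross_cluster:
        covered.add("availability zone")

    # Region outages: failover between clusters in more than one region.
    if cross_cluster and region_count > 1:
        covered.add("region")

    return covered


if __name__ == "__main__":
    single = covered_outage_levels(1, 1, multi_az_pools=False, customer_failover=False)
    print(sorted(single))
    multi_region = covered_outage_levels(2, 2, multi_az_pools=True, customer_failover=True)
    print(sorted(multi_region))
```

The function makes explicit that regional coverage always depends on customer-maintained failover between clusters, regardless of how many availability zones a single cluster spans.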
| Resource | Service responsibilities | Customer responsibilities |
|---|---|---|
| Virtual networking management | Red Hat | |
| Virtual storage management | Red Hat | |
| Virtual compute management | Red Hat - Provide the ability for the customer to manually or automatically replace failed worker nodes. | |
| AWS software (public AWS services) | AWS Compute: Provide Amazon EC2 features that support data resiliency such as Amazon EBS snapshots and Amazon EC2 Auto Scaling. For more information, see Resilience in Amazon EC2 in the EC2 User Guide. Storage: Provide the ability for the ROSA service and customers to back up the Amazon EBS volume on the cluster through Amazon EBS volume snapshots. Storage: For information about Amazon S3 features that support data resiliency, see Resilience in Amazon S3. Networking: For information about Amazon VPC features that support data resiliency, see Resilience in Amazon Virtual Private Cloud in the Amazon VPC User Guide. | |
| Hardware/AWS global infrastructure | AWS | |
Red Hat managed resources
Overview
The following covers all OpenShift Container Platform resources that are managed or protected by the Service Reliability Engineering Platform (SRE-P) Team. Customers should not try to change these resources because doing so can lead to cluster instability.
Managed resources
The following list displays the OpenShift Container Platform resources managed by OpenShift Hive, the centralized fleet configuration management system. These resources are in addition to the OpenShift/ROSA platform resources created during installation. OpenShift Hive continually reconciles consistency across all OpenShift Container Platform clusters. Changes to OpenShift Container Platform resources should be made through OpenShift Cluster Manager so that OpenShift Cluster Manager and Hive are synchronized. Contact ocm-feedback@redhat.com if OpenShift Cluster Manager does not support modifying the resources in question.
(Note that the following may not be visible in your ROSA cluster)
link:https://raw.githubusercontent.com/openshift/managed-cluster-config/master/resources/managed/all-osd-resources.yaml[role=include]
OpenShift Container Platform core namespaces
OpenShift Container Platform core namespaces are installed by default during cluster installation.
(Note that the following may not be visible in your ROSA cluster)
link:https://raw.githubusercontent.com/openshift/managed-cluster-config/master/deploy/osd-managed-resources/ocp-namespaces.ConfigMap.yaml[role=include]
OpenShift Container Platform add-on namespaces
OpenShift Container Platform add-ons are services available for installation after cluster installation. These additional services include Red Hat OpenShift Dev Spaces, Red Hat OpenShift API Management, and Cluster Logging Operator. Any changes to resources within the following namespaces can be overridden by the add-on during upgrades, which can lead to unsupported configurations for the add-on functionality.
List of add-on managed namespaces
link:https://raw.githubusercontent.com/openshift/managed-cluster-config/master/resources/addons-namespaces/main.yaml[role=include]
OpenShift Container Platform validating webhooks
OpenShift Container Platform validating webhooks are a set of dynamic admission controls maintained by the OpenShift SRE team. These HTTP callbacks, also known as webhooks, are called for various types of requests to ensure cluster stability. The following list describes the various webhooks with rules containing the registered operations and resources that are controlled. Any attempt to circumvent these validating webhooks could affect the stability and supportability of the cluster.
List of validating webhooks
link:https://raw.githubusercontent.com/openshift/managed-cluster-validating-webhooks/master/docs/webhooks.json[role=include]
Additional customer responsibilities for data and applications
The customer is responsible for the applications, workloads, and data that they deploy to Red Hat OpenShift Service on AWS. However, Red Hat and AWS provide various tools to help the customer manage data and applications on the platform.
| Resource | Red Hat and AWS | Customer responsibilities |
|---|---|---|
| Customer data | Red Hat, AWS | |
| Customer applications | Red Hat, AWS | |