Managing distributed workloads with the Leader Worker Set Operator

You can use the Leader Worker Set Operator to manage distributed inference workloads and process large-scale inference requests efficiently.

Installing the Leader Worker Set Operator

You can install the Leader Worker Set Operator through the OpenShift Container Platform web console to begin managing distributed AI workloads.

Prerequisites
  • You have access to the cluster with cluster-admin privileges.

  • You have access to the OpenShift Container Platform web console.

  • You have installed the cert-manager Operator for Red Hat OpenShift.

Procedure
  1. Log in to the OpenShift Container Platform web console.

  2. Verify that the cert-manager Operator for Red Hat OpenShift is installed.

  3. Install the Leader Worker Set Operator.

    1. Navigate to Ecosystem → Software Catalog.

    2. Enter Leader Worker Set Operator into the filter box.

    3. Select the Leader Worker Set Operator and click Install.

    4. On the Install Operator page:

      1. Confirm that the Update channel is set to stable-v1.0, which installs the latest stable release of the Leader Worker Set Operator 1.0.

      2. Under Installation mode, select A specific namespace on the cluster.

      3. Under Installed Namespace, select Operator recommended Namespace: openshift-lws-operator.

      4. Under Update approval, select one of the following update strategies:

        • The Automatic strategy allows Operator Lifecycle Manager (OLM) to automatically update the Operator when a new version is available.

        • The Manual strategy requires a user with appropriate credentials to approve the Operator update.

      5. Click Install.

  4. Create the custom resource (CR) for the Leader Worker Set Operator:

    1. Navigate to Installed Operators → Leader Worker Set Operator.

    2. Under Provided APIs, click Create instance in the LeaderWorkerSetOperator pane.

    3. Click Create.
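Optionally, you can confirm the installation from the command line as well. A quick check, assuming the recommended openshift-lws-operator namespace from the procedure above:

```shell
# Check that the Operator's ClusterServiceVersion reports Succeeded
oc get csv -n openshift-lws-operator

# Check that the Operator pods are running
oc get pods -n openshift-lws-operator
```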

Deploying a leader worker set

You can use the Leader Worker Set Operator to deploy a leader worker set that manages distributed workloads across nodes.

Prerequisites
  • You have installed the Leader Worker Set Operator.

Procedure
  1. Create a new project by running the following command:

    $ oc new-project my-namespace
  2. Create a file named leader-worker-set.yaml with the following content:

    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: my-lws
      namespace: my-namespace
    spec:
      leaderWorkerTemplate:
        leaderTemplate:
          metadata: {}
          spec:
            containers:
            - image: nginxinc/nginx-unprivileged:1.27
              name: leader
              resources: {}
        restartPolicy: RecreateGroupOnPodRestart
        size: 3
        workerTemplate:
          metadata: {}
          spec:
            containers:
            - image: nginxinc/nginx-unprivileged:1.27
              name: worker
              ports:
              - containerPort: 8080
                protocol: TCP
              resources: {}
      networkConfig:
        subdomainPolicy: Shared
      replicas: 2
      rolloutStrategy:
        rollingUpdateConfiguration:
          maxSurge: 1
          maxUnavailable: 1
        type: RollingUpdate
      startupPolicy: LeaderCreated

    where:

    metadata.name

    Specifies the name of the leader worker set resource.

    metadata.namespace

    Specifies the namespace for the leader worker set to run in.

    spec.leaderWorkerTemplate.leaderTemplate

    Specifies the pod template for the leader pods.

    spec.leaderWorkerTemplate.restartPolicy

    Specifies the restart policy to apply when a pod in the group fails. Allowed values are RecreateGroupOnPodRestart, which restarts the whole group, and None, which does not restart the group.

    spec.leaderWorkerTemplate.size

    Specifies the number of pods to create for each group, including the leader pod. For example, a value of 3 creates 1 leader pod and 2 worker pods. The default value is 1.

    spec.leaderWorkerTemplate.workerTemplate

    Specifies the pod template for the worker pods.

    spec.networkConfig.subdomainPolicy

    Specifies the policy to use when creating the headless service. Allowed values are UniquePerReplica or Shared. The default value is Shared.

    spec.replicas

    Specifies the number of replicas, or leader-worker groups. The default value is 1.

    spec.rolloutStrategy.rollingUpdateConfiguration.maxSurge

    Specifies the maximum number of replicas that can be scheduled above the replicas value during rolling updates. The value can be specified as an integer or a percentage.

    For more information about all available configuration fields, see the upstream LeaderWorkerSet API documentation.
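As a quick sanity check on the example CR, the total pod count is the product of spec.replicas and spec.leaderWorkerTemplate.size, because each of the replicas groups contains one leader and size - 1 workers. A minimal sketch using the example values:

```shell
# Values from the example CR: 2 groups, 3 pods per group (1 leader + 2 workers)
replicas=2
size=3

total=$((replicas * size))          # all pods across all groups
workers=$((replicas * (size - 1)))  # worker pods only

echo "total pods: ${total}"     # 6
echo "worker pods: ${workers}"  # 4
```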

  3. Apply the leader worker set configuration by running the following command:

    $ oc apply -f leader-worker-set.yaml
Verification
  1. Verify that pods were created by running the following command:

    $ oc get pods -n my-namespace
    Example output
    NAME         READY   STATUS    RESTARTS   AGE
    my-lws-0     1/1     Running   0          4s
    my-lws-0-1   1/1     Running   0          3s
    my-lws-0-2   1/1     Running   0          3s
    my-lws-1     1/1     Running   0          7s
    my-lws-1-1   1/1     Running   0          6s
    my-lws-1-2   1/1     Running   0          6s
    • my-lws-0 is the leader pod for the first group.

    • my-lws-1 is the leader pod for the second group.
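The naming scheme visible in this output, <name>-<group> for leader pods and <name>-<group>-<index> for worker pods, can be reproduced for any replicas and size. The loop below is illustrative only, not part of the Operator:

```shell
name=my-lws
replicas=2
size=3

# Print the expected pod names for each leader-worker group
for group in $(seq 0 $((replicas - 1))); do
  echo "${name}-${group}"                 # leader pod for the group
  for idx in $(seq 1 $((size - 1))); do
    echo "${name}-${group}-${idx}"        # worker pods for the group
  done
done
```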

  2. Review the stateful sets by running the following command:

    $ oc get statefulsets -n my-namespace
    Example output
    NAME       READY   AGE
    my-lws     4/4     111s
    my-lws-0   2/2     57s
    my-lws-1   2/2     60s
    • my-lws is the leader stateful set for all leader-worker groups.

    • my-lws-0 is the worker stateful set for the first group.

    • my-lws-1 is the worker stateful set for the second group.
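Because the example CR sets subdomainPolicy: Shared, all pods in the leader worker set share one headless service named after the resource, which gives each pod a stable DNS name of the form <pod-name>.<lws-name>.<namespace>.svc.cluster.local. Assuming that naming and the worker port 8080 from the example CR, a hypothetical connectivity check from the first leader pod might look like:

```shell
# Reach the first worker of group 0 from its leader pod over the shared
# headless service (DNS name and port are taken from the example CR)
oc exec -n my-namespace my-lws-0 -- \
  curl -s http://my-lws-0-1.my-lws.my-namespace.svc.cluster.local:8080
```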