Managing distributed workloads with the Leader Worker Set Operator

You can use the Leader Worker Set Operator to manage distributed inference workloads and process large-scale inference requests efficiently.

Installing the Leader Worker Set Operator

You can install the Leader Worker Set Operator through the OpenShift Container Platform web console to begin managing distributed AI workloads.

Prerequisites
  • You have access to the cluster with cluster-admin privileges.

  • You have access to the OpenShift Container Platform web console.

  • You have installed the cert-manager Operator for Red Hat OpenShift.

Procedure
  1. Log in to the OpenShift Container Platform web console.

  2. Verify that the cert-manager Operator for Red Hat OpenShift is installed.

  3. Install the Leader Worker Set Operator.

    1. Navigate to Ecosystem → Software Catalog.

    2. Enter Leader Worker Set Operator into the filter box.

    3. Select the Leader Worker Set Operator and click Install.

    4. On the Install Operator page:

      1. Confirm that the Update channel is set to stable-v1.0, which installs the latest stable release of the Leader Worker Set Operator 1.0.

      2. Under Installation mode, select A specific namespace on the cluster.

      3. Under Installed Namespace, select Operator recommended Namespace: openshift-lws-operator.

      4. Under Update approval, select one of the following update strategies:

        • The Automatic strategy allows Operator Lifecycle Manager (OLM) to automatically update the Operator when a new version is available.

        • The Manual strategy requires a user with appropriate credentials to approve the Operator update.

      5. Click Install.

  4. Create the custom resource (CR) for the Leader Worker Set Operator:

    1. Navigate to Installed Operators → Leader Worker Set Operator.

    2. Under Provided APIs, click Create instance in the LeaderWorkerSetOperator pane.

    3. Click Create.
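Optionally, you can confirm the installation from the command line as well. A quick check, assuming the recommended openshift-lws-operator namespace from the procedure above:

```shell
# Check that the Operator's ClusterServiceVersion reports Succeeded
oc get csv -n openshift-lws-operator

# Check that the Operator pods are running
oc get pods -n openshift-lws-operator
```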

Deploying a leader worker set

You can use the Leader Worker Set Operator to deploy a leader worker set that manages distributed workloads across nodes.

Prerequisites
  • You have installed the Leader Worker Set Operator.

Procedure
  1. Create a new project by running the following command:

    $ oc new-project my-namespace
  2. Create a file named leader-worker-set.yaml with the following content:

    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: my-lws
      namespace: my-namespace
    spec:
      leaderWorkerTemplate:
        leaderTemplate:
          metadata: {}
          spec:
            containers:
            - image: nginxinc/nginx-unprivileged:1.27
              name: leader
              resources: {}
        restartPolicy: RecreateGroupOnPodRestart
        size: 3
        workerTemplate:
          metadata: {}
          spec:
            containers:
            - image: nginxinc/nginx-unprivileged:1.27
              name: worker
              ports:
              - containerPort: 8080
                protocol: TCP
              resources: {}
      networkConfig:
        subdomainPolicy: Shared
      replicas: 2
      rolloutStrategy:
        rollingUpdateConfiguration:
          maxSurge: 1
          maxUnavailable: 1
        type: RollingUpdate
      startupPolicy: LeaderCreated

    where:

    metadata.name

    Specifies the name of the leader worker set resource.

    metadata.namespace

    Specifies the namespace for the leader worker set to run in.

    spec.leaderWorkerTemplate.leaderTemplate

    Specifies the pod template for the leader pods.

    spec.leaderWorkerTemplate.restartPolicy

    Specifies the restart policy to apply when a pod in the group fails. Allowed values are RecreateGroupOnPodRestart, which restarts the whole group, and None, which does not restart the group.

    spec.leaderWorkerTemplate.size

    Specifies the number of pods to create for each group, including the leader pod. For example, a value of 3 creates 1 leader pod and 2 worker pods. The default value is 1.

    spec.leaderWorkerTemplate.workerTemplate

    Specifies the pod template for the worker pods.

    spec.networkConfig.subdomainPolicy

    Specifies the policy to use when creating the headless service. Allowed values are UniquePerReplica or Shared. The default value is Shared.

    spec.replicas

    Specifies the number of replicas, or leader-worker groups. The default value is 1.

    spec.rolloutStrategy.rollingUpdateConfiguration.maxSurge

    Specifies the maximum number of replicas that can be scheduled above the replicas value during rolling updates. The value can be specified as an integer or a percentage.

    For more information about all available configuration fields, see the upstream LeaderWorkerSet API documentation.
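As a quick sanity check on the example CR, the total pod count is the product of spec.replicas and spec.leaderWorkerTemplate.size, because each of the replicas groups contains one leader and size - 1 workers. A minimal sketch using the example values:

```shell
# Values from the example CR: 2 groups, 3 pods per group (1 leader + 2 workers)
replicas=2
size=3

total=$((replicas * size))          # all pods across all groups
workers=$((replicas * (size - 1)))  # worker pods only

echo "total pods: ${total}"     # 6
echo "worker pods: ${workers}"  # 4
```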

  3. Apply the leader worker set configuration by running the following command:

    $ oc apply -f leader-worker-set.yaml
Verification
  1. Verify that pods were created by running the following command:

    $ oc get pods -n my-namespace
    Example output
    NAME         READY   STATUS    RESTARTS   AGE
    my-lws-0     1/1     Running   0          4s
    my-lws-0-1   1/1     Running   0          3s
    my-lws-0-2   1/1     Running   0          3s
    my-lws-1     1/1     Running   0          7s
    my-lws-1-1   1/1     Running   0          6s
    my-lws-1-2   1/1     Running   0          6s
    • my-lws-0 is the leader pod for the first group.

    • my-lws-1 is the leader pod for the second group.
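The naming scheme visible in this output, <name>-<group> for leader pods and <name>-<group>-<index> for worker pods, can be reproduced for any replicas and size. The loop below is illustrative only, not part of the Operator:

```shell
name=my-lws
replicas=2
size=3

# Print the expected pod names for each leader-worker group
for group in $(seq 0 $((replicas - 1))); do
  echo "${name}-${group}"                 # leader pod for the group
  for idx in $(seq 1 $((size - 1))); do
    echo "${name}-${group}-${idx}"        # worker pods for the group
  done
done
```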

  2. Review the stateful sets by running the following command:

    $ oc get statefulsets -n my-namespace
    Example output
    NAME       READY   AGE
    my-lws     4/4     111s
    my-lws-0   2/2     57s
    my-lws-1   2/2     60s
    • my-lws is the leader stateful set for all leader-worker groups.

    • my-lws-0 is the worker stateful set for the first group.

    • my-lws-1 is the worker stateful set for the second group.
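Because the example CR sets subdomainPolicy: Shared, all pods in the leader worker set share one headless service named after the resource, which gives each pod a stable DNS name of the form <pod-name>.<lws-name>.<namespace>.svc.cluster.local. Assuming that naming and the worker port 8080 from the example CR, a hypothetical connectivity check from the first leader pod might look like:

```shell
# Reach the first worker of group 0 from its leader pod over the shared
# headless service (DNS name and port are taken from the example CR)
oc exec -n my-namespace my-lws-0 -- \
  curl -s http://my-lws-0-1.my-lws.my-namespace.svc.cluster.local:8080
```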