Managing workloads with the JobSet Operator
Use the JobSet Operator on OpenShift Container Platform to manage and run large-scale, coordinated workloads such as high-performance computing (HPC) and AI training. Features such as multi-template job support, stable pod-to-pod networking, and configurable failure policies help your workloads recover from failures and use cluster resources efficiently.
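For example, multi-template support means that a single JobSet groups several replicated Jobs, each with its own pod template. The following manifest is a minimal, hedged sketch; the replicated job names, image references, and replica counts are placeholders, and the procedures below show complete, working examples.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: example
spec:
  replicatedJobs:                 # each entry carries its own Job template
  - name: driver                  # placeholder name
    replicas: 1
    template:                     # standard batch/v1 Job spec
      spec:
        completions: 1
        parallelism: 1
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: driver
              image: registry.example.com/driver:latest   # placeholder image
  - name: workers                 # placeholder name
    replicas: 1
    template:
      spec:
        completions: 3
        parallelism: 3
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: registry.example.com/worker:latest   # placeholder image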
Deploying a JobSet
You can use the JobSet Operator to deploy a JobSet to manage and run large-scale, coordinated workloads.
Prerequisites

- You have installed the JobSet Operator.
- You have a cluster with available NVIDIA GPUs.

Procedure

- Create a new project by running the following command:

$ oc new-project <my_namespace>
- Create a file named jobset.yaml:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pytorch
spec:
  replicatedJobs:
  - name: workers
    template:
      spec:
        parallelism: <pods_running_number>
        completions: <pods_finish_number>
        backoffLimit: 0
        template:
          spec:
            imagePullSecrets:
            - name: my-registry-secret
            initContainers:
            - name: prepare
              image: docker.io/alpine/git:v2.52.0
              args: ['clone', 'https://github.com/pytorch/examples']
              volumeMounts:
              - name: workdir
                mountPath: /git
            containers:
            - name: pytorch
              image: docker.io/pytorch/pytorch:2.10.0-cuda13.0-cudnn9-runtime
              resources:
                limits:
                  nvidia.com/gpu: "1"
                requests:
                  nvidia.com/gpu: "1"
              ports:
              - containerPort: 4321
              env:
              - name: MASTER_ADDR
                value: "pytorch-workers-0-0.pytorch"
              - name: MASTER_PORT
                value: "4321"
              - name: RANK
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
              - name: PYTHONUNBUFFERED
                value: "0"
              command:
              - /bin/sh
              - -c
              - |
                cd examples/distributed/ddp-tutorial-series
                torchrun --nproc_per_node=1 --nnodes=3 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT multinode.py 1000 100
              volumeMounts:
              - name: workdir
                mountPath: /workspace
            volumes:
            - name: workdir
              emptyDir: {}

where:

<pods_running_number>
  Specifies the number of pods running at the same time.
<pods_finish_number>
  Specifies the total number of pods that must finish successfully for the job to be marked complete.
- Apply the JobSet configuration by running the following command:

$ oc apply -f jobset.yaml
Verification

- Verify that the pods were started by running the following command:

$ oc get pods -n <my_namespace>

Example output

NAME                        READY   STATUS    RESTARTS   AGE
pytorch-workers-0-0-2lzwt   1/1     Running   0          2m17s
pytorch-workers-0-1-g2lrv   1/1     Running   0          2m17s
pytorch-workers-0-2-dpljq   1/1     Running   0          2m17s
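You can also check the JobSet object and its child Jobs, or follow the training output from one of the workers. The pod name in the last command is taken from the example output above.

# Check the JobSet and the child Jobs it created
$ oc get jobsets -n <my_namespace>
$ oc get jobs -n <my_namespace>

# Follow the distributed training logs from one worker pod
$ oc logs -f pytorch-workers-0-0-2lzwt -n <my_namespace>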
Specifying a JobSet coordinator
To manage communication between JobSet pods, you can assign a specific JobSet coordinator pod. This ensures that your distributed workloads can reference a stable network endpoint as a central point of coordination for task synchronization and data exchange.
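The coordinator is selected by replicated job name, child job index, and pod index. The following snippet is an annotated sketch of that field, using the names from the example later in this procedure; the endpoint value shown in the comment is an assumption based on the JobSet stable-hostname pattern <jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<subdomain>.

spec:
  coordinator:
    replicatedJob: driver   # replicated job that contains the coordinator pod
    jobIndex: 0             # index of the child job within that replicated job
    podIndex: 0             # completion index of the pod within that child job
# Every pod in the JobSet then receives the coordinator endpoint through the
# jobset.sigs.k8s.io/coordinator label and annotation, for example:
# coordinator-driver-0-0.coordinator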
Prerequisites

- You have installed the JobSet Operator.

Procedure

- Create a new namespace by running the following command:

$ oc new-project <new_namespace>
- Create a YAML file called jobset-coordinator.yaml:

Example YAML file

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: coordinator
spec:
  coordinator:
    replicatedJob: driver
    jobIndex: 0
    podIndex: 0
  replicatedJobs:
  - name: workers
    template:
      spec:
        parallelism: <pods_running_number>
        completions: <pods_finish_number>
        backoffLimit: 0
        template:
          spec:
            containers:
            - name: worker
              env:
              - name: COORDINATOR_ENDPOINT
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['jobset.sigs.k8s.io/coordinator']
              image: quay.io/nginx/nginx-unprivileged:1.29-alpine
              command: [ "/bin/sh", "-c" ]
              args:
              - |
                while ! curl -s "${COORDINATOR_ENDPOINT}:8080" | grep Welcome; do
                  sleep 3
                done
                sleep 100
  - name: driver
    template:
      spec:
        parallelism: <pods_running_number>
        completions: <pods_finish_number>
        backoffLimit: 0
        template:
          spec:
            containers:
            - name: driver
              image: quay.io/nginx/nginx-unprivileged:1.29-alpine
              ports:
              - containerPort: 8080
                protocol: TCP

where:

<pods_running_number>
  Specifies the number of pods running at the same time.
<pods_finish_number>
  Specifies the total number of pods that must finish successfully for the job to be marked complete.
- Apply the jobset-coordinator.yaml file by running the following command:

$ oc apply -f jobset-coordinator.yaml
Verification

- Verify that the pods were created by running the following command:

$ oc get pods -n <new_namespace>

Example output

NAME                            READY   STATUS    RESTARTS   AGE
coordinator-driver-0-0-svgk7    1/1     Running   0          67s
coordinator-workers-0-0-57jvg   1/1     Running   0          67s
coordinator-workers-0-1-mghvx   1/1     Running   0          67s
coordinator-workers-0-2-7cnvv   1/1     Running   0          67s
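You can also confirm the endpoint that each pod received from the coordinator label. The -L flag adds a column showing the value of the jobset.sigs.k8s.io/coordinator label, which is the stable DNS name of the coordinator pod.

# Show the coordinator endpoint label on each pod
$ oc get pods -n <new_namespace> -L jobset.sigs.k8s.io/coordinator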
Failure policy configuration for JobSet Operator
To control workload behavior in response to child job failures, you can configure a JobSet failure policy. This enables you to define specific actions, such as restarting or failing the entire JobSet, based on the failure reason or the specific replicated job affected.
Failure policy actions
These actions are available when a job failure matches a defined rule.
| Action | Description |
|---|---|
| FailJobSet | Marks the entire JobSet as failed immediately. |
| RestartJobSet | Restarts the JobSet by recreating all child jobs. This action counts toward the maxRestarts limit. |
| RestartJobSetAndIgnoreMaxRestarts | Restarts the JobSet without counting toward the maxRestarts limit. |
Rule-targeting attributes
Use the following attributes to define failure rules.
| Attribute | Description |
|---|---|
| targetReplicatedJobs | Specifies which replicated jobs trigger the rule. |
| onJobFailureReasons | Triggers the rule based on the specific job failure reason. Valid values include BackoffLimitExceeded, DeadlineExceeded, and PodFailurePolicy. |
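You can combine both attributes in a single rule. The following snippet is a hedged sketch, separate from the full example below; it assumes a replicated job named workers and restarts the JobSet only when that job fails because its backoff limit is exceeded.

spec:
  failurePolicy:
    maxRestarts: 3
    rules:
    - action: RestartJobSet           # counts toward maxRestarts
      onJobFailureReasons:
      - BackoffLimitExceeded          # match only this job failure reason
      targetReplicatedJobs:
      - workers                       # match only failures from this replicated job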
Configuration example
This configuration marks the JobSet as failed if the leader job fails.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: failjobset-action-example
spec:
  failurePolicy:
    maxRestarts: 3
    rules:
    - action: FailJobSet
      targetReplicatedJobs:
      - leader
  replicatedJobs:
  - name: leader
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: leader
              image: docker.io/bash:latest
              command:
              - bash
              - -xc
              - |
                echo "JOB_COMPLETION_INDEX=$JOB_COMPLETION_INDEX"
                if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
                  for i in $(seq 10 -1 1)
                  do
                    echo "Sleeping in $i"
                    sleep 1
                  done
                  exit 1
                fi
                for i in $(seq 1 1000)
                do
                  echo "$i"
                  sleep 1
                done
  - name: workers
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: worker
              image: docker.io/bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 1000
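To see the failure policy take effect, you can watch the child jobs and inspect the JobSet status after the leader job exits with an error. The following commands are a hedged sketch; the file name is a placeholder for wherever you saved the example above.

# Apply the example configuration (placeholder file name)
$ oc apply -f failjobset-action-example.yaml

# Watch the child jobs; the leader job fails after its countdown
$ oc get jobs -w

# Inspect the JobSet status to confirm the policy marked it as failed
$ oc get jobset failjobset-action-example -o yaml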
Configuring volume claim policies for JobSet Operator
You can configure a JobSet to automatically create and manage shared persistent volume claims (PVCs) across multiple replicated jobs. This is useful for workloads that require shared access to datasets, models, or checkpoints.
Prerequisites

- You have the JobSet Operator installed in your cluster.
- You have set a default storage class or chosen a storage class for your workload.
Procedure

- Define the volume templates in the spec.volumeClaimPolicies section of your JobSet YAML file:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: <job_name>
spec:
  volumeClaimPolicies:
  - templates:
    - metadata:
        name: <persistent_volume_claim_name>
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: mystorageclass
        resources:
          requests:
            storage: 1Gi
    retentionPolicy:
      whenDeleted: Retain

where:

<job_name>
  Specifies a unique identifier for your JobSet within your namespace.
<persistent_volume_claim_name>
  Specifies the name of the PVC. The name used here is also used as the volumeMounts name. A volume is automatically added to the pod that mounts a PVC created with a name in the format <persistent_volume_claim_name>-<job_name>.
whenDeleted
  Specifies the retention policy for the PVCs when the JobSet is deleted. Optionally, you can keep data after the JobSet is deleted by setting this value to Retain.
- In your replicatedJobs configuration, add a volumeMount that matches the template name you defined:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: <job_name>
spec:
  replicatedJobs:
  - name: workers
    template:
      spec:
        parallelism: 2
        completions: 2
        backoffLimit: 0
        template:
          spec:
            imagePullSecrets:
            - name: my-registry-secret
            initContainers:
            - name: prepare
              image: docker.io/alpine/git:v2.52.0
              args: ['clone', 'https://github.com/pytorch/examples']
              volumeMounts:
              - name: <persistent_volume_claim_name>
                mountPath: /git/checkpoint
#...
- Apply the JobSet configuration by running the following command:

$ oc apply -f <jobset_yaml>
Verification

- Verify that the PVCs were created using the naming convention <claim_name>-<jobset_name> by running the following command:

$ oc get pvc

Example output

NAME    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc-1   Bound    pvc-385996a0-70af-4791-aa8e-9e6459e6b123   3Gi        RWO            file-storage   3d
pvc-2   Bound    pvc-8aeddd4d-aad5-4039-8d04-640a71c9a72d   12Gi       RWO            file-storage   3d
pvc-3   Bound    pvc-0050144d-940c-4c4e-a23a-2a660a5490eb   12Gi       RWO            file-storage   3d
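To confirm that a claim is attached to the workload, inspect one of the JobSet pods and check its Volumes and Mounts sections. The pod name below is a placeholder for any pod created by your JobSet.

# Check that the automatically created PVC is mounted into the pod
$ oc describe pod <jobset_pod_name>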