Job Manager and job API
The Job Manager, also known as the "job-scheduler", is a web API service that you use to create, delete and monitor the state of jobs.
Radix creates one job-scheduler per job defined in radixconfig.yaml. Each job-scheduler listens on the port defined by schedulerPort, with a host name equal to the name of the job. The job-scheduler API can only be accessed by components running in the same environment and is not exposed to the Internet. No authentication is required.
The Job Manager exposes the following methods for managing jobs:
- GET /api/v1/jobs: Get states (with names and statuses) for all jobs
- GET /api/v1/jobs/{jobName}: Get state for a named job
- DELETE /api/v1/jobs/{jobName}: Delete a named job
- POST /api/v1/jobs/{jobName}/stop: Stop a named job
... and the following methods for managing batches:
- GET /api/v1/batches: Get states (with names and statuses) for all batches
- GET /api/v1/batches/{batchName}: Get state for a named batch and statuses of its jobs
- DELETE /api/v1/batches/{batchName}: Delete a named batch
- POST /api/v1/batches/{batchName}/stop: Stop a named batch
- POST /api/v1/batches/{batchName}/jobs/{jobName}/stop: Stop a named job of a batch
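All of these endpoints are relative to the job-scheduler's address, which is derived from the job name and schedulerPort. A minimal sketch of assembling the URLs in Python follows. The function names are hypothetical and only illustrate the pattern, and the job name compute and port 8000 are example values.

```python
from urllib.parse import quote


def scheduler_base_url(job_name: str, scheduler_port: int) -> str:
    """Base URL of a job-scheduler: the host is the job name, the port is schedulerPort."""
    return f"http://{job_name}:{scheduler_port}/api/v1"


def job_url(base_url: str, job_name: str = "") -> str:
    """URL of the job collection, or of one named job when job_name is given."""
    if job_name:
        return f"{base_url}/jobs/{quote(job_name, safe='')}"
    return f"{base_url}/jobs"


def batch_url(base_url: str, batch_name: str = "") -> str:
    """URL of the batch collection, or of one named batch when batch_name is given."""
    if batch_name:
        return f"{base_url}/batches/{quote(batch_name, safe='')}"
    return f"{base_url}/batches"


base = scheduler_base_url("compute", 8000)
print(job_url(base))
print(batch_url(base, "my-batch") + "/stop")
```

Appending /stop to a named job or batch URL gives the corresponding stop endpoint.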
Create a single job
- POST /api/v1/jobs: Create a new job using the Docker image that Radix built for the job. Job-specific arguments can be sent in the request body:
{
  "payload": "Sk9CX1BBUkFNMTogeHl6Cg==",
  "jobId": "my-job-1",
  "imageTagName": "1.0.0",
  "timeLimitSeconds": 120,
  "backoffLimit": 10,
  "failurePolicy": {
    "rules": [
      {
        "action": "FailJob",
        "onExitCodes": {
          "operator": "In",
          "values": [
            42
          ]
        }
      }
    ]
  },
  "resources": {
    "limits": {
      "memory": "32Mi",
      "cpu": "300m"
    },
    "requests": {
      "memory": "16Mi",
      "cpu": "150m"
    }
  },
  "runtime": {
    "nodeType": "memory-optimized-2-v1"
  },
  "variables": {
    "INPUT_FILE_NAME": "chart-2025-07-15.json",
    "OUTPUT_FILE_NAME": "result-2025-07-15.json",
    "TRAINING_EPOCHS": "10"
  },
  "command": ["./run.sh"],
  "args": ["--input", "/data/input.json", "--output", "/data/output.json"]
}
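The payload value in the example above looks like a Base64 string. Assuming that is the intent (the API itself treats payload as an opaque string), the following Python sketch shows what it decodes to.

```python
import base64

# The payload string copied from the request body above.
payload = "Sk9CX1BBUkFNMTogeHl6Cg=="

# Assumption: the value is Base64-encoded UTF-8 text.
decoded = base64.b64decode(payload).decode("utf-8")
print(repr(decoded))  # 'JOB_PARAM1: xyz\n'
```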
Parameters
payload, jobId, image, imageTagName, timeLimitSeconds, backoffLimit, failurePolicy, resources, runtime, variables, command and args are all optional fields; any of them can be omitted from the request.
image
The image field overrides the image used for a specific job.
imageTagName
The imageTagName field replaces the image tag for a specific job. It is not necessary to configure {imageTagName} in radixconfig.yaml to use it.
variables
The variables field adds variables for a specific job, or overrides variables configured for the job component. It can be used instead of payload to pass arguments to the job.
command
The command field sets or overrides the ENTRYPOINT directive array of a Docker image. It also overrides the job component's command, if one exists. Read more about command
When command is set and the Dockerfile used by the job component has a CMD directive (holding a shell command, or arguments to the command defined in ENTRYPOINT), that CMD directive is ignored.
When command is set to an empty array [], it suppresses any command defined on the job component or at its environmentConfig level; the ENTRYPOINT directive in the Dockerfile is used if defined.
args
The args field sets or overrides the CMD directive array of a Docker image. It also overrides the job component's args, if they exist. Read more about args
When args is set to an empty array [], it suppresses any args defined on the job component or at its environmentConfig level; the CMD directive in the Dockerfile is used if defined.
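The precedence rules for command and args can be modelled with a small function. This is only a sketch of the rules as described in this section, not Radix's implementation; the function name and its parameters are hypothetical. It treats each field on its own and does not model the rule that a Dockerfile CMD is ignored when command is set.

```python
from typing import Optional, Sequence


def effective_value(
    request_value: Optional[Sequence[str]],
    component_value: Optional[Sequence[str]],
    dockerfile_value: Optional[Sequence[str]],
) -> Optional[list]:
    """Pick the value used for command (or args) for one job.

    None means "not set". An empty list in the request suppresses the
    job-component value, so the Dockerfile directive (if any) applies.
    """
    if request_value is None:
        # Field omitted in the request: fall back to the job-component value.
        chosen = component_value
    elif len(request_value) == 0:
        # Explicit []: ignore the job-component value.
        chosen = None
    else:
        # Non-empty value in the request wins.
        chosen = request_value
    if chosen:
        return list(chosen)
    # Nothing chosen: the Dockerfile ENTRYPOINT (or CMD) is used if defined.
    return list(dockerfile_value) if dockerfile_value else None


print(effective_value(["./run.sh"], ["./default.sh"], ["/entrypoint"]))
print(effective_value([], ["./default.sh"], ["/entrypoint"]))
print(effective_value(None, ["./default.sh"], ["/entrypoint"]))
```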
Create a batch of jobs
- POST /api/v1/batches: Create a new batch of single jobs using the Docker image that Radix built for the job component. Job-specific arguments can be sent in the request body, specified individually for each item in jobScheduleDescriptions, with default values defined in defaultRadixJobComponentConfig:
{
  "batchId": "random-batch-id-123",
  "defaultRadixJobComponentConfig": {
    "imageTagName": "1.0.0",
    "timeLimitSeconds": 200,
    "backoffLimit": 5,
    "resources": {
      "limits": {
        "memory": "200Mi",
        "cpu": "200m"
      },
      "requests": {
        "memory": "100Mi",
        "cpu": "100m"
      }
    },
    "runtime": {
      "architecture": "amd64"
    },
    "variables": {
      "TRAINING_EPOCHS": "5"
    }
  },
  "jobScheduleDescriptions": [
    {
      "payload": "{'data':'value1'}",
      "jobId": "my-job-1",
      "imageTagName": "1.0.0",
      "timeLimitSeconds": 120,
      "backoffLimit": 10,
      "resources": {
        "limits": {
          "memory": "32Mi",
          "cpu": "300m"
        },
        "requests": {
          "memory": "16Mi",
          "cpu": "150m"
        }
      },
      "runtime": {
        "nodeType": "memory-optimized-2-v1"
      },
      "variables": {
        "INPUT_FILE_NAME": "chart-2025-07-15.json",
        "OUTPUT_FILE_NAME": "result-2025-07-15.json"
      }
    },
    {
      "payload": "{'data':'value2'}",
      "jobId": "my-job-2",
      ...
      "variables": {
        "INPUT_FILE_NAME": "chart-2025-07-16.json",
        "OUTPUT_FILE_NAME": "result-2025-07-16.json",
        "TRAINING_EPOCHS": "10"
      }
    },
    {
      "payload": "{'data':'value3'}",
      ...
      "variables": {
        "INPUT_FILE_NAME": "chart-2025-07-17.json",
        "OUTPUT_FILE_NAME": "result-2025-07-17.json"
      }
    }
  ]
}
Parameters
Parameters are the same as described in the Create a single job section, with the following differences:
- Parameters can be defined both in defaultRadixJobComponentConfig and in individual jobScheduleDescriptions items, so each job can have its own configuration.
- A parameter defined in a jobScheduleDescriptions item overrides the same parameter in defaultRadixJobComponentConfig and on the job component or its environmentConfig level.
- variables defined in defaultRadixJobComponentConfig and/or in jobScheduleDescriptions items are combined, and they add to or override the variables configured for the job component.
- When the final command, after combining a jobScheduleDescriptions item with defaultRadixJobComponentConfig, is an empty array [], it suppresses, for this batch or the specific job, any command defined on the job component or at its environmentConfig level; the ENTRYPOINT directive in the Dockerfile is used if defined.
- When the final args, after combining a jobScheduleDescriptions item with defaultRadixJobComponentConfig, is an empty array [], it suppresses, for this batch or the specific job, any args defined on the job component or at its environmentConfig level; the CMD directive in the Dockerfile is used if defined.
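The override rules in the list above can be illustrated with a short Python sketch. It is a model of the described behaviour, not the job-scheduler's code: the function name is hypothetical, and it assumes that fields other than variables (for example resources) are replaced as a whole rather than merged key by key.

```python
import copy


def merge_job_config(defaults: dict, item: dict) -> dict:
    """Combine defaultRadixJobComponentConfig with one jobScheduleDescriptions item.

    Values from the item win over the defaults; variables are merged key by key.
    """
    merged = copy.deepcopy(defaults)
    for key, value in item.items():
        if key == "variables":
            # Variables from both levels are combined; the item wins on conflicts.
            merged["variables"] = {**merged.get("variables", {}), **value}
        else:
            # Assumption: any other field replaces the default as a whole.
            merged[key] = copy.deepcopy(value)
    return merged


defaults = {
    "timeLimitSeconds": 200,
    "backoffLimit": 5,
    "variables": {"TRAINING_EPOCHS": "5"},
}
item = {
    "timeLimitSeconds": 120,
    "variables": {"INPUT_FILE_NAME": "chart-2025-07-15.json"},
}
merged = merge_job_config(defaults, item)
print(merged)
```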
Starting a new job
The example configuration at the top has a component named backend and two jobs, compute and etl. Radix creates two job-schedulers, one for each job. The job-scheduler for compute listens on http://compute:8000, and the job-scheduler for etl listens on http://etl:9000.
To start a new single job, send a POST request to http://compute:8000/api/v1/jobs with the request body set to:
{
  "payload": "{\"x\": 10, \"y\": 20}"
}
The job-scheduler creates a new job and mounts the payload from the request body to a file named payload in the directory /compute/args.
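Inside the job container, the job's own code can then read that file. A minimal Python sketch, assuming the payload is JSON as in this example and the mount directory is /compute/args; the function name is hypothetical.

```python
import json
from pathlib import Path


def read_payload(payload_dir: str = "/compute/args") -> dict:
    """Read the file named payload from the mount directory and parse it as JSON."""
    text = Path(payload_dir, "payload").read_text(encoding="utf-8")
    return json.loads(text)
```

A job would call read_payload() once at start-up and use the returned values, here x and y, as its input.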
Once the job has been created successfully, the job-scheduler responds to backend with a job state object:
{
  "name": "batch-compute-20230220101417-idwsxncs-rkwaibwe",
  "started": "",
  "ended": "",
  "status": "Running"
}
- name is the unique name for the job. This is the value to use in the GET /api/v1/jobs/{jobName} and DELETE /api/v1/jobs/{jobName} methods. It is also the host name for connecting to the running job's container on its exposed port, e.g. http://batch-compute-20230220100755-xkoxce5g-mll3kxxh:3000
- started is the date and time the job was started, in RFC3339 format and in UTC.
- ended is the date and time the job successfully ended, also in RFC3339 format and in UTC. This value is only set for Succeeded jobs.
- status is the current status of the job. Possible values are Waiting, Stopping, Stopped, Active, Running, Succeeded and Failed. Active means that the job has a replica created, but the replica is not ready (for example because a volume mount is not ready, or the replica cannot be scheduled on a node due to insufficient memory); a job can remain in this status indefinitely. The status is Failed if the job's replica container exits with a non-zero exit code, and Succeeded if the exit code is zero.
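A caller that polls for job state usually needs to know when to stop polling. The sketch below checks the status field of a job state object. The function name is hypothetical, and treating Succeeded, Failed and Stopped as the final statuses is an assumption based on the descriptions above, not something the API states explicitly.

```python
# Statuses listed in the documentation above.
KNOWN_STATUSES = {
    "Waiting", "Stopping", "Stopped", "Active", "Running", "Succeeded", "Failed",
}
# Assumption: a job in one of these statuses will not change any further.
FINISHED_STATUSES = {"Succeeded", "Failed", "Stopped"}


def is_finished(job_state: dict) -> bool:
    """Return True when the job state object carries a final status."""
    status = job_state.get("status", "")
    if status not in KNOWN_STATUSES:
        raise ValueError(f"unexpected job status: {status!r}")
    return status in FINISHED_STATUSES


print(is_finished({"name": "example-job", "status": "Running"}))
```

Because an Active job can stay in that status indefinitely, a polling loop built on this check should also have its own timeout.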
Getting the status of all existing jobs
Get a list of all single jobs with their states by sending a GET request to http://compute:8000/api/v1/jobs. The response is an array of job state objects, similar to the response received when creating a new job. Jobs that have been started within a batch are not included in this list:
[
  {
    "name": "batch-compute-20230220100755-xkoxce5g-mll3kxxh",
    "started": "2021-04-07T09:08:37Z",
    "ended": "2021-04-07T09:08:45Z",
    "status": "Succeeded"
  },
  {
    "name": "batch-compute-20230220101417-idwsxncs-rkwaibwe",
    "started": "2021-04-07T10:55:56Z",
    "ended": "",
    "status": "Failed"
  }
]
To get the state of a specific job (single or one within a batch), e.g. batch-compute-20230220100755-xkoxce5g-mll3kxxh, send a GET request to http://compute:8000/api/v1/jobs/batch-compute-20230220100755-xkoxce5g-mll3kxxh. The response is a single job state object:
{
  "name": "batch-compute-20230220100755-xkoxce5g-mll3kxxh",
  "started": "2021-04-07T09:08:37Z",
  "ended": "2021-04-07T09:08:45Z",
  "status": "Succeeded"
}
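The started and ended timestamps can be used to compute how long a job ran. A minimal Python sketch with hypothetical function names; it handles the timestamp forms shown in these examples and returns None when either value is empty.

```python
from datetime import datetime
from typing import Optional


def parse_rfc3339(value: str) -> Optional[datetime]:
    """Parse a timestamp such as 2021-04-07T09:08:37Z; an empty string gives None."""
    if not value:
        return None
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so it is rewritten as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def run_seconds(job_state: dict) -> Optional[float]:
    """Seconds between started and ended, or None if the job has not ended."""
    started = parse_rfc3339(job_state.get("started", ""))
    ended = parse_rfc3339(job_state.get("ended", ""))
    if started is None or ended is None:
        return None
    return (ended - started).total_seconds()


state = {
    "name": "batch-compute-20230220100755-xkoxce5g-mll3kxxh",
    "started": "2021-04-07T09:08:37Z",
    "ended": "2021-04-07T09:08:45Z",
    "status": "Succeeded",
}
print(run_seconds(state))  # 8.0
```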
Deleting an existing job
The job list in the example above has a job named batch-compute-20230220101417-idwsxncs-rkwaibwe. To delete it, send a DELETE request to http://compute:8000/api/v1/jobs/batch-compute-20230220101417-idwsxncs-rkwaibwe. A successful deletion responds with a result object. Only single jobs can be deleted with this method:
{
  "status": "Success",
  "message": "job batch-compute-20230220101417-idwsxncs-rkwaibwe successfully deleted",
  "code": 200
}
Stop a job
The job list in the example above has a job named batch-compute-20230220100755-xkoxce5g-mll3kxxh. To stop it, send a POST request to http://compute:8000/api/v1/jobs/batch-compute-20230220100755-xkoxce5g-mll3kxxh/stop. A successful stop responds with a result object. Only single jobs can be stopped with this method. Stopping a job automatically deletes the corresponding Kubernetes job and its replica, as well as its log. The job gets the status "Stopped".
{
  "status": "Success",
  "message": "job batch-compute-20230220100755-xkoxce5g-mll3kxxh successfully stopped",
  "code": 200
}
{
  "status": "Success",
  "message": "job batch-compute-20230220101417-idwsxncs-rkwaibwe successfully stopped",
  "code": 200
}
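The delete and stop calls differ only in HTTP method and path. The following Python sketch builds the requests with the standard library without sending them, and checks a result object of the shape shown above. The function names are hypothetical.

```python
import urllib.request


def build_job_action_request(base_url: str, job_name: str, action: str) -> urllib.request.Request:
    """Build (but do not send) a request that deletes or stops a single job."""
    if action == "delete":
        return urllib.request.Request(f"{base_url}/jobs/{job_name}", method="DELETE")
    if action == "stop":
        return urllib.request.Request(f"{base_url}/jobs/{job_name}/stop", method="POST")
    raise ValueError(f"unsupported action: {action!r}")


def is_success(result: dict) -> bool:
    """Check a result object: status must be Success and code must be 200."""
    return result.get("status") == "Success" and result.get("code") == 200


req = build_job_action_request(
    "http://compute:8000/api/v1",
    "batch-compute-20230220100755-xkoxce5g-mll3kxxh",
    "stop",
)
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen from a component in the same environment would send it.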
Starting a new batch of jobs
To start a new batch of jobs, send a POST request to http://compute:8000/api/v1/batches with the request body set to:
{
  "jobScheduleDescriptions": [
    {
      "payload": "{\"x\": 10, \"y\": 20}"
    },
    {
      "payload": "{\"x\": 20, \"y\": 30}"
    }
  ]
}
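A request body like the one above can be generated from a list of inputs. The Python sketch below builds the POST request without sending it. The function name is hypothetical, and the Content-Type header is an assumption, since this page does not say which headers the job-scheduler expects.

```python
import json
import urllib.request
from typing import Iterable


def build_batch_request(base_url: str, payloads: Iterable[dict]) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates one job per payload."""
    body = {
        "jobScheduleDescriptions": [
            # Each payload is sent as a string, here serialized JSON.
            {"payload": json.dumps(p)}
            for p in payloads
        ]
    }
    return urllib.request.Request(
        f"{base_url}/batches",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


req = build_batch_request(
    "http://compute:8000/api/v1",
    [{"x": 10, "y": 20}, {"x": 20, "y": 30}],
)
print(req.data.decode("utf-8"))
```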
Batch ID
A batch can have a batchId, an optional string that can hold any value. Radix does not process it. It can be set in the batchScheduleDescription (the JSON request body) for a batch.
If batchId is specified, it is returned in the batch status and shown in the batch list in the Radix console.
Job ID
A job can have a jobId, an optional string that can hold any value. Radix does not process it. It can be set in a jobScheduleDescription for a single job or for the jobs of a batch.
If jobId is specified, it is returned in the job's status and shown in the job list in the Radix console.
Job ID in a single job
{
  "jobId": "my-job",
  "payload": "{\"x\": 10, \"y\": 20}"
}
Job ID in the batch jobs
{
  "jobScheduleDescriptions": [
    {
      "jobId": "my-job-1",
      "payload": "{\"x\": 10, \"y\": 20}"
    },
    {
      "jobId": "my-job-2",
      "payload": "{\"x\": 20, \"y\": 30}"
    }
  ]
}
Default parameters for jobs can be defined within defaultRadixJobComponentConfig. These parameters can be overridden for each job individually in jobScheduleDescriptions:
{
  "defaultRadixJobComponentConfig": {
    "imageTagName": "1.0.0",
    "timeLimitSeconds": 200,
    "backoffLimit": 5,
    "resources": {
      "limits": {
        "memory": "200Mi",
        "cpu": "200m"
      },
      "requests": {
        "memory": "100Mi",
        "cpu": "100m"
      }
    },
    "command": ["./run.sh"]
  },
  "jobScheduleDescriptions": [
    {
      "payload": "{'data':'value1'}",
      "timeLimitSeconds": 120,
      "backoffLimit": 2,
      "resources": {
        "limits": {
          "memory": "32Mi",
          "cpu": "300m"
        },
        "requests": {
          "memory": "16Mi",
          "cpu": "150m"
        }
      },
      "runtime": {
        "nodeType": "memory-optimized-2-v1"
      },
      "args": ["--input", "/data/input-2025-07-16.json", "--output", "/data/output-2025-07-16.json"]
    },
    {
      "payload": "{'data':'value2'}",
      "imageTagName": "2.0.0"
    },
    {
      "payload": "{'data':'value3'}",
      "timeLimitSeconds": 300,
      "backoffLimit": 10,
      "runtime": {},
      "command": ["./calculate.sh", "--epochs", "10"],
      "args": ["--input", "/data/input-ml.json", "--output", "/data/output-ml.json"]
    }
  ]
}
The job-scheduler creates a new batch, which in turn creates a single job for each item in jobScheduleDescriptions.
Once the batch has been created, the job-scheduler responds to backend with a batch state object:
{
  "batchName": "batch-compute-20220302170647-6ytkltvk",
  "name": "batch-compute-20220302170647-6ytkltvk-tlugvgs",
  "created": "2022-03-02T17:06:47+01:00",
  "status": "Running"
}
- batchName is the unique name for the batch. This is the value to use in the GET /api/v1/batches/{batchName} and DELETE /api/v1/batches/{batchName} methods.
- started is the date and time the batch was started, in RFC3339 format and in UTC.
- ended is the date and time the batch successfully ended (empty when not completed), in RFC3339 format and in UTC. This value is only set for Succeeded batches. A batch has ended when all of its jobs have completed or failed.
- status is the current status of the batch. Possible values are Running, Succeeded and Failed. The status is Failed if the batch fails for any reason.
Get a list of all batches
Get a list of all batches with their states by sending a GET request to http://compute:8000/api/v1/batches. The response is an array of batch state objects, similar to the response received when creating a new batch:
[
  {
    "name": "batch-compute-20220302155333-hrwl53mw",
    "created": "2022-03-02T15:53:33+01:00",
    "started": "2022-03-02T15:53:33+01:00",
    "ended": "2022-03-02T15:54:00+01:00",
    "status": "Succeeded"
  },
  {
    "name": "batch-compute-20220302170647-6ytkltvk",
    "created": "2022-03-02T17:06:47+01:00",
    "started": "2022-03-02T17:06:47+01:00",
    "status": "Running"
  }
]
Get a state of a batch
To get the state of a specific batch, e.g. batch-compute-20220302155333-hrwl53mw, send a GET request to http://compute:8000/api/v1/batches/batch-compute-20220302155333-hrwl53mw. The response is a batch state object that includes the states of its jobs and the statuses of their replicas (pods).
{
  "name": "batch-compute-20220302155333-hrwl53mw",
  "created": "2022-03-02T15:53:33+01:00",
  "started": "2022-03-02T15:53:33+01:00",
  "ended": "2022-03-02T15:54:00+01:00",
  "status": "Succeeded",
  "updated": "2022-03-02T15:54:00+01:00",
  "jobStatuses": [
    {
      "jobId": "job1",
      "batchName": "batch-compute-20220302155333-hrwl53mw",
      "name": "batch-compute-20220302155333-hrwl53mw-fjhcqwj7",
      "created": "2022-03-02T15:53:36+01:00",
      "started": "2022-03-02T15:53:36+01:00",
      "ended": "2022-03-02T15:53:56+01:00",
      "status": "Succeeded",
      "updated": "2022-03-02T15:53:56+01:00",
      "podStatuses": [
        {
          "name": "batch-compute-20220302155333-hrwl53mw-fjhcqwj7-5sfnl",
          "created": "2022-03-02T15:53:36Z",
          "startTime": "2022-03-02T15:53:36Z",
          "endTime": "2022-03-02T15:53:56Z",
          "containerStarted": "2022-03-02T15:53:36Z",
          "replicaStatus": {
            "status": "Succeeded"
          },
          "image": "radixprod.azurecr.io/radix-app-dev-compute:6k8vv",
          "imageId": "radixprod.azurecr.io/radix-app-dev-compute@sha256:1f9ce890db8eb89ae0369995f76676a58af2a82129fc0babe080a5daca86a44e",
          "exitCode": 0,
          "reason": "Completed"
        }
      ]
    },
    {
      "jobId": "job2",
      "batchName": "batch-compute-20220302155333-hrwl53mw",
      "name": "batch-compute-20220302155333-hrwl53mw-qjzykhrd",
      "created": "2022-03-02T15:53:39+01:00",
      "started": "2022-03-02T15:53:39+01:00",
      "ended": "2022-03-02T15:53:56+01:00",
      "status": "Succeeded",
      "updated": "2022-03-02T15:53:56+01:00",
      "podStatuses": [
        {
          "name": "batch-compute-20220302155333-hrwl53mw-qjzykhrd-5sfnl",
          "created": "2022-03-02T15:53:39Z",
          "startTime": "2022-03-02T15:53:40Z",
          "endTime": "2022-03-02T15:53:56Z",
          "containerStarted": "2022-03-02T15:53:40Z",
          "replicaStatus": {
            "status": "Succeeded"
          },
          "image": "radixprod.azurecr.io/radix-app-dev-compute:6k8vv",
          "imageId": "radixprod.azurecr.io/radix-app-dev-compute@sha256:1f9ce890db8eb89ae0369995f76676a58af2a82129fc0babe080a5daca86a44e",
          "exitCode": 0,
          "reason": "Completed"
        }
      ]
    }
  ]
}
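When a batch has many jobs, it is often enough to reduce the batch state to one status per job. A minimal Python sketch with a hypothetical function name; it assumes the response shape shown above and falls back to the job name when no jobId was given.

```python
def job_status_by_id(batch_state: dict) -> dict:
    """Map each job's jobId (or its name when jobId is missing) to its status."""
    return {
        job.get("jobId") or job["name"]: job["status"]
        for job in batch_state.get("jobStatuses", [])
    }


# Trimmed-down version of the batch state object above.
batch_state = {
    "name": "batch-compute-20220302155333-hrwl53mw",
    "status": "Succeeded",
    "jobStatuses": [
        {"jobId": "job1", "name": "batch-compute-20220302155333-hrwl53mw-fjhcqwj7", "status": "Succeeded"},
        {"jobId": "job2", "name": "batch-compute-20220302155333-hrwl53mw-qjzykhrd", "status": "Succeeded"},
    ],
}
print(job_status_by_id(batch_state))
```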
If a job's replica failed and the job component has a backoffLimit greater than 0, podStatuses contains the exitCode and reason for each failed pod. podIndex gives the order of the pod statuses (starting from 0):
{
  "name": "batch-compute-20220302155333-hrwl53mw",
  "created": "2022-03-02T15:53:33+01:00",
  "started": "2022-03-02T15:53:33+01:00",
  "ended": "2022-03-02T15:53:48+01:00",
  "status": "Failed",
  "updated": "2022-03-02T15:53:48+01:00",
  "jobStatuses": [
    {
      "jobId": "job1",
      "batchName": "batch-compute-20220302155333-hrwl53mw",
      "name": "batch-compute-20220302155333-hrwl53mw-fjhcqwj7",
      "created": "2022-03-02T15:53:36+01:00",
      "started": "2022-03-02T15:53:36+01:00",
      "ended": "2022-03-02T15:53:56+01:00",
      "status": "Failed",
      "message": "Job has reached the specified backoff limit",
      "updated": "2022-03-02T15:53:56+01:00",
      "podStatuses": [
        {
          "name": "batch-compute-20220302155333-hrwl53mw-fjhcqwj7-wbn9q",
          "created": "2022-03-02T15:53:36Z",
          "startTime": "2022-03-02T15:53:36Z",
          "endTime": "2022-03-02T15:53:40Z",
          "containerStarted": "2022-03-02T15:53:36Z",
          "replicaStatus": {
            "status": "Failed"
          },
          "image": "radixprod.azurecr.io/radix-app-dev-compute:6k8vv",
          "imageId": "radixprod.azurecr.io/radix-app-dev-compute@sha256:1f9ce890db8eb89ae0369995f76676a58af2a82129fc0babe080a5daca86a44e",
          "exitCode": 1,
          "reason": "Error"
        },
        {
          "name": "batch-compute-20220302155333-hrwl53mw-fjhcqwj7-859xq",
          "created": "2022-03-02T15:53:40Z",
          "startTime": "2022-03-02T15:53:42Z",
          "endTime": "2022-03-02T15:53:48Z",
          "containerStarted": "2022-03-02T15:53:42Z",
          "replicaStatus": {
            "status": "Failed"
          },
          "image": "radixprod.azurecr.io/radix-app-dev-compute:6k8vv",
          "imageId": "radixprod.azurecr.io/radix-app-dev-compute@sha256:1f9ce890db8eb89ae0369995f76676a58af2a82129fc0babe080a5daca86a44e",
          "podIndex": 1,
          "exitCode": 1,
          "reason": "Error"
        }
      ]
    }
  ]
}
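To find out why jobs in a batch failed, the failed pods can be collected from the batch state. The Python sketch below uses a hypothetical function name and assumes that a pod status without podIndex has index 0, as the example above suggests.

```python
def failed_pods(batch_state: dict) -> list:
    """Return (job name, pod name, exit code, reason) for every failed pod."""
    found = []
    for job in batch_state.get("jobStatuses", []):
        # Assumption: a missing podIndex means index 0.
        pods = sorted(job.get("podStatuses", []), key=lambda p: p.get("podIndex", 0))
        for pod in pods:
            if pod.get("replicaStatus", {}).get("status") == "Failed":
                found.append((job["name"], pod["name"], pod.get("exitCode"), pod.get("reason")))
    return found


# Trimmed-down version of the failed batch state object above.
batch_state = {
    "jobStatuses": [
        {
            "name": "batch-compute-20220302155333-hrwl53mw-fjhcqwj7",
            "status": "Failed",
            "podStatuses": [
                {
                    "name": "batch-compute-20220302155333-hrwl53mw-fjhcqwj7-859xq",
                    "replicaStatus": {"status": "Failed"},
                    "podIndex": 1,
                    "exitCode": 1,
                    "reason": "Error",
                },
                {
                    "name": "batch-compute-20220302155333-hrwl53mw-fjhcqwj7-wbn9q",
                    "replicaStatus": {"status": "Failed"},
                    "exitCode": 1,
                    "reason": "Error",
                },
            ],
        }
    ]
}
for entry in failed_pods(batch_state):
    print(entry)
```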
Delete a batch
The batch list in the example above has a batch named batch-compute-20220302155333-hrwl53mw. To delete it, send a DELETE request to http://compute:8000/api/v1/batches/batch-compute-20220302155333-hrwl53mw. A successful deletion responds with a result object. Deleting a batch automatically deletes all jobs belonging to that batch.
{
  "status": "Success",
  "message": "batch batch-compute-20220302155333-hrwl53mw successfully deleted",
  "code": 200
}
Stop an existing batch
The batch list in the example above has a batch named batch-compute-20220302155333-hrwl53mw. To stop it, send a POST request to http://compute:8000/api/v1/batches/batch-compute-20220302155333-hrwl53mw/stop. A successful stop responds with a result object. Stopping a batch automatically deletes all Kubernetes jobs belonging to the batch and their replicas, as well as their logs. All jobs that have not completed get the status "Stopped".
{
  "status": "Success",
  "message": "batch batch-compute-20220302155333-hrwl53mw successfully stopped",
  "code": 200
}
Stop a job in a batch
The batch list in the example above has a batch named batch-compute-20220302155333-hrwl53mw with several jobs, one of which is named batch-compute-20220302155333-hrwl53mw-fjhcqwj7. To stop this job, send a POST request to http://compute:8000/api/v1/batches/batch-compute-20220302155333-hrwl53mw/jobs/batch-compute-20220302155333-hrwl53mw-fjhcqwj7/stop. A successful stop responds with a result object. Stopping a batch job automatically deletes the corresponding Kubernetes job and its replica, as well as its log. The job gets the status "Stopped".
{
  "status": "Success",
  "message": "job batch-compute-20220302155333-hrwl53mw-fjhcqwj7 in the batch batch-compute-20220302155333-hrwl53mw successfully stopped",
  "code": 200
}