AWS Batch

AWS Batch manages batch job scheduling and execution on managed compute infrastructure. You define job definitions (what to run), job queues (where to run), and compute environments (how to run). AWS Batch handles provisioning, scaling, and cleanup of compute resources.

Core Concepts

How AWS Batch Works

Job Definition (what to run)
  │
  ▼
Job Queue (priority, compute environment)
  │
  ▼
Compute Environment (EC2 or Fargate)
  │
  ▼
Job Scheduler (runs jobs based on priority, dependencies)

Key Terms

Term	Description
Job Definition	Blueprint for a job (image, resources, retry strategy)
Job	A running instance of a job definition
Job Queue	Queue with priority, maps to compute environment
Compute Environment	Managed EC2 or Fargate infra
Job Scheduler	AWS Batch scheduler (runs jobs in order)

Job Definitions

{
  "jobDefinitionName": "my-batch-job",
  "type": "container",
  "containerProperties": {
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-batch:latest",
    "vcpus": 2,
    "memory": 4096,
    "command": ["python", "process.py", "Ref::input_file"],
    "environment": [
      {"name": "BATCH_LOG_LEVEL", "value": "INFO"}
    ],
    "readonlyRootFilesystem": true,
    "privileged": false
  },
  "retryStrategy": {
    "attempts": 3,
    "evaluateOnExit": [
      {"action": "RETRY", "onStatusReason": "HostUsageError"},
      {"action": "EXIT", "onStatusReason": "TaskFailed"}
    ]
  },
  "timeout": {
    "attemptDurationSeconds": 3600
  }
}

Parameters (Job Template)

{
  "parameters": {
    "input_file": "s3://my-bucket/data/input.csv"
  }
}

Override at submit time:

aws batch submit-job \
  --job-name my-run \
  --job-definition my-batch-job \
  --job-queue my-queue \
  --parameters input_file=s3://my-bucket/data/new-input.csv

Compute Environments

Fargate (Serverless)

aws batch create-compute-environment \
  --compute-environment-name my-fargate-env \
  --type MANAGED \
  --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole \
  --compute-resources '{
    "type": "FARGATE",
    "maxvCpus": 256,
    "subnets": ["subnet-xxxxx", "subnet-yyyyy"],
    "securityGroupIds": ["sg-xxxxx"]
  }' \
  --state ENABLED

EC2 (Managed)

aws batch create-compute-environment \
  --compute-environment-name my-ec2-env \
  --type MANAGED \
  --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole \
  --compute-resources '{
    "type": "EC2",
    "minvCpus": 0,
    "desiredvCpus": 0,
    "maxvCpus": 256,
    "instanceTypes": ["m5", "m5d"],
    "subnets": ["subnet-xxxxx", "subnet-yyyyy"],
    "securityGroupIds": ["sg-xxxxx"],
    "instanceRole": "arn:aws:iam::123456789012:instance-profile/batch-instance-role"
  }' \
  --state ENABLED

Spot (Cheaper)

aws batch create-compute-environment \
  --compute-environment-name my-spot-env \
  --type MANAGED \
  --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole \
  --compute-resources '{
    "type": "SPOT",
    "allocationStrategy": "BEST_FIT_PROGRESSIVE",
    "minvCpus": 0,
    "maxvCpus": 256,
    "instanceTypes": ["m5", "c5"],
    "bidPercentage": 50,
    "subnets": ["subnet-xxxxx"],
    "securityGroupIds": ["sg-xxxxx"],
    "instanceRole": "arn:aws:iam::123456789012:instance-profile/batch-instance-role"
  }' \
  --state ENABLED

Job Queues

# Create queue
aws batch create-job-queue \
  --job-queue-name my-queue \
  --priority 1 \
  --compute-environment-order '[{"computeEnvironment": "my-fargate-env", "order": 1}]' \
  --state ENABLED
 
# Create with multiple compute environments (priority)
aws batch create-job-queue \
  --job-queue-name production-queue \
  --priority 10 \
  --compute-environment-order '[
    {"computeEnvironment": "my-fargate-env", "order": 1},
    {"computeEnvironment": "my-ec2-env", "order": 2}
  ]' \
  --state ENABLED

Job queues can have multiple compute environments with different priority. Jobs try the first environment, fall back to the next if insufficient resources.

Submitting Jobs

Simple Job

aws batch submit-job \
  --job-name my-analysis \
  --job-definition my-batch-job \
  --job-queue my-queue

Array Job (parallel processing)

aws batch submit-job \
  --job-name my-array \
  --job-definition my-batch-job \
  --job-queue my-queue \
  --array-properties size=100

100 jobs run in parallel. Each job can reference its array index:

import os
array_index = os.environ.get('AWS_BATCH_JOB_ARRAY_INDEX')

Multi-node Parallel Job

For MPI/HPC workloads:

aws batch submit-job \
  --job-name my-mpi \
  --job-definition my-mpi-job \
  --job-queue my-queue \
  --node-properties '{
    "numNodes": 4,
    "mainNode": 0,
    "nodeRangeProperties": [{
      "targetNodes": "0:3",
      "container": {
        "image": "my-mpi-image",
        "vcpus": 8,
        "memory": 16384
      }
    }]
  }'

Job Dependencies

# Job 2 depends on Job 1 completing successfully
aws batch submit-job \
  --job-name job2 \
  --job-definition my-batch-job \
  --job-queue my-queue \
  --depends-on '[{"jobId": "xxxxx", "type": "N_TO_N"}]'

Monitoring

# List jobs
aws batch list-jobs --job-queue my-queue --job-status RUNNABLE
 
# Describe job
aws batch describe-jobs --jobs xxxxx
 
# Get job logs
aws batch describe-job-log-groups
# Or: CloudWatch Logs (if configured in job definition)

CloudWatch Events

Monitor job state changes:

aws events put-rule \
  --name batch-job-events \
  --event-pattern '{
    "source": ["aws.batch"],
    "detail-type": ["AWS Batch Job State Change"]
  }'

Retry Strategy

{
  "retryStrategy": {
    "attempts": 3,
    "evaluateOnExit": [
      {"action": "RETRY", "onReason": "HostUsageError"},
      {"action": "RETRY", "onReason": "NonZeroExitCode"},
      {"action": "EXIT", "onReason": "TaskFailed"}
    ]
  }
}

Common onStatusReason values:

HostUsageError — resource exhaustion, retry
TaskFailed — task failed, don’t retry
JobTimeout — job timed out, retry

Pricing

Component	Cost
EC2 (on-demand)	$0.096/hr (m5.xlarge)
EC2 (Spot)	70-90% off
Fargate	$0.04048/ v CP U - h r +$ 0.00444/GB-hr
No charge	Job scheduling, queues

Architecture: Batch Processing Pipeline

S3 (input bucket)
  │
  ▼
Lambda (trigger on new file)
  │
  ▼
AWS Batch (submit job)
  │
  ├── Job 1 (EC2/Spot, 8 vCPU, 16GB)
  ├── Job 2 (EC2/Spot, 8 vCPU, 16GB)
  └── Job 100 (EC2/Spot, 8 vCPU, 16GB)
         │
         ▼
       S3 (output bucket)
         │
         ▼
       SNS (notify completion)

References

Homepage: https://aws.amazon.com/batch/
Documentation: https://docs.aws.amazon.com/batch/
Pricing: https://aws.amazon.com/batch/pricing/

Pricing Examples

Scenario 1: A nightly data processing job (8 vCPU, 16GB) running 5 hours on 10 parallel nodes. Fargate: 10 × 8 vCPU × 5hr × $0.04048 =$ 16.19. Plus memory: 10 × 16GB × 5hr × $0.00444 =$ 3.55. Total: ~ $19.74 p err u n .30 r u n s / m o n t h =$ 592/month.

Scenario 2: The same job on EC2 Spot (70% savings). EC2 m5.xlarge (4 vCPU, 16GB) Spot: $0.058/ h r \times 20 in s t an ces \times 5 h r =$ 5.80/job × 30 = $174/month. Fargate is 3.4x more expensive but requires no EC2 management.

Scenario 3: An ML training job (64 vCPU, 256GB) running 10 hours once. EC2 Spot (c5.16xlarge = 64 vCPU): $1.38/ h r \times 10 h r \times 1 in s t an ce =$ 13.80/job. On-demand: $2.61/ h r \times 10 =$ 26.10. Spot saves $12.30 p err u n .20 r u n s / m o n t h =$ 246 savings.

Nuggets & Gotchas

AWS Batch doesn’t have a built-in retry for Spot interruptions — configure evaluateOnExit: When Spot interrupts a job, Batch marks it as FAILED. Use "onStatusReason": "HostUsageError" in retry strategy to automatically resubmit interrupted jobs.
Array jobs share the same job definition — each array index gets its own job: If you need different parameters per array index, use AWS_BATCH_JOB_ARRAY_INDEX in your code to determine which chunk of data to process.
Fargate compute environments have a 16-vCPU limit per job — for larger jobs use EC2: If you try to submit a job with 32 vCPU to a Fargate environment, it will fail. Use EC2 for high-vCPU workloads.
Jobs timeout based on attemptDurationSeconds — if your job takes > 1 hour and you forget to set timeout, it will fail: Default timeout is infinite (no timeout). Set timeout.attemptDurationSeconds to a value slightly above your expected runtime.
The jobQueue parameter on submit-job is required — don’t confuse it with computeEnvironmentOrder: The queue is where you submit jobs. The compute environment is what the queue maps to. You can’t submit directly to a compute environment.

cloudnative wiki

Explorer

AWS Batch

AWS Batch

Core Concepts

How AWS Batch Works

Key Terms

Job Definitions

Parameters (Job Template)

Compute Environments

Fargate (Serverless)

EC2 (Managed)

Spot (Cheaper)

Job Queues

Submitting Jobs

Simple Job

Array Job (parallel processing)

Multi-node Parallel Job

Job Dependencies

Monitoring

CloudWatch Events

Retry Strategy

Pricing

Architecture: Batch Processing Pipeline

References

Pricing Examples

Nuggets & Gotchas

Graph View

Table of Contents