
Deploying Airflow Worker

Apache Airflow Worker is a critical component of the Apache Airflow distributed workflow orchestration system. Workers execute tasks that are scheduled by the Airflow scheduler, providing horizontal scalability for processing complex data pipelines and workflows. By deploying Airflow Workers on Klutch.sh, you can leverage scalable cloud infrastructure, automated deployments from GitHub, persistent storage for logs, and flexible compute resources to handle intensive task execution workloads.

This comprehensive guide walks you through deploying Apache Airflow Workers on Klutch.sh using a Dockerfile. You’ll learn how to set up your environment, create a production-ready Dockerfile, configure persistent storage for logs and task metadata, manage environment variables for connecting to your Airflow infrastructure, and follow best practices for scaling and monitoring your worker deployment.

Prerequisites

Before deploying an Airflow Worker to Klutch.sh, ensure you have:

  • A Klutch.sh account
  • A GitHub repository for your Airflow Worker project
  • A running Airflow scheduler and webserver instance (see the Airflow deployment guide)
  • A message broker (Redis or RabbitMQ) for Celery task distribution
  • A metadata database (PostgreSQL recommended) shared with your scheduler
  • Basic knowledge of Docker, Python, and Airflow architecture
  • Understanding of distributed systems and message queues

Understanding Airflow Worker Architecture

Before deployment, it’s important to understand how Airflow Workers fit into the overall Airflow architecture:

  • Scheduler: Monitors DAGs and assigns tasks to the message queue
  • Message Broker: Distributes tasks to available workers (Redis or RabbitMQ)
  • Workers: Execute tasks and report results back through the result backend
  • Metadata Database: Stores task states, DAG definitions, and execution history
  • Webserver: Provides the UI for monitoring (separate from workers)

Workers communicate with the scheduler through a message broker using the Celery executor. Multiple workers can be deployed to scale task execution horizontally, enabling parallel processing of multiple tasks across different workflows.
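
In practice, this means the scheduler, webserver, and every worker must agree on a few core settings. A minimal sketch of that shared configuration (hostnames and credentials are placeholders; these variables are covered in detail in the Environment Variables section below):

Terminal window
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CELERY__BROKER_URL=redis://your-redis-host:6379/0
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://username:password@your-postgres-host:5432/airflow
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@your-postgres-host:5432/airflow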


Getting Started: Install Airflow Worker Locally

Before deploying to Klutch.sh, it’s helpful to understand how Airflow Workers operate by testing locally. This section demonstrates setting up a worker that connects to an existing Airflow infrastructure.

    1. Create a project directory and set up a virtual environment:

      Terminal window
      mkdir airflow-worker
      cd airflow-worker
      python3 -m venv venv
      source venv/bin/activate # On Windows: venv\Scripts\activate
    2. Install Apache Airflow with Celery support:

      Terminal window
      pip install "apache-airflow[celery,redis]==2.7.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.8.txt"
    3. Set essential environment variables:

      Terminal window
      export AIRFLOW_HOME=$(pwd)/airflow
      export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
      export AIRFLOW__CELERY__BROKER_URL=redis://localhost:6379/0
      export AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://username:password@localhost:5432/airflow
      export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@localhost:5432/airflow

      Note: Replace the connection strings with your actual Redis and PostgreSQL connection details.

    4. Ensure your DAGs directory exists:

      Terminal window
      mkdir -p $AIRFLOW_HOME/dags
    5. Start the Airflow worker:

      Terminal window
      airflow celery worker

      The worker will connect to your message broker and begin listening for tasks from the scheduler.

    6. Verify worker operation:

      In a separate terminal (with the same environment variables set), query the workers through the Celery CLI; you can also confirm in the Airflow webserver UI that tasks complete successfully:

      Terminal window
      celery --app airflow.providers.celery.executors.celery_executor.app inspect active

Once your local worker connects successfully and can execute tasks from your scheduler, you’re ready to containerize and deploy to Klutch.sh.
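
To confirm end-to-end execution rather than just connectivity, you can trigger one of your own DAGs from the CLI and watch the worker pick up its tasks (the DAG id below is a placeholder):

Terminal window
airflow dags unpause your_dag_id
airflow dags trigger your_dag_id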


Deploying Airflow Worker with a Dockerfile

Klutch.sh automatically detects a Dockerfile in your repository’s root directory and uses it to build and deploy your application. This approach provides complete control over your worker environment, dependencies, and configuration.

    1. Create a Dockerfile in your project root:

      # Use the official Apache Airflow image as base
      FROM apache/airflow:2.7.3-python3.11

      # Set working directory
      WORKDIR /opt/airflow

      # Switch to root to install system dependencies if needed
      USER root
      RUN apt-get update && apt-get install -y --no-install-recommends \
          build-essential \
          libpq-dev \
          && apt-get clean \
          && rm -rf /var/lib/apt/lists/*

      # Switch back to airflow user for security
      USER airflow

      # Copy and install Python dependencies
      COPY requirements.txt /opt/airflow/requirements.txt
      RUN pip install --no-cache-dir -r requirements.txt

      # Copy DAGs directory (workers need access to DAG definitions)
      COPY dags /opt/airflow/dags

      # Copy plugins if you have custom operators or hooks
      COPY plugins /opt/airflow/plugins

      # Set environment variables
      ENV AIRFLOW_HOME=/opt/airflow
      ENV AIRFLOW__CORE__LOAD_EXAMPLES=False
      ENV AIRFLOW__CORE__EXECUTOR=CeleryExecutor

      # Expose port for worker monitoring (optional)
      EXPOSE 8793

      # Start the Celery worker
      CMD ["celery", "--app", "airflow.providers.celery.executors.celery_executor.app", "worker", "--concurrency", "4", "--loglevel", "INFO"]
    2. Create a requirements.txt file:

      apache-airflow[celery,redis]==2.7.3
      apache-airflow-providers-postgres==5.7.1
      psycopg2-binary==2.9.9
      redis==5.0.1
      celery==5.3.4
      # Add any custom dependencies your DAGs require
      pandas==2.1.3
      requests==2.31.0
      boto3==1.29.7
    3. Create necessary directory structure:

      Terminal window
      mkdir -p dags plugins logs config
    4. Copy your DAG files to the dags directory:

      Workers need access to the same DAG definitions as your scheduler. Either:

      • Copy DAGs into the image (shown in Dockerfile above)
      • Or mount a shared volume containing DAGs (covered in persistent storage section)
    5. Create a .dockerignore file:

      venv/
      __pycache__/
      *.pyc
      *.pyo
      *.pyd
      .git/
      .gitignore
      .env
      .DS_Store
      *.log
      logs/
      airflow.db
      airflow.cfg
      webserver_config.py
    6. Test your Docker build locally:

      Terminal window
      docker build -t airflow-worker:local .
      docker run --rm \
      -e AIRFLOW__CELERY__BROKER_URL=redis://your-redis:6379/0 \
      -e AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://user:pass@db:5432/airflow \
      airflow-worker:local
    7. Initialize a Git repository and push to GitHub:

      Terminal window
      git init
      git add .
      git commit -m "Initial Airflow Worker setup for Klutch.sh"
      git remote add origin https://github.com/your-username/airflow-worker.git
      git push -u origin main
    8. Deploy to Klutch.sh:

      • Log in to Klutch.sh
      • Create a new project or select an existing one
      • Create a new app:
        • Select your GitHub repository
        • Choose the branch you want to deploy (e.g., main)
        • Klutch.sh will automatically detect your Dockerfile
        • Select TCP as the traffic type (workers don’t serve HTTP traffic)
        • Set the internal port to 8793 (Airflow worker’s default monitoring port)
        • Choose your preferred region and compute resources
        • Configure environment variables (see next section)
        • Attach persistent volumes for logs
      • Click “Create” to start the deployment

      Klutch.sh will build your Docker image and deploy your Airflow Worker. The worker will automatically connect to your message broker and begin processing tasks.
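
Once the deployment is live, a quick way to confirm the new worker has joined the cluster is to ping it over Celery from any machine configured with the same broker URL (a sketch; the app path matches the CMD in the Dockerfile above):

Terminal window
celery --app airflow.providers.celery.executors.celery_executor.app inspect ping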


Persistent Storage

Airflow Workers require persistent storage for task execution logs and temporary data. Klutch.sh provides persistent volumes that survive deployments and container restarts.

Required Volumes for Workers

When creating your worker app on Klutch.sh, attach persistent volumes with these mount paths:

    1. Logs Directory:

      • Mount Path: /opt/airflow/logs
      • Recommended Size: 20-100 GB (depending on task volume and log retention)
      • Purpose: Stores task execution logs for debugging and monitoring
    2. Worker State Directory (optional):

      • Mount Path: /opt/airflow/worker_state
      • Recommended Size: 5 GB
      • Purpose: Stores worker process information and state

Important Notes About Volumes

  • In Klutch.sh, you only specify the mount path and size when creating volumes
  • Ensure your container process has write permissions to these directories
  • Workers don’t need access to the metadata database files (unlike the scheduler)
  • The DAGs directory can either be baked into the Docker image or mounted as a shared volume
  • Regularly monitor disk usage and adjust volume sizes as needed
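
If the container user cannot write to a newly attached logs volume, task logging will fail. A one-time fix from a root shell inside the container might look like this (the official apache/airflow image runs as UID 50000 with group 0):

Terminal window
# Make the mounted logs volume writable by the airflow user
chown -R 50000:0 /opt/airflow/logs
chmod -R g+rw /opt/airflow/logs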

Sharing DAGs via Volumes (Alternative Approach)

If you prefer not to rebuild the Docker image every time DAGs change, you can mount a shared DAGs volume:

  • Mount Path: /opt/airflow/dags
  • Size: 5-10 GB
  • Note: This volume should contain the same DAG files as your scheduler

This approach requires setting up a shared storage solution between your scheduler and workers, which can be achieved through:

  • Git-sync sidecar containers (advanced setup)
  • External storage mounted to both scheduler and workers
  • Or simply rebuilding worker images when DAGs change (simpler approach)
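
If you choose the shared-volume route, a very simple synchronization loop could keep the DAGs volume up to date. This is a sketch, not a production git-sync setup; it assumes git is available in the container, the volume was initially populated with a clone of your DAGs repository, and a 60-second interval is acceptable:

Terminal window
# Hypothetical sync loop: pull DAG changes into the shared volume every 60 seconds
while true; do
  git -C /opt/airflow/dags pull --ff-only || echo "DAG sync failed; will retry"
  sleep 60
done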

Environment Variables

Configure your Airflow Worker using environment variables. Add these in the Klutch.sh app configuration panel:

Essential Worker Environment Variables

Terminal window
# Core Configuration
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
AIRFLOW__CORE__LOAD_EXAMPLES=False
AIRFLOW__CORE__FERNET_KEY=<your-fernet-key>
# Database Configuration (must match scheduler's database)
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@your-postgres-host:5432/airflow
# Celery Configuration
AIRFLOW__CELERY__BROKER_URL=redis://your-redis-host:6379/0
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://username:password@your-postgres-host:5432/airflow
AIRFLOW__CELERY__WORKER_CONCURRENCY=4
AIRFLOW__CELERY__WORKER_PREFETCH_MULTIPLIER=1
# Logging Configuration
AIRFLOW__LOGGING__BASE_LOG_FOLDER=/opt/airflow/logs
AIRFLOW__LOGGING__REMOTE_LOGGING=False
AIRFLOW__LOGGING__LOGGING_LEVEL=INFO
# Worker Configuration
AIRFLOW__CELERY__WORKER_LOG_SERVER_PORT=8793
AIRFLOW__CELERY__FLOWER_PORT=5555

Critical Configuration Requirements

  1. Fernet Key: Must be the same across all Airflow components (scheduler, webserver, workers)

    Terminal window
    # Generate with:
    python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
  2. Database Connection: All workers must connect to the same metadata database as the scheduler

  3. Broker URL: All workers must connect to the same message broker (Redis or RabbitMQ)

  4. Result Backend: Should point to your metadata database for task result storage

Optional Performance Tuning Variables

Terminal window
# Adjust worker concurrency based on workload
AIRFLOW__CELERY__WORKER_CONCURRENCY=8
# Pool management
AIRFLOW__CELERY__WORKER_PREFETCH_MULTIPLIER=2
# Task timeout settings
AIRFLOW__CELERY__TASK_SOFT_TIME_LIMIT=600
AIRFLOW__CELERY__TASK_TIME_LIMIT=1200
# Autoscale workers (format is max,min)
AIRFLOW__CELERY__WORKER_AUTOSCALE=16,4

Using Environment Variables for Customization with Nixpacks

If you need to customize build or start commands when not using a Dockerfile, Klutch.sh uses Nixpacks and supports these environment variables:

  • START_COMMAND: Override the default start command
  • BUILD_COMMAND: Override the default build command

However, for Airflow Worker deployments, using a Dockerfile (as shown above) is strongly recommended for complete control over the environment.
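
For completeness, a hypothetical Nixpacks-based setup might override the commands roughly like this (a sketch only; the exact values depend on your repository layout, and the Dockerfile approach above remains the recommended path):

Terminal window
# Hypothetical overrides in the Klutch.sh environment panel (not needed when a Dockerfile is present)
BUILD_COMMAND=pip install "apache-airflow[celery,redis]==2.7.3"
START_COMMAND=airflow celery worker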


Traffic Type and Port Configuration

When deploying Airflow Workers on Klutch.sh:

  • Traffic Type: Select TCP in the Klutch.sh dashboard (workers don’t serve HTTP traffic)
  • Internal Port: Set to 8793 (Airflow worker’s log server port for monitoring)
  • External Access: Workers are backend services and typically don’t require public access

Airflow Workers listen on port 8793 for the log server, which allows the webserver to fetch task logs. When TCP traffic is selected, Klutch.sh routes connections on port 8000 to your container’s specified internal port (8793 in this case).
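
To confirm the log server is reachable from inside your network (for example, from the webserver host), a simple check might look like this; any HTTP response, even a 404 for the root path, shows the port is open (the hostname is a placeholder):

Terminal window
curl -I http://your-worker-host:8793/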

Important Traffic Notes

  • Workers primarily communicate with the message broker and database, not external HTTP traffic
  • The log server (port 8793) is used by the Airflow webserver to fetch logs
  • If you deploy Flower (Celery monitoring tool), it typically runs on port 5555
  • Most worker deployments don’t need external port exposure unless you’re running Flower

Connecting to Your Airflow Infrastructure

For workers to function properly, they need access to three core services:

    1. Message Broker (Redis recommended):

      Deploy Redis on Klutch.sh or use an external Redis service:

      Terminal window
      AIRFLOW__CELERY__BROKER_URL=redis://your-redis-host:6379/0
    2. PostgreSQL Database:

      Deploy PostgreSQL on Klutch.sh or use an external database service:

      Terminal window
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@postgres-host:5432/airflow
    3. Scheduler and Webserver:

      Your scheduler must be running and connected to the same broker and database. See the Airflow deployment guide for scheduler setup.

Network Connectivity

Ensure that:

  • Workers can reach the Redis/RabbitMQ broker
  • Workers can reach the PostgreSQL database
  • Network security groups allow traffic between components
  • Connection strings use accessible hostnames or IP addresses
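
A quick way to verify these requirements from a shell inside the worker container (a sketch; it assumes the redis-cli and psql clients are installed and uses placeholder hostnames and credentials):

Terminal window
# Broker check: expect PONG
redis-cli -h your-redis-host -p 6379 ping
# Database check: expect a single-row result
psql "postgresql://username:password@postgres-host:5432/airflow" -c "SELECT 1;"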

Scaling and Production Best Practices

    1. Horizontal Scaling:

      • Deploy multiple worker instances to handle higher task loads
      • Each worker instance should have the same configuration
      • Celery automatically distributes tasks across available workers
      • Use Klutch.sh’s instance scaling features to add more workers
    2. Worker Concurrency:

      • Configure AIRFLOW__CELERY__WORKER_CONCURRENCY based on CPU cores
      • General rule: Set to number of CPU cores or slightly higher
      • Monitor CPU usage and adjust accordingly
      • Consider task type (CPU-bound vs I/O-bound)
    3. Resource Allocation:

      • Allocate sufficient memory based on task requirements
      • CPU-intensive tasks need more CPU cores
      • I/O-intensive tasks benefit from higher concurrency
      • Monitor resource usage in Klutch.sh dashboard
    4. Queue Management:

      • Use multiple Celery queues for task prioritization
      • Deploy specialized workers for specific queues
      • Example: --queues=high_priority,default,low_priority
      • Configure in your worker startup command (see the sketch after this list)
    5. High Availability:

      • Deploy workers across multiple availability zones
      • Run at least 2-3 worker instances for redundancy
      • Workers are stateless and can be easily replaced
      • Failed workers don’t lose tasks (handled by message broker)
    6. Monitoring and Logging:

      • Monitor worker health through Celery events
      • Use Flower for real-time worker monitoring
      • Check logs regularly in mounted volume
      • Set up alerts for worker failures
    7. Security:

      • Use strong passwords for Redis and PostgreSQL
      • Keep Fernet key secure and consistent
      • Restrict network access to worker services
      • Regularly update Airflow and dependencies
    8. Task Isolation:

      • Use virtual environments or containers for task dependencies
      • Consider KubernetesPodOperator for complete task isolation
      • Prevent dependency conflicts between tasks
      • Keep worker images lightweight
    9. DAG Synchronization:

      • Ensure workers have the same DAG files as scheduler
      • Use Git-sync for automatic DAG updates
      • Or rebuild worker images when DAGs change
      • Test DAGs before deploying to production workers
    10. Performance Optimization:

      • Enable worker_prefetch_multiplier for better task distribution
      • Use connection pooling for database connections
      • Configure appropriate task timeouts
      • Implement task retry logic for transient failures
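
As an example of the queue management point above, a worker dedicated to a single queue could be started like this (the queue name and concurrency are placeholders to adapt to your setup):

Terminal window
airflow celery worker --queues high_priority --concurrency 8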

Monitoring Your Workers

Using Flower for Worker Monitoring

Flower is a real-time monitoring tool for Celery. To add Flower to your deployment:

  1. Add Flower to your requirements.txt:

    flower==2.0.1
  2. Deploy Flower as a separate app on Klutch.sh:

    Terminal window
    celery --app=airflow.providers.celery.executors.celery_executor.app flower
  3. Access Flower UI:

    • Deploy with HTTP traffic type
    • Set internal port to 5555
    • Access at your Klutch.sh URL: https://example-app.klutch.sh

Checking Worker Status

Beyond Flower, you can confirm that workers are healthy from the Airflow webserver and the Celery CLI:

  • In the webserver, open Browse → Task Instances to confirm tasks move from queued to running
  • Tasks that linger in the queued state usually point to missing, unreachable, or overloaded workers
  • Use the Celery CLI inspect active command (shown earlier) to list the tasks each worker is executing
  • Watch queue lengths and per-worker utilization in Flower or your broker’s own metrics

Troubleshooting

Common Issues

Worker not connecting to broker:

  • Verify Redis/RabbitMQ connection string
  • Check network connectivity
  • Ensure broker is running and accessible
  • Review worker logs for connection errors

Tasks not being picked up:

  • Verify worker is running: celery inspect active
  • Check that DAGs are loaded correctly
  • Ensure scheduler is running and assigning tasks
  • Verify queue names match between scheduler and workers

Database connection errors:

  • Confirm PostgreSQL connection string
  • Check database credentials
  • Ensure database is accessible from worker
  • Verify database has required Airflow tables

Out of disk space:

  • Monitor log volume usage in Klutch.sh dashboard
  • Increase volume size as needed
  • Implement log rotation policies
  • Clean up old log files regularly
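
A simple cleanup sketch that could be run periodically inside the container (the 30-day retention window is an assumption; adjust it to your own policy):

Terminal window
# Delete task log files older than 30 days from the mounted logs volume
find /opt/airflow/logs -type f -name "*.log" -mtime +30 -delete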

High memory usage:

  • Reduce worker concurrency
  • Check for memory leaks in custom operators
  • Monitor task memory consumption
  • Increase worker memory allocation

Tasks timing out:

  • Increase AIRFLOW__CELERY__TASK_TIME_LIMIT
  • Check task complexity and optimize
  • Ensure sufficient worker resources
  • Review task logs for bottlenecks

Deploying Apache Airflow Workers on Klutch.sh provides scalable, distributed task execution for your data workflows. With proper configuration, persistent storage, and production best practices, you can build a robust and reliable task processing infrastructure that scales with your needs.