
Deploying Airflow Worker

Apache Airflow Worker is a critical component of the Apache Airflow distributed workflow orchestration system. Workers execute tasks that are scheduled by the Airflow scheduler, providing horizontal scalability for processing complex data pipelines and workflows. By deploying Airflow Workers on Klutch.sh, you can leverage scalable cloud infrastructure, automated deployments from GitHub, persistent storage for logs, and flexible compute resources to handle intensive task execution workloads.

This comprehensive guide walks you through deploying Apache Airflow Workers on Klutch.sh using a Dockerfile. You’ll learn how to set up your environment, create a production-ready Dockerfile, configure persistent storage for logs and task metadata, manage environment variables for connecting to your Airflow infrastructure, and follow best practices for scaling and monitoring your worker deployment.

Prerequisites

Before deploying an Airflow Worker to Klutch.sh, ensure you have:

  • A Klutch.sh account
  • A GitHub repository for your Airflow Worker project
  • A running Airflow scheduler and webserver instance (see the Airflow deployment guide)
  • A message broker (Redis or RabbitMQ) for Celery task distribution
  • A metadata database (PostgreSQL recommended) shared with your scheduler
  • Basic knowledge of Docker, Python, and Airflow architecture
  • Understanding of distributed systems and message queues

Understanding Airflow Worker Architecture

Before deployment, it’s important to understand how Airflow Workers fit into the overall Airflow architecture:

  • Scheduler: Monitors DAGs and assigns tasks to the message queue
  • Message Broker: Distributes tasks to available workers (Redis or RabbitMQ)
  • Workers: Execute tasks and report results back through the result backend
  • Metadata Database: Stores task states, DAG definitions, and execution history
  • Webserver: Provides the UI for monitoring (separate from workers)

Workers communicate with the scheduler through a message broker using the Celery executor. Multiple workers can be deployed to scale task execution horizontally, enabling parallel processing of multiple tasks across different workflows.
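
In practice, this means the scheduler, webserver, and every worker must agree on a few core settings. A minimal sketch of that shared configuration (hostnames and credentials are placeholders; these variables are covered in detail in the Environment Variables section below):

Terminal window
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CELERY__BROKER_URL=redis://your-redis-host:6379/0
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://username:password@your-postgres-host:5432/airflow
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@your-postgres-host:5432/airflow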


Getting Started: Install Airflow Worker Locally

Before deploying to Klutch.sh, it’s helpful to understand how Airflow Workers operate by testing locally. This section demonstrates setting up a worker that connects to an existing Airflow infrastructure.

    1. Create a project directory and set up a virtual environment:

      Terminal window
      mkdir airflow-worker
      cd airflow-worker
      python3 -m venv venv
      source venv/bin/activate # On Windows: venv\Scripts\activate
    2. Install Apache Airflow with Celery support:

      Terminal window
      pip install "apache-airflow[celery,redis]==2.7.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.8.txt"
    3. Set essential environment variables:

      Terminal window
      export AIRFLOW_HOME=$(pwd)/airflow
      export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
      export AIRFLOW__CELERY__BROKER_URL=redis://localhost:6379/0
      export AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://username:password@localhost:5432/airflow
      export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@localhost:5432/airflow

      Note: Replace the connection strings with your actual Redis and PostgreSQL connection details.

    4. Ensure your DAGs directory exists:

      Terminal window
      mkdir -p $AIRFLOW_HOME/dags
    5. Start the Airflow worker:

      Terminal window
      airflow celery worker

      The worker will connect to your message broker and begin listening for tasks from the scheduler.

    6. Verify worker operation:

      In a separate terminal (with the same environment variables set), query the workers through the Celery CLI; you can also confirm in the Airflow webserver UI that tasks complete successfully:

      Terminal window
      celery --app airflow.providers.celery.executors.celery_executor.app inspect active

Once your local worker connects successfully and can execute tasks from your scheduler, you’re ready to containerize and deploy to Klutch.sh.
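
To confirm end-to-end execution rather than just connectivity, you can trigger one of your own DAGs from the CLI and watch the worker pick up its tasks (the DAG id below is a placeholder):

Terminal window
airflow dags unpause your_dag_id
airflow dags trigger your_dag_id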


Deploying Airflow Worker with a Dockerfile

Klutch.sh automatically detects a Dockerfile in your repository’s root directory and uses it to build and deploy your application. This approach provides complete control over your worker environment, dependencies, and configuration.

    1. Create a Dockerfile in your project root:

      # Use the official Apache Airflow image as base
      FROM apache/airflow:2.7.3-python3.11

      # Set working directory
      WORKDIR /opt/airflow

      # Switch to root to install system dependencies if needed
      USER root
      RUN apt-get update && apt-get install -y --no-install-recommends \
          build-essential \
          libpq-dev \
          && apt-get clean \
          && rm -rf /var/lib/apt/lists/*

      # Switch back to airflow user for security
      USER airflow

      # Copy and install Python dependencies
      COPY requirements.txt /opt/airflow/requirements.txt
      RUN pip install --no-cache-dir -r requirements.txt

      # Copy DAGs directory (workers need access to DAG definitions)
      COPY dags /opt/airflow/dags

      # Copy plugins if you have custom operators or hooks
      COPY plugins /opt/airflow/plugins

      # Set environment variables
      ENV AIRFLOW_HOME=/opt/airflow
      ENV AIRFLOW__CORE__LOAD_EXAMPLES=False
      ENV AIRFLOW__CORE__EXECUTOR=CeleryExecutor

      # Expose port for worker monitoring (optional)
      EXPOSE 8793

      # Start the Celery worker
      CMD ["celery", "--app", "airflow.providers.celery.executors.celery_executor.app", "worker", "--concurrency", "4", "--loglevel", "INFO"]
    2. Create a requirements.txt file:

      apache-airflow[celery,redis]==2.7.3
      apache-airflow-providers-postgres==5.7.1
      psycopg2-binary==2.9.9
      redis==5.0.1
      celery==5.3.4
      # Add any custom dependencies your DAGs require
      pandas==2.1.3
      requests==2.31.0
      boto3==1.29.7
    3. Create necessary directory structure:

      Terminal window
      mkdir -p dags plugins logs config
    4. Copy your DAG files to the dags directory:

      Workers need access to the same DAG definitions as your scheduler. Either:

      • Copy DAGs into the image (shown in Dockerfile above)
      • Or mount a shared volume containing DAGs (covered in persistent storage section)
    5. Create a .dockerignore file:

      venv/
      __pycache__/
      *.pyc
      *.pyo
      *.pyd
      .git/
      .gitignore
      .env
      .DS_Store
      *.log
      logs/
      airflow.db
      airflow.cfg
      webserver_config.py
    6. Test your Docker build locally:

      Terminal window
      docker build -t airflow-worker:local .
      docker run --rm \
      -e AIRFLOW__CELERY__BROKER_URL=redis://your-redis:6379/0 \
      -e AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://user:pass@db:5432/airflow \
      airflow-worker:local
    7. Initialize a Git repository and push to GitHub:

      Terminal window
      git init
      git add .
      git commit -m "Initial Airflow Worker setup for Klutch.sh"
      git remote add origin https://github.com/your-username/airflow-worker.git
      git push -u origin main
    8. Deploy to Klutch.sh:

      • Log in to Klutch.sh
      • Create a new project or select an existing one
      • Create a new app:
        • Select your GitHub repository
        • Choose the branch you want to deploy (e.g., main)
        • Klutch.sh will automatically detect your Dockerfile
        • Select TCP as the traffic type (workers don’t serve HTTP traffic)
        • Set the internal port to 8793 (Airflow worker’s default monitoring port)
        • Choose your preferred region and compute resources
        • Configure environment variables (see next section)
        • Attach persistent volumes for logs
      • Click “Create” to start the deployment

      Klutch.sh will build your Docker image and deploy your Airflow Worker. The worker will automatically connect to your message broker and begin processing tasks.
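
Once the deployment is live, a quick way to confirm the new worker has joined the cluster is to ping it over Celery from any machine configured with the same broker URL (a sketch; the app path matches the CMD in the Dockerfile above):

Terminal window
celery --app airflow.providers.celery.executors.celery_executor.app inspect ping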


Persistent Storage

Airflow Workers require persistent storage for task execution logs and temporary data. Klutch.sh provides persistent volumes that survive deployments and container restarts.

Required Volumes for Workers

When creating your worker app on Klutch.sh, attach persistent volumes with these mount paths:

    1. Logs Directory:

      • Mount Path: /opt/airflow/logs
      • Recommended Size: 20-100 GB (depending on task volume and log retention)
      • Purpose: Stores task execution logs for debugging and monitoring
    2. Worker State Directory (optional):

      • Mount Path: /opt/airflow/worker_state
      • Recommended Size: 5 GB
      • Purpose: Stores worker process information and state

Important Notes About Volumes

  • In Klutch.sh, you only specify the mount path and size when creating volumes
  • Ensure your container process has write permissions to these directories
  • Workers don’t need access to the metadata database files (unlike the scheduler)
  • The DAGs directory can either be baked into the Docker image or mounted as a shared volume
  • Regularly monitor disk usage and adjust volume sizes as needed
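
If the container user cannot write to a newly attached logs volume, task logging will fail. A one-time fix from a root shell inside the container might look like this (the official apache/airflow image runs as UID 50000 with group 0):

Terminal window
# Make the mounted logs volume writable by the airflow user
chown -R 50000:0 /opt/airflow/logs
chmod -R g+rw /opt/airflow/logs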

Sharing DAGs via Volumes (Alternative Approach)

If you prefer not to rebuild the Docker image every time DAGs change, you can mount a shared DAGs volume:

  • Mount Path: /opt/airflow/dags
  • Size: 5-10 GB
  • Note: This volume should contain the same DAG files as your scheduler

This approach requires setting up a shared storage solution between your scheduler and workers, which can be achieved through:

  • Git-sync sidecar containers (advanced setup)
  • External storage mounted to both scheduler and workers
  • Or simply rebuilding worker images when DAGs change (simpler approach)
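
If you choose the shared-volume route, a very simple synchronization loop could keep the DAGs volume up to date. This is a sketch, not a production git-sync setup; it assumes git is available in the container, the volume was initially populated with a clone of your DAGs repository, and a 60-second interval is acceptable:

Terminal window
# Hypothetical sync loop: pull DAG changes into the shared volume every 60 seconds
while true; do
  git -C /opt/airflow/dags pull --ff-only || echo "DAG sync failed; will retry"
  sleep 60
done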

Environment Variables

Configure your Airflow Worker using environment variables. Add these in the Klutch.sh app configuration panel:

Essential Worker Environment Variables

Terminal window
# Core Configuration
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
AIRFLOW__CORE__LOAD_EXAMPLES=False
AIRFLOW__CORE__FERNET_KEY=<your-fernet-key>
# Database Configuration (must match scheduler's database)
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@your-postgres-host:5432/airflow
# Celery Configuration
AIRFLOW__CELERY__BROKER_URL=redis://your-redis-host:6379/0
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://username:password@your-postgres-host:5432/airflow
AIRFLOW__CELERY__WORKER_CONCURRENCY=4
AIRFLOW__CELERY__WORKER_PREFETCH_MULTIPLIER=1
# Logging Configuration
AIRFLOW__LOGGING__BASE_LOG_FOLDER=/opt/airflow/logs
AIRFLOW__LOGGING__REMOTE_LOGGING=False
AIRFLOW__LOGGING__LOGGING_LEVEL=INFO
# Worker Configuration
AIRFLOW__CELERY__WORKER_LOG_SERVER_PORT=8793
AIRFLOW__CELERY__FLOWER_PORT=5555

Critical Configuration Requirements

  1. Fernet Key: Must be the same across all Airflow components (scheduler, webserver, workers)

    Terminal window
    # Generate with:
    python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
  2. Database Connection: All workers must connect to the same metadata database as the scheduler

  3. Broker URL: All workers must connect to the same message broker (Redis or RabbitMQ)

  4. Result Backend: Should point to your metadata database for task result storage

Optional Performance Tuning Variables

Terminal window
# Adjust worker concurrency based on workload
AIRFLOW__CELERY__WORKER_CONCURRENCY=8
# Pool management
AIRFLOW__CELERY__WORKER_PREFETCH_MULTIPLIER=2
# Task timeout settings
AIRFLOW__CELERY__TASK_SOFT_TIME_LIMIT=600
AIRFLOW__CELERY__TASK_TIME_LIMIT=1200
# Autoscale workers (format is max,min)
AIRFLOW__CELERY__WORKER_AUTOSCALE=16,4

Using Environment Variables for Customization with Nixpacks

If you need to customize build or start commands when not using a Dockerfile, Klutch.sh uses Nixpacks and supports these environment variables:

  • START_COMMAND: Override the default start command
  • BUILD_COMMAND: Override the default build command

However, for Airflow Worker deployments, using a Dockerfile (as shown above) is strongly recommended for complete control over the environment.
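
For completeness, a hypothetical Nixpacks-based setup might override the commands roughly like this (a sketch only; the exact values depend on your repository layout, and the Dockerfile approach above remains the recommended path):

Terminal window
# Hypothetical overrides in the Klutch.sh environment panel (not needed when a Dockerfile is present)
BUILD_COMMAND=pip install "apache-airflow[celery,redis]==2.7.3"
START_COMMAND=airflow celery worker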


Traffic Type and Port Configuration

When deploying Airflow Workers on Klutch.sh:

  • Traffic Type: Select TCP in the Klutch.sh dashboard (workers don’t serve HTTP traffic)
  • Internal Port: Set to 8793 (Airflow worker’s log server port for monitoring)
  • External Access: Workers are backend services and typically don’t require public access

Airflow Workers listen on port 8793 for the log server, which allows the webserver to fetch task logs. When TCP traffic is selected, Klutch.sh routes connections on port 8000 to your container’s specified internal port (8793 in this case).
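
To confirm the log server is reachable from inside your network (for example, from the webserver host), a simple check might look like this; any HTTP response, even a 404 for the root path, shows the port is open (the hostname is a placeholder):

Terminal window
curl -I http://your-worker-host:8793/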

Important Traffic Notes

  • Workers primarily communicate with the message broker and database, not external HTTP traffic
  • The log server (port 8793) is used by the Airflow webserver to fetch logs
  • If you deploy Flower (Celery monitoring tool), it typically runs on port 5555
  • Most worker deployments don’t need external port exposure unless you’re running Flower

Connecting to Your Airflow Infrastructure

For workers to function properly, they need access to three core services:

    1. Message Broker (Redis recommended):

      Deploy Redis on Klutch.sh or use an external Redis service:

      Terminal window
      AIRFLOW__CELERY__BROKER_URL=redis://your-redis-host:6379/0
    2. PostgreSQL Database:

      Deploy PostgreSQL on Klutch.sh or use an external database service:

      Terminal window
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@postgres-host:5432/airflow
    3. Scheduler and Webserver:

      Your scheduler must be running and connected to the same broker and database. See the Airflow deployment guide for scheduler setup.

Network Connectivity

Ensure that:

  • Workers can reach the Redis/RabbitMQ broker
  • Workers can reach the PostgreSQL database
  • Network security groups allow traffic between components
  • Connection strings use accessible hostnames or IP addresses
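
A quick way to verify these requirements from a shell inside the worker container (a sketch; it assumes the redis-cli and psql clients are installed and uses placeholder hostnames and credentials):

Terminal window
# Broker check: expect PONG
redis-cli -h your-redis-host -p 6379 ping
# Database check: expect a single-row result
psql "postgresql://username:password@postgres-host:5432/airflow" -c "SELECT 1;"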

Scaling and Production Best Practices

    1. Horizontal Scaling:

      • Deploy multiple worker instances to handle higher task loads
      • Each worker instance should have the same configuration
      • Celery automatically distributes tasks across available workers
      • Use Klutch.sh’s instance scaling features to add more workers
    2. Worker Concurrency:

      • Configure AIRFLOW__CELERY__WORKER_CONCURRENCY based on CPU cores
      • General rule: Set to number of CPU cores or slightly higher
      • Monitor CPU usage and adjust accordingly
      • Consider task type (CPU-bound vs I/O-bound)
    3. Resource Allocation:

      • Allocate sufficient memory based on task requirements
      • CPU-intensive tasks need more CPU cores
      • I/O-intensive tasks benefit from higher concurrency
      • Monitor resource usage in Klutch.sh dashboard
    4. Queue Management:

      • Use multiple Celery queues for task prioritization
      • Deploy specialized workers for specific queues
      • Example: --queues=high_priority,default,low_priority
      • Configure in your worker startup command (see the sketch after this list)
    5. High Availability:

      • Deploy workers across multiple availability zones
      • Run at least 2-3 worker instances for redundancy
      • Workers are stateless and can be easily replaced
      • Failed workers don’t lose tasks (handled by message broker)
    6. Monitoring and Logging:

      • Monitor worker health through Celery events
      • Use Flower for real-time worker monitoring
      • Check logs regularly in mounted volume
      • Set up alerts for worker failures
    7. Security:

      • Use strong passwords for Redis and PostgreSQL
      • Keep Fernet key secure and consistent
      • Restrict network access to worker services
      • Regularly update Airflow and dependencies
    8. Task Isolation:

      • Use virtual environments or containers for task dependencies
      • Consider KubernetesPodOperator for complete task isolation
      • Prevent dependency conflicts between tasks
      • Keep worker images lightweight
    9. DAG Synchronization:

      • Ensure workers have the same DAG files as scheduler
      • Use Git-sync for automatic DAG updates
      • Or rebuild worker images when DAGs change
      • Test DAGs before deploying to production workers
    10. Performance Optimization:

      • Enable worker_prefetch_multiplier for better task distribution
      • Use connection pooling for database connections
      • Configure appropriate task timeouts
      • Implement task retry logic for transient failures
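
As an example of the queue management point above, a worker dedicated to a single queue could be started like this (the queue name and concurrency are placeholders to adapt to your setup):

Terminal window
airflow celery worker --queues high_priority --concurrency 8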

Monitoring Your Workers

Using Flower for Worker Monitoring

Flower is a real-time monitoring tool for Celery. To add Flower to your deployment:

  1. Add Flower to your requirements.txt:

    flower==2.0.1
  2. Deploy Flower as a separate app on Klutch.sh:

    Terminal window
    celery --app=airflow.providers.celery.executors.celery_executor.app flower
  3. Access Flower UI:

    • Deploy with HTTP traffic type
    • Set internal port to 5555
    • Access at your Klutch.sh URL: https://example-app.klutch.sh

Checking Worker Status

Beyond Flower, you can confirm that workers are healthy from the Airflow webserver and the Celery CLI:

  • In the webserver, open Browse → Task Instances to confirm tasks move from queued to running
  • Tasks that linger in the queued state usually point to missing, unreachable, or overloaded workers
  • Use the Celery CLI inspect active command (shown earlier) to list the tasks each worker is executing
  • Watch queue lengths and per-worker utilization in Flower or your broker’s own metrics

Troubleshooting

Common Issues

Worker not connecting to broker:

  • Verify Redis/RabbitMQ connection string
  • Check network connectivity
  • Ensure broker is running and accessible
  • Review worker logs for connection errors

Tasks not being picked up:

  • Verify worker is running: celery inspect active
  • Check that DAGs are loaded correctly
  • Ensure scheduler is running and assigning tasks
  • Verify queue names match between scheduler and workers

Database connection errors:

  • Confirm PostgreSQL connection string
  • Check database credentials
  • Ensure database is accessible from worker
  • Verify database has required Airflow tables

Out of disk space:

  • Monitor log volume usage in Klutch.sh dashboard
  • Increase volume size as needed
  • Implement log rotation policies
  • Clean up old log files regularly
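
A simple cleanup sketch that could be run periodically inside the container (the 30-day retention window is an assumption; adjust it to your own policy):

Terminal window
# Delete task log files older than 30 days from the mounted logs volume
find /opt/airflow/logs -type f -name "*.log" -mtime +30 -delete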

High memory usage:

  • Reduce worker concurrency
  • Check for memory leaks in custom operators
  • Monitor task memory consumption
  • Increase worker memory allocation

Tasks timing out:

  • Increase AIRFLOW__CELERY__TASK_TIME_LIMIT
  • Check task complexity and optimize
  • Ensure sufficient worker resources
  • Review task logs for bottlenecks

Deploying Apache Airflow Workers on Klutch.sh provides scalable, distributed task execution for your data workflows. With proper configuration, persistent storage, and production best practices, you can build a robust and reliable task processing infrastructure that scales with your needs.