Deploying Airflow Worker
Apache Airflow Worker is a critical component of the Apache Airflow distributed workflow orchestration system. Workers execute tasks that are scheduled by the Airflow scheduler, providing horizontal scalability for processing complex data pipelines and workflows. By deploying Airflow Workers on Klutch.sh, you can leverage scalable cloud infrastructure, automated deployments from GitHub, persistent storage for logs, and flexible compute resources to handle intensive task execution workloads.
This comprehensive guide walks you through deploying Apache Airflow Workers on Klutch.sh using a Dockerfile. You’ll learn how to set up your environment, create a production-ready Dockerfile, configure persistent storage for logs and task metadata, manage environment variables for connecting to your Airflow infrastructure, and follow best practices for scaling and monitoring your worker deployment.
Prerequisites
Before deploying an Airflow Worker to Klutch.sh, ensure you have:
- A Klutch.sh account
- A GitHub repository for your Airflow Worker project
- A running Airflow scheduler and webserver instance (see the Airflow deployment guide)
- A message broker (Redis or RabbitMQ) for Celery task distribution
- A metadata database (PostgreSQL recommended) shared with your scheduler
- Basic knowledge of Docker, Python, and Airflow architecture
- Understanding of distributed systems and message queues
Understanding Airflow Worker Architecture
Before deployment, it’s important to understand how Airflow Workers fit into the overall Airflow architecture:
- Scheduler: Monitors DAGs and assigns tasks to the message queue
- Message Broker: Distributes tasks to available workers (Redis or RabbitMQ)
- Workers: Execute tasks and report results back through the result backend
- Metadata Database: Stores task states, DAG definitions, and execution history
- Webserver: Provides the UI for monitoring (separate from workers)
Workers communicate with the scheduler through a message broker using the Celery executor. Multiple workers can be deployed to scale task execution horizontally, enabling parallel processing of multiple tasks across different workflows.
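To make this concrete, here is a minimal example DAG (the dag_id and task names are placeholders chosen for illustration) that the scheduler would parse and queue, and that any Celery worker listening on the default queue would then execute:

```python
# dags/example_worker_dag.py -- minimal DAG to illustrate the scheduler/worker split.
# The scheduler parses this file and pushes runnable tasks onto the Celery queue;
# any available worker picks them up and runs the Python callables.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting data on a worker")


def transform():
    print("transforming data on a worker")


with DAG(
    dag_id="example_worker_dag",      # placeholder name for illustration
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task    # transform runs only after extract completes
```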
Getting Started: Install Airflow Worker Locally
Before deploying to Klutch.sh, it’s helpful to understand how Airflow Workers operate by testing locally. This section demonstrates setting up a worker that connects to an existing Airflow infrastructure.
- Create a project directory and set up a virtual environment:

  ```bash
  mkdir airflow-worker
  cd airflow-worker
  python3 -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install Apache Airflow with Celery support (use the constraints file that matches your Python version):

  ```bash
  pip install "apache-airflow[celery,redis]==2.7.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.8.txt"
  ```

- Set essential environment variables:

  ```bash
  export AIRFLOW_HOME=$(pwd)/airflow
  export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
  export AIRFLOW__CELERY__BROKER_URL=redis://localhost:6379/0
  export AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://username:password@localhost:5432/airflow
  export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@localhost:5432/airflow
  ```

  Note: Replace the connection strings with your actual Redis and PostgreSQL connection details.

- Ensure your DAGs directory exists:

  ```bash
  mkdir -p $AIRFLOW_HOME/dags
  ```

- Start the Airflow worker:

  ```bash
  airflow celery worker
  ```

  The worker will connect to your message broker and begin listening for tasks from the scheduler.
- Verify worker operation:

  In a separate terminal, check that the worker has registered with the broker, either through Flower or with the Celery CLI, for example:

  ```bash
  celery --broker redis://localhost:6379/0 inspect active
  ```
Once your local worker connects successfully and can execute tasks from your scheduler, you’re ready to containerize and deploy to Klutch.sh.
Deploying Airflow Worker with a Dockerfile
Klutch.sh automatically detects a Dockerfile in your repository’s root directory and uses it to build and deploy your application. This approach provides complete control over your worker environment, dependencies, and configuration.
- Create a Dockerfile in your project root:

  ```dockerfile
  # Use the official Apache Airflow image as the base
  FROM apache/airflow:2.7.3-python3.11

  # Set working directory
  WORKDIR /opt/airflow

  # Switch to root to install system dependencies if needed
  USER root
  RUN apt-get update && apt-get install -y --no-install-recommends \
      build-essential \
      libpq-dev \
      && apt-get clean \
      && rm -rf /var/lib/apt/lists/*

  # Switch back to the airflow user for security
  USER airflow

  # Copy and install Python dependencies
  COPY requirements.txt /opt/airflow/requirements.txt
  RUN pip install --no-cache-dir -r requirements.txt

  # Copy DAGs directory (workers need access to DAG definitions)
  COPY dags /opt/airflow/dags

  # Copy plugins if you have custom operators or hooks
  COPY plugins /opt/airflow/plugins

  # Set environment variables
  ENV AIRFLOW_HOME=/opt/airflow
  ENV AIRFLOW__CORE__LOAD_EXAMPLES=False
  ENV AIRFLOW__CORE__EXECUTOR=CeleryExecutor

  # Expose the worker log-server port (optional)
  EXPOSE 8793

  # Start the Celery worker
  CMD ["airflow", "celery", "worker", "--concurrency", "4"]
  ```
- Create a requirements.txt file:

  ```text
  apache-airflow[celery,redis]==2.7.3
  apache-airflow-providers-postgres==5.7.1
  psycopg2-binary==2.9.9
  redis==5.0.1
  celery==5.3.4
  # Add any custom dependencies your DAGs require
  pandas==2.1.3
  requests==2.31.0
  boto3==1.29.7
  ```
- Create the necessary directory structure:

  ```bash
  mkdir -p dags plugins logs config
  ```
- Copy your DAG files to the dags directory:

  Workers need access to the same DAG definitions as your scheduler. Either:

  - Copy DAGs into the image (shown in the Dockerfile above)
  - Or mount a shared volume containing DAGs (covered in the persistent storage section)
- Create a .dockerignore file:

  ```text
  venv/
  __pycache__/
  *.pyc
  *.pyo
  *.pyd
  .git/
  .gitignore
  .env
  .DS_Store
  *.log
  logs/
  airflow.db
  airflow.cfg
  webserver_config.py
  ```
- Test your Docker build locally:

  ```bash
  docker build -t airflow-worker:local .
  docker run --rm \
    -e AIRFLOW__CELERY__BROKER_URL=redis://your-redis:6379/0 \
    -e AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://user:pass@db:5432/airflow \
    airflow-worker:local
  ```
- Initialize a Git repository and push to GitHub:

  ```bash
  git init
  git add .
  git commit -m "Initial Airflow Worker setup for Klutch.sh"
  git remote add origin https://github.com/your-username/airflow-worker.git
  git push -u origin main
  ```
- Deploy to Klutch.sh:
- Log in to Klutch.sh
- Create a new project or select an existing one
- Create a new app:
- Select your GitHub repository
- Choose the branch you want to deploy (e.g., main)
- Klutch.sh will automatically detect your Dockerfile
- Select TCP as the traffic type (workers don’t serve HTTP traffic)
- Set the internal port to 8793 (Airflow worker’s default monitoring port)
- Choose your preferred region and compute resources
- Configure environment variables (see next section)
- Attach persistent volumes for logs
- Click “Create” to start the deployment
Klutch.sh will build your Docker image and deploy your Airflow Worker. The worker will automatically connect to your message broker and begin processing tasks.
Persistent Storage
Airflow Workers require persistent storage for task execution logs and temporary data. Klutch.sh provides persistent volumes that survive deployments and container restarts.
Required Volumes for Workers
When creating your worker app on Klutch.sh, attach persistent volumes with these mount paths:
- Logs Directory:
  - Mount Path: /opt/airflow/logs
  - Recommended Size: 20-100 GB (depending on task volume and log retention)
  - Purpose: Stores task execution logs for debugging and monitoring

- Worker State Directory (optional):
  - Mount Path: /opt/airflow/worker_state
  - Recommended Size: 5 GB
  - Purpose: Stores worker process information and state
Important Notes About Volumes
- In Klutch.sh, you only specify the mount path and size when creating volumes
- Ensure your container process has write permissions to these directories
- Workers don’t need access to the metadata database files (unlike the scheduler)
- The DAGs directory can either be baked into the Docker image or mounted as a shared volume
- Regularly monitor disk usage and adjust volume sizes as needed
Sharing DAGs via Volumes (Alternative Approach)
If you prefer not to rebuild the Docker image every time DAGs change, you can mount a shared DAGs volume:
- Mount Path: /opt/airflow/dags
- Size: 5-10 GB
- Note: This volume should contain the same DAG files as your scheduler
This approach requires setting up a shared storage solution between your scheduler and workers, which can be achieved through:
- Git-sync sidecar containers (advanced setup)
- External storage mounted to both scheduler and workers
- Or simply rebuilding worker images when DAGs change (simpler approach)
Environment Variables
Configure your Airflow Worker using environment variables. Add these in the Klutch.sh app configuration panel:
Essential Worker Environment Variables
```bash
# Core Configuration
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
AIRFLOW__CORE__LOAD_EXAMPLES=False
AIRFLOW__CORE__FERNET_KEY=<your-fernet-key>

# Database Configuration (must match the scheduler's database)
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@your-postgres-host:5432/airflow

# Celery Configuration
AIRFLOW__CELERY__BROKER_URL=redis://your-redis-host:6379/0
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://username:password@your-postgres-host:5432/airflow
AIRFLOW__CELERY__WORKER_CONCURRENCY=4
AIRFLOW__CELERY__WORKER_PREFETCH_MULTIPLIER=1

# Logging Configuration
AIRFLOW__LOGGING__BASE_LOG_FOLDER=/opt/airflow/logs
AIRFLOW__LOGGING__REMOTE_LOGGING=False
AIRFLOW__LOGGING__LOGGING_LEVEL=INFO

# Worker Configuration
AIRFLOW__CELERY__WORKER_LOG_SERVER_PORT=8793
AIRFLOW__CELERY__FLOWER_PORT=5555
```
Critical Configuration Requirements
- Fernet Key: Must be the same across all Airflow components (scheduler, webserver, workers)

  ```bash
  # Generate with:
  python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
  ```

- Database Connection: All workers must connect to the same metadata database as the scheduler
- Broker URL: All workers must connect to the same message broker (Redis or RabbitMQ)
- Result Backend: Should point to your metadata database for task result storage
Optional Performance Tuning Variables
```bash
# Adjust worker concurrency based on workload
AIRFLOW__CELERY__WORKER_CONCURRENCY=8

# Pool management
AIRFLOW__CELERY__WORKER_PREFETCH_MULTIPLIER=2

# Task timeout settings
AIRFLOW__CELERY__TASK_SOFT_TIME_LIMIT=600
AIRFLOW__CELERY__TASK_TIME_LIMIT=1200

# Autoscale workers (max,min)
AIRFLOW__CELERY__WORKER_AUTOSCALE=16,4
```
Using Environment Variables for Customization with Nixpacks
If you need to customize build or start commands when not using a Dockerfile, Klutch.sh uses Nixpacks and supports these environment variables:
- START_COMMAND: Override the default start command
- BUILD_COMMAND: Override the default build command
However, for Airflow Worker deployments, using a Dockerfile (as shown above) is strongly recommended for complete control over the environment.
Traffic Type and Port Configuration
When deploying Airflow Workers on Klutch.sh:
- Traffic Type: Select TCP in the Klutch.sh dashboard (workers don’t serve HTTP traffic)
- Internal Port: Set to 8793 (Airflow worker’s log server port for monitoring)
- External Access: Workers are backend services and typically don’t require public access
Airflow Workers listen on port 8793 for the log server, which allows the webserver to fetch task logs. When TCP traffic is selected, Klutch.sh routes connections on port 8000 to your container’s specified internal port (8793 in this case).
Important Traffic Notes
- Workers primarily communicate with the message broker and database, not external HTTP traffic
- The log server (port 8793) is used by the Airflow webserver to fetch logs
- If you deploy Flower (Celery monitoring tool), it typically runs on port 5555
- Most worker deployments don’t need external port exposure unless you’re running Flower
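If you want to confirm that the worker's log server is actually reachable from your webserver's network, a quick socket probe is enough. This is a minimal sketch; `your-worker-host` is a placeholder for the hostname your webserver uses to reach the worker:

```python
# check_log_server.py -- probe the Airflow worker's log-server port (default 8793).
import socket

HOST = "your-worker-host"  # placeholder hostname
PORT = 8793                # AIRFLOW__CELERY__WORKER_LOG_SERVER_PORT

try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"Log server reachable at {HOST}:{PORT}")
except OSError as exc:
    print(f"Could not reach {HOST}:{PORT}: {exc}")
```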
Connecting to Your Airflow Infrastructure
For workers to function properly, they need access to three core services:
- Message Broker (Redis recommended):

  Deploy Redis on Klutch.sh or use an external Redis service:

  ```bash
  AIRFLOW__CELERY__BROKER_URL=redis://your-redis-host:6379/0
  ```

- PostgreSQL Database:

  Deploy PostgreSQL on Klutch.sh or use an external database service:

  ```bash
  AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@postgres-host:5432/airflow
  ```

- Scheduler and Webserver:

  Your scheduler must be running and connected to the same broker and database. See the Airflow deployment guide for scheduler setup.
Network Connectivity
Ensure that:
- Workers can reach the Redis/RabbitMQ broker
- Workers can reach the PostgreSQL database
- Network security groups allow traffic between components
- Connection strings use accessible hostnames or IP addresses
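As a sketch of such a check, the script below (using the redis and psycopg2-binary packages already in requirements.txt, with placeholder hostnames and credentials) verifies that both the broker and the metadata database answer from the worker's environment:

```python
# preflight.py -- verify broker and database connectivity from the worker environment.
# Hostnames and credentials are placeholders; reuse the values from your Airflow env vars.
import psycopg2
import redis

BROKER_URL = "redis://your-redis-host:6379/0"
DB_DSN = "dbname=airflow user=username password=password host=your-postgres-host port=5432"

# Check the Celery broker.
try:
    redis.Redis.from_url(BROKER_URL).ping()
    print("Redis broker reachable")
except redis.RedisError as exc:
    print(f"Redis check failed: {exc}")

# Check the Airflow metadata database.
try:
    with psycopg2.connect(DB_DSN) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
    print("PostgreSQL metadata database reachable")
except psycopg2.Error as exc:
    print(f"PostgreSQL check failed: {exc}")
```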
Scaling and Production Best Practices
- Horizontal Scaling:
  - Deploy multiple worker instances to handle higher task loads
  - Each worker instance should have the same configuration
  - Celery automatically distributes tasks across available workers
  - Use Klutch.sh's instance scaling features to add more workers

- Worker Concurrency:
  - Configure AIRFLOW__CELERY__WORKER_CONCURRENCY based on CPU cores
  - General rule: Set it to the number of CPU cores or slightly higher
  - Monitor CPU usage and adjust accordingly
  - Consider task type (CPU-bound vs I/O-bound)

- Resource Allocation:
  - Allocate sufficient memory based on task requirements
  - CPU-intensive tasks need more CPU cores
  - I/O-intensive tasks benefit from higher concurrency
  - Monitor resource usage in the Klutch.sh dashboard

- Queue Management:
  - Use multiple Celery queues for task prioritization
  - Deploy specialized workers for specific queues
  - Example: --queues=high_priority,default,low_priority
  - Configure the queues in your worker startup command (see the example DAG after this list)

- High Availability:
  - Deploy workers across multiple availability zones
  - Run at least 2-3 worker instances for redundancy
  - Workers are stateless and can be easily replaced
  - Failed workers don't lose tasks (handled by the message broker)

- Monitoring and Logging:
  - Monitor worker health through Celery events
  - Use Flower for real-time worker monitoring
  - Check logs regularly in the mounted volume
  - Set up alerts for worker failures

- Security:
  - Use strong passwords for Redis and PostgreSQL
  - Keep the Fernet key secure and consistent
  - Restrict network access to worker services
  - Regularly update Airflow and dependencies

- Task Isolation:
  - Use virtual environments or containers for task dependencies
  - Consider KubernetesPodOperator for complete task isolation
  - Prevent dependency conflicts between tasks
  - Keep worker images lightweight

- DAG Synchronization:
  - Ensure workers have the same DAG files as the scheduler
  - Use Git-sync for automatic DAG updates
  - Or rebuild worker images when DAGs change
  - Test DAGs before deploying to production workers

- Performance Optimization:
  - Tune worker_prefetch_multiplier for better task distribution
  - Use connection pooling for database connections
  - Configure appropriate task timeouts
  - Implement task retry logic for transient failures (shown in the example DAG below)
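As referenced in the Queue Management and Performance Optimization items above, the sketch below shows how a task can be routed to a dedicated queue and given retry behavior. The dag_id and queue name are illustrative and must match the `--queues` list your specialized workers consume:

```python
# dags/priority_dag.py -- route a task to a specific Celery queue and add retries.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def call_flaky_api():
    print("calling an external API that occasionally times out")


with DAG(
    dag_id="priority_dag",                # placeholder name for illustration
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="call_flaky_api",
        python_callable=call_flaky_api,
        queue="high_priority",            # picked up only by workers serving this queue
        retries=3,                        # retry transient failures
        retry_delay=timedelta(minutes=5),
    )
```

A worker dedicated to that queue would then be started with `airflow celery worker --queues high_priority`.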
Monitoring Your Workers
Using Flower for Worker Monitoring
Flower is a real-time monitoring tool for Celery. To add Flower to your deployment:
- Add Flower to your requirements.txt:

  ```text
  flower==2.0.1
  ```

- Deploy Flower as a separate app on Klutch.sh with the following start command:

  ```bash
  airflow celery flower
  ```

- Access the Flower UI:
  - Deploy with HTTP traffic type
  - Set the internal port to 5555
  - Access it at your Klutch.sh URL: https://example-app.klutch.sh
Checking Worker Status
Monitor worker activity alongside Flower using your existing Airflow setup:
- Use Flower or the Celery CLI to view active workers and their tasks
- Confirm in the Airflow UI that task instances move from queued to running
- Check worker resource utilization in the Klutch.sh dashboard
- Monitor task queue lengths
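If you prefer scripted checks, Celery's control API can query workers directly through the broker. This is a minimal sketch that assumes the same broker URL as your worker configuration:

```python
# inspect_workers.py -- query active Celery workers directly through the broker.
# The broker URL is a placeholder; use the value of AIRFLOW__CELERY__BROKER_URL.
from celery import Celery

app = Celery(broker="redis://your-redis-host:6379/0")

inspector = app.control.inspect(timeout=5)
replies = inspector.ping() or {}          # empty dict if no workers responded

for worker_name in sorted(replies):
    active = (inspector.active() or {}).get(worker_name, [])
    print(f"{worker_name}: {len(active)} active task(s)")

if not replies:
    print("No workers responded -- check broker connectivity and worker logs.")
```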
Troubleshooting
Common Issues
Worker not connecting to broker:
- Verify Redis/RabbitMQ connection string
- Check network connectivity
- Ensure broker is running and accessible
- Review worker logs for connection errors
Tasks not being picked up:
- Verify the worker is running and registered with the broker (e.g., via Flower or celery inspect active)
- Check that DAGs are loaded correctly
- Ensure scheduler is running and assigning tasks
- Verify queue names match between scheduler and workers
Database connection errors:
- Confirm PostgreSQL connection string
- Check database credentials
- Ensure database is accessible from worker
- Verify database has required Airflow tables
Out of disk space:
- Monitor log volume usage in Klutch.sh dashboard
- Increase volume size as needed
- Implement log rotation policies
- Clean up old log files regularly
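For a simple retention policy on the log volume, a small cleanup script like the sketch below (the path and 30-day window are assumptions to adjust) can be run periodically, for example from a maintenance DAG:

```python
# cleanup_logs.py -- delete worker log files older than a retention window.
import time
from pathlib import Path

LOG_DIR = Path("/opt/airflow/logs")  # mounted log volume
RETENTION_DAYS = 30                  # adjust to your retention policy
cutoff = time.time() - RETENTION_DAYS * 86400

removed = 0
for path in LOG_DIR.rglob("*.log"):
    if path.is_file() and path.stat().st_mtime < cutoff:
        path.unlink()
        removed += 1

print(f"Removed {removed} log file(s) older than {RETENTION_DAYS} days")
```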
High memory usage:
- Reduce worker concurrency
- Check for memory leaks in custom operators
- Monitor task memory consumption
- Increase worker memory allocation
Tasks timing out:
- Increase AIRFLOW__CELERY__TASK_TIME_LIMIT
- Check task complexity and optimize
- Ensure sufficient worker resources
- Review task logs for bottlenecks
Resources
- Airflow Celery Executor Documentation
- Airflow Production Deployment Guide
- Celery Documentation
- Flower Monitoring Tool
- Klutch.sh Airflow Deployment Guide
- Klutch.sh Volumes Guide
- Klutch.sh Deployments
- Klutch.sh Getting Started
Deploying Apache Airflow Workers on Klutch.sh provides scalable, distributed task execution for your data workflows. With proper configuration, persistent storage, and production best practices, you can build a robust and reliable task processing infrastructure that scales with your needs.