Deploying Airflow
Apache Airflow is a powerful open-source platform for authoring, scheduling, and monitoring workflows programmatically. Originally developed by Airbnb, Airflow enables data engineers and developers to create complex data pipelines as Directed Acyclic Graphs (DAGs) using Python code. With its rich UI, extensive integrations, and robust scheduling capabilities, Airflow has become the industry standard for orchestrating data workflows, ETL processes, and ML pipelines at scale.
This comprehensive guide walks you through deploying Apache Airflow on Klutch.sh using a Dockerfile. You’ll learn how to set up your environment, configure persistent storage for logs and metadata, manage environment variables, and follow best practices for production deployments.
Prerequisites
Before deploying Airflow to Klutch.sh, ensure you have:
- A Klutch.sh account
- A GitHub repository for your Airflow project
- Basic knowledge of Docker and Python
- Understanding of workflow orchestration concepts
- Python 3.8 or higher installed locally for development
Getting Started: Install Airflow Locally
Before deploying to Klutch.sh, it’s recommended to test Airflow locally to understand its structure and create your DAGs.
1. Create a project directory and set up a virtual environment:

```bash
mkdir my-airflow-project
cd my-airflow-project
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

2. Install Apache Airflow:

```bash
pip install "apache-airflow==2.7.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.8.txt"
```

The constraints file pins compatible dependency versions. If you run a different Python version locally, swap 3.8 in the URL for your version (for example, constraints-3.11.txt).

3. Initialize the Airflow database:

```bash
export AIRFLOW_HOME=$(pwd)/airflow
airflow db init
```

4. Create an admin user:

```bash
airflow users create \
  --username admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email admin@example.com \
  --password admin
```

5. Start the Airflow webserver and scheduler:

In separate terminal windows (both with your virtual environment activated):

```bash
# Terminal 1: Start the webserver
airflow webserver --port 8080

# Terminal 2: Start the scheduler
airflow scheduler
```

6. Access the Airflow UI:

Open your browser and navigate to http://localhost:8080. Log in with the credentials you created (username: admin, password: admin).
Sample DAG Code
Create a sample DAG to test your Airflow installation. Create a file named hello_airflow.py in your airflow/dags directory:
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

# Define default arguments
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
dag = DAG(
    'hello_airflow_dag',
    default_args=default_args,
    description='A simple Hello World DAG',
    schedule_interval=timedelta(days=1),
    catchup=False,
    tags=['example', 'hello'],
)

def print_hello():
    """Simple Python function to print a hello message"""
    print("Hello from Airflow on Klutch.sh!")
    return "Hello from Airflow!"

def print_date():
    """Print the current date"""
    print(f"Current date: {datetime.now()}")
    return datetime.now()

# Define tasks
hello_task = PythonOperator(
    task_id='hello_task',
    python_callable=print_hello,
    dag=dag,
)

date_task = PythonOperator(
    task_id='date_task',
    python_callable=print_date,
    dag=dag,
)

bash_task = BashOperator(
    task_id='bash_task',
    bash_command='echo "Running Bash task on Klutch.sh"',
    dag=dag,
)

# Set task dependencies
hello_task >> date_task >> bash_task
```

This DAG demonstrates basic Airflow concepts including Python operators, Bash operators, and task dependencies. Once your local installation is working, you're ready to deploy to Klutch.sh.
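Before starting the scheduler, you can confirm that the DAG file parses and runs as expected. These are standard Airflow CLI commands; the file path assumes the airflow/dags layout created above.

```bash
# Confirm the file imports cleanly (a traceback here means a parsing error)
python airflow/dags/hello_airflow.py

# List registered DAGs; hello_airflow_dag should appear
airflow dags list

# Run a single task once, outside the scheduler
airflow tasks test hello_airflow_dag hello_task 2024-01-01
```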
Deploying Airflow with a Dockerfile
Klutch.sh automatically detects a Dockerfile in your repository’s root directory and uses it to build and deploy your application. This method provides full control over your Airflow installation and dependencies.
1. Create a Dockerfile in your project root:

```dockerfile
# Use the official Apache Airflow image as base
FROM apache/airflow:2.7.3-python3.11

# Set working directory
WORKDIR /opt/airflow

# Switch to root to install system dependencies if needed
USER root
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Switch back to the airflow user
USER airflow

# Copy requirements file
COPY requirements.txt /opt/airflow/requirements.txt

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy DAGs and plugins
COPY dags /opt/airflow/dags
COPY plugins /opt/airflow/plugins

# Expose the webserver port
EXPOSE 8080

# Set environment variables
ENV AIRFLOW_HOME=/opt/airflow
ENV AIRFLOW__CORE__LOAD_EXAMPLES=False

# Initialize the database and create the admin user (ignoring the error if it
# already exists), then start the webserver in the background and the
# scheduler in the foreground.
# Note: For production, use a separate database (PostgreSQL)
CMD ["bash", "-c", "airflow db init && (airflow users create --username admin --firstname Admin --lastname User --role Admin --email admin@example.com --password admin || true); airflow webserver & airflow scheduler"]
```

2. Create a requirements.txt file:

```text
apache-airflow==2.7.3
apache-airflow-providers-http==4.5.0
apache-airflow-providers-postgres==5.7.1
psycopg2-binary==2.9.9
redis==5.0.1
celery==5.3.4
```

3. Create the necessary directories:

```bash
mkdir -p dags plugins logs
```

4. Create a .dockerignore file:

```text
venv/
__pycache__/
*.pyc
*.pyo
*.pyd
.git/
.gitignore
.env
.DS_Store
*.log
airflow/logs/
```

5. Initialize a Git repository and push to GitHub:

```bash
git init
git add .
git commit -m "Initial Airflow setup for Klutch.sh"
git remote add origin https://github.com/your-username/your-airflow-repo.git
git push -u origin main
```
6. Deploy to Klutch.sh:

- Log in to Klutch.sh
- Create a new project or select an existing one
- Create a new app:
  - Select your GitHub repository
  - Choose the branch you want to deploy (e.g., main)
  - Klutch.sh will automatically detect your Dockerfile
  - Select HTTP as the traffic type
  - Set the internal port to 8080 (Airflow's default webserver port)
  - Choose your preferred region and compute resources
  - Configure environment variables (see next section)
  - Attach persistent volumes for data persistence
  - Click "Create" to start the deployment

Klutch.sh will build your Docker image and deploy your Airflow instance. Once deployed, your Airflow webserver will be accessible at a URL like example-app.klutch.sh.
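If the build or startup fails, it can help to reproduce it locally before pushing a fix. Below is a minimal sketch using standard Docker commands; the image tag my-airflow is just an example name.

```bash
# Build the image the same way Klutch.sh would
docker build -t my-airflow .

# Run it locally and expose the webserver on port 8080
docker run --rm -p 8080:8080 my-airflow
```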
Persistent Storage
Airflow requires persistent storage for several critical directories to maintain state across deployments and restarts. Klutch.sh provides persistent volumes that you can mount to your container.
Required Volumes
When creating your app on Klutch.sh, attach persistent volumes with the following mount paths:
- Logs Directory:
  - Mount Path: /opt/airflow/logs
  - Recommended Size: 10-50 GB (depending on your workload)
  - Purpose: Stores task execution logs
- Database Directory (if using SQLite):
  - Mount Path: /opt/airflow/airflow.db
  - Recommended Size: 5-10 GB
  - Purpose: Stores Airflow metadata
  - Note: For production, use PostgreSQL instead
- DAGs Directory (optional if DAGs are in the image):
  - Mount Path: /opt/airflow/dags
  - Recommended Size: 5 GB
  - Purpose: Store DAG files separately for hot-reloading
Important Notes
- Volumes in Klutch.sh are created by specifying the mount path and size only
- Ensure your container has proper permissions to write to these directories
- For production deployments, use an external PostgreSQL database instead of SQLite
- Regular backups of your volumes are recommended
Environment Variables
Configure your Airflow deployment using environment variables. Add these in the Klutch.sh app configuration:
Essential Environment Variables
```bash
# Core Configuration
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__CORE__LOAD_EXAMPLES=False
AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
AIRFLOW__CORE__FERNET_KEY=<your-generated-fernet-key>

# Database Configuration (for PostgreSQL)
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@your-postgres-host:5432/airflow

# Webserver Configuration
AIRFLOW__WEBSERVER__BASE_URL=https://example-app.klutch.sh
AIRFLOW__WEBSERVER__WEB_SERVER_PORT=8080
AIRFLOW__WEBSERVER__EXPOSE_CONFIG=True

# Security
AIRFLOW__WEBSERVER__SECRET_KEY=<your-secret-key>
AIRFLOW__API__AUTH_BACKEND=airflow.api.auth.backend.basic_auth

# Logging
AIRFLOW__LOGGING__BASE_LOG_FOLDER=/opt/airflow/logs
AIRFLOW__LOGGING__REMOTE_LOGGING=False
```

Generating Required Keys

Generate a Fernet key for encrypting sensitive data:

```bash
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
```

Generate a secret key for the webserver:

```bash
python -c "import secrets; print(secrets.token_hex(32))"
```

Using Celery Executor (for Scaling)
If you need to scale Airflow with multiple workers, use the Celery executor:
```bash
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CELERY__BROKER_URL=redis://your-redis-host:6379/0
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://username:password@your-postgres-host:5432/airflow
```

For Celery deployments, you'll need to deploy separate worker instances. See the Airflow Worker guide for details.
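As a rough sketch of what a separate worker app would run (assuming it uses the same image and environment variables as the webserver and scheduler), Airflow's CLI provides dedicated Celery subcommands:

```bash
# On each worker instance: start a Celery worker that pulls tasks from the broker
airflow celery worker

# Optionally, on another instance, run Flower to monitor the Celery cluster
airflow celery flower
```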
Traffic Type and Port Configuration
When deploying Airflow on Klutch.sh:
- Traffic Type: Select HTTP in the Klutch.sh dashboard
- Internal Port: Set to 8080 (Airflow webserver’s default port)
- External Access: Your app will be accessible via HTTPS at your assigned Klutch.sh URL (e.g., https://example-app.klutch.sh)
Airflow’s webserver listens on port 8080 by default. When you configure the internal port as 8080, Klutch.sh routes external HTTPS traffic to your container’s port 8080.
Using a Production Database
For production deployments, replace SQLite with PostgreSQL:
1. Deploy a PostgreSQL database:

- Create a separate PostgreSQL app on Klutch.sh or use an external managed database
- Note the connection details (host, port, database name, username, password)

2. Update your environment variables:

```bash
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@postgres-host:5432/airflow
```

3. Update your Dockerfile to include PostgreSQL support:

The requirements.txt above already includes psycopg2-binary for PostgreSQL support, so no Dockerfile changes are needed.

4. Initialize the database:

On first deployment, run:

```bash
airflow db init
```

This is typically handled by your Dockerfile's CMD, but you can also run it manually if needed.
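If you manage the PostgreSQL instance yourself, Airflow needs a dedicated database and a user that owns it. Here is a minimal sketch using the standard PostgreSQL client tools, with placeholder host and role names:

```bash
# Connect as a superuser on your PostgreSQL host (placeholders shown)
createuser --host your-postgres-host --username postgres --pwprompt airflow_user
createdb   --host your-postgres-host --username postgres --owner airflow_user airflow
```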
Monitoring and Logging
Accessing Logs
- Task Logs: Available in the Airflow UI under each task instance
- System Logs: Accessible through Klutch.sh’s logging interface
- Persistent Logs: Stored in the mounted volume at /opt/airflow/logs
Health Checks
Monitor your Airflow deployment:
- Webserver health: https://example-app.klutch.sh/health
- Scheduler status: Check the Airflow UI's Admin → Status page
- Database connectivity: Verify in the Airflow UI
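You can also probe the health endpoint from the command line. The response shape below is what recent Airflow 2.x releases typically return; treat the exact fields as indicative rather than guaranteed:

```bash
# Query the webserver health endpoint
curl -s https://example-app.klutch.sh/health

# A healthy instance typically reports something like:
# {"metadatabase": {"status": "healthy"}, "scheduler": {"status": "healthy", ...}}
```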
Best Practices for Production
1. Use a PostgreSQL Database:
- Never use SQLite in production
- Deploy a managed PostgreSQL instance
- Enable connection pooling

2. Secure Your Deployment:
- Set strong passwords for admin users
- Set AIRFLOW__WEBSERVER__SECRET_KEY and AIRFLOW__CORE__FERNET_KEY
- Enable authentication: AIRFLOW__API__AUTH_BACKEND=airflow.api.auth.backend.basic_auth
- Consider implementing OAuth or LDAP for enterprise environments

3. Scale with Celery:
- For high workloads, use the CeleryExecutor
- Deploy multiple worker instances
- Use Redis or RabbitMQ as the message broker

4. Implement Proper Logging:
- Configure remote logging to cloud storage (S3, GCS)
- Set appropriate log retention policies
- Monitor disk usage in the logs volume

5. Resource Allocation:
- Allocate sufficient CPU and memory based on your DAG complexity
- Monitor resource usage and scale as needed
- Use Klutch.sh's scaling features to adjust compute resources

6. DAG Best Practices:
- Keep DAGs in version control
- Use variables and connections for configuration (see the sketch after this list)
- Implement proper error handling and retries
- Add documentation to your DAGs

7. Regular Maintenance:
- Update Airflow versions regularly
- Clean up old task instances and logs
- Monitor database size and performance
- Back up your database and volumes regularly

8. Testing:
- Test DAGs locally before deploying
- Use Airflow's testing utilities
- Implement CI/CD for DAG validation
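To illustrate the variables-and-connections point from the DAG best practices above, here is a minimal, hypothetical task that reads configuration from an Airflow Variable and a Connection instead of hardcoding values. The variable name api_base_url and connection id my_postgres are placeholders you would define in the Airflow UI or via environment variables.

```python
from airflow.decorators import task
from airflow.hooks.base import BaseHook
from airflow.models import Variable

@task(retries=2)
def fetch_config():
    # Read runtime configuration from an Airflow Variable (placeholder name)
    base_url = Variable.get("api_base_url", default_var="https://example.com")

    # Look up credentials stored as an Airflow Connection (placeholder id)
    conn = BaseHook.get_connection("my_postgres")
    print(f"API base URL: {base_url}")
    print(f"Database host: {conn.host}, schema: {conn.schema}")
```

Inside a DAG definition, calling fetch_config() wires the task in like any other operator, and configuration changes no longer require editing DAG code.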
Customizing Airflow with Nixpacks
While this guide focuses on Dockerfile deployments, Klutch.sh also uses Nixpacks to automatically build applications without a Dockerfile. If you need to customize build or start commands when not using a Dockerfile, you can set these environment variables:
- START_COMMAND: Override the default start command
- BUILD_COMMAND: Override the default build command
However, for Airflow deployments, using a Dockerfile is strongly recommended for full control over the environment.
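As a purely illustrative example (not a recommended production setup), a Nixpacks-based deployment might override these as follows; airflow standalone starts all components in one process and is intended for local or development use:

```bash
BUILD_COMMAND=pip install -r requirements.txt
START_COMMAND=airflow standalone
```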
Troubleshooting
Common Issues
Webserver not starting:
- Check logs in Klutch.sh dashboard
- Verify database connection string
- Ensure all required environment variables are set
DAGs not appearing:
- Verify DAGs are in /opt/airflow/dags
- Check file permissions
- Look for parsing errors in logs
Database initialization fails:
- Verify PostgreSQL connection
- Check database credentials
- Ensure database exists and user has proper permissions
Out of disk space:
- Increase volume sizes for logs
- Implement log rotation
- Clean up old task instances
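For the "DAGs not appearing" case in particular, the Airflow CLI can surface import problems directly (the file path below is an example):

```bash
# Show any DAG files that failed to import, with their tracebacks
airflow dags list-import-errors

# Re-parse a specific DAG file and watch for errors
python /opt/airflow/dags/hello_airflow.py
```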
Resources
- Apache Airflow Documentation
- Airflow Docker Documentation
- Klutch.sh Volumes Guide
- Klutch.sh Deployments
- Klutch.sh Getting Started
Deploying Apache Airflow on Klutch.sh provides a powerful, scalable platform for orchestrating your data workflows. With proper configuration, persistent storage, and production best practices, you can run complex DAGs reliably and efficiently.