Deploying Airflow
Apache Airflow is a powerful open-source platform for authoring, scheduling, and monitoring workflows programmatically. Originally developed by Airbnb, Airflow enables data engineers and developers to create complex data pipelines as Directed Acyclic Graphs (DAGs) using Python code. With its rich UI, extensive integrations, and robust scheduling capabilities, Airflow has become the industry standard for orchestrating data workflows, ETL processes, and ML pipelines at scale.
This comprehensive guide walks you through deploying Apache Airflow on Klutch.sh using a Dockerfile. You’ll learn how to set up your environment, configure persistent storage for logs and metadata, manage environment variables, and follow best practices for production deployments.
Prerequisites
Before deploying Airflow to Klutch.sh, ensure you have:
- A Klutch.sh account
- A GitHub repository for your Airflow project
- Basic knowledge of Docker and Python
- Understanding of workflow orchestration concepts
- Python 3.8 or higher installed locally for development
Getting Started: Install Airflow Locally
Before deploying to Klutch.sh, it’s recommended to test Airflow locally to understand its structure and create your DAGs.
1. Create a project directory and set up a virtual environment:

```bash
mkdir my-airflow-project
cd my-airflow-project
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

2. Install Apache Airflow:

```bash
pip install "apache-airflow==2.7.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.8.txt"
```

The constraints file pins compatible dependency versions. If you run a different Python version locally, swap 3.8 in the URL for your version (for example, constraints-3.11.txt).

3. Initialize the Airflow database:

```bash
export AIRFLOW_HOME=$(pwd)/airflow
airflow db init
```

4. Create an admin user:

```bash
airflow users create \
  --username admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email admin@example.com \
  --password admin
```

5. Start the Airflow webserver and scheduler:

In separate terminal windows (both with your virtual environment activated):

```bash
# Terminal 1: Start the webserver
airflow webserver --port 8080

# Terminal 2: Start the scheduler
airflow scheduler
```

6. Access the Airflow UI:

Open your browser and navigate to http://localhost:8080. Log in with the credentials you created (username: admin, password: admin).
Sample DAG Code
Create a sample DAG to test your Airflow installation. Create a file named hello_airflow.py in your airflow/dags directory:
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

# Define default arguments
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
dag = DAG(
    'hello_airflow_dag',
    default_args=default_args,
    description='A simple Hello World DAG',
    schedule_interval=timedelta(days=1),
    catchup=False,
    tags=['example', 'hello'],
)

def print_hello():
    """Simple Python function to print a hello message"""
    print("Hello from Airflow on Klutch.sh!")
    return "Hello from Airflow!"

def print_date():
    """Print the current date"""
    print(f"Current date: {datetime.now()}")
    return datetime.now()

# Define tasks
hello_task = PythonOperator(
    task_id='hello_task',
    python_callable=print_hello,
    dag=dag,
)

date_task = PythonOperator(
    task_id='date_task',
    python_callable=print_date,
    dag=dag,
)

bash_task = BashOperator(
    task_id='bash_task',
    bash_command='echo "Running Bash task on Klutch.sh"',
    dag=dag,
)

# Set task dependencies
hello_task >> date_task >> bash_task
```

This DAG demonstrates basic Airflow concepts including Python operators, Bash operators, and task dependencies. Once your local installation is working, you're ready to deploy to Klutch.sh.
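Before starting the scheduler, you can confirm that the DAG file parses and runs as expected. These are standard Airflow CLI commands; the file path assumes the airflow/dags layout created above.

```bash
# Confirm the file imports cleanly (a traceback here means a parsing error)
python airflow/dags/hello_airflow.py

# List registered DAGs; hello_airflow_dag should appear
airflow dags list

# Run a single task once, outside the scheduler
airflow tasks test hello_airflow_dag hello_task 2024-01-01
```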
Deploying Airflow with a Dockerfile
Klutch.sh automatically detects a Dockerfile in your repository’s root directory and uses it to build and deploy your application. This method provides full control over your Airflow installation and dependencies.
1. Create a Dockerfile in your project root:

```dockerfile
# Use the official Apache Airflow image as base
FROM apache/airflow:2.7.3-python3.11

# Set working directory
WORKDIR /opt/airflow

# Switch to root to install system dependencies if needed
USER root
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Switch back to the airflow user
USER airflow

# Copy requirements file
COPY requirements.txt /opt/airflow/requirements.txt

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy DAGs and plugins
COPY dags /opt/airflow/dags
COPY plugins /opt/airflow/plugins

# Expose the webserver port
EXPOSE 8080

# Set environment variables
ENV AIRFLOW_HOME=/opt/airflow
ENV AIRFLOW__CORE__LOAD_EXAMPLES=False

# Initialize the database and create the admin user (ignoring the error if it
# already exists), then start the webserver in the background and the
# scheduler in the foreground.
# Note: For production, use a separate database (PostgreSQL)
CMD ["bash", "-c", "airflow db init && (airflow users create --username admin --firstname Admin --lastname User --role Admin --email admin@example.com --password admin || true); airflow webserver & airflow scheduler"]
```

2. Create a requirements.txt file:

```text
apache-airflow==2.7.3
apache-airflow-providers-http==4.5.0
apache-airflow-providers-postgres==5.7.1
psycopg2-binary==2.9.9
redis==5.0.1
celery==5.3.4
```

3. Create the necessary directories:

```bash
mkdir -p dags plugins logs
```

4. Create a .dockerignore file:

```text
venv/
__pycache__/
*.pyc
*.pyo
*.pyd
.git/
.gitignore
.env
.DS_Store
*.log
airflow/logs/
```

5. Initialize a Git repository and push to GitHub:

```bash
git init
git add .
git commit -m "Initial Airflow setup for Klutch.sh"
git remote add origin https://github.com/your-username/your-airflow-repo.git
git push -u origin main
```
6. Deploy to Klutch.sh:

- Log in to Klutch.sh
- Create a new project or select an existing one
- Create a new app:
  - Select your GitHub repository
  - Choose the branch you want to deploy (e.g., main)
  - Klutch.sh will automatically detect your Dockerfile
  - Select HTTP as the traffic type
  - Set the internal port to 8080 (Airflow's default webserver port)
  - Choose your preferred region and compute resources
  - Configure environment variables (see next section)
  - Attach persistent volumes for data persistence
  - Click "Create" to start the deployment

Klutch.sh will build your Docker image and deploy your Airflow instance. Once deployed, your Airflow webserver will be accessible at a URL like example-app.klutch.sh.
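If the build or startup fails, it can help to reproduce it locally before pushing a fix. Below is a minimal sketch using standard Docker commands; the image tag my-airflow is just an example name.

```bash
# Build the image the same way Klutch.sh would
docker build -t my-airflow .

# Run it locally and expose the webserver on port 8080
docker run --rm -p 8080:8080 my-airflow
```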
Persistent Storage
Airflow requires persistent storage for several critical directories to maintain state across deployments and restarts. Klutch.sh provides persistent volumes that you can mount to your container.
Required Volumes
When creating your app on Klutch.sh, attach persistent volumes with the following mount paths:
- Logs Directory:
  - Mount Path: /opt/airflow/logs
  - Recommended Size: 10-50 GB (depending on your workload)
  - Purpose: Stores task execution logs
- Database Directory (if using SQLite):
  - Mount Path: /opt/airflow/airflow.db
  - Recommended Size: 5-10 GB
  - Purpose: Stores Airflow metadata
  - Note: For production, use PostgreSQL instead
- DAGs Directory (optional if DAGs are in the image):
  - Mount Path: /opt/airflow/dags
  - Recommended Size: 5 GB
  - Purpose: Store DAG files separately for hot-reloading
Important Notes
- Volumes in Klutch.sh are created by specifying the mount path and size only
- Ensure your container has proper permissions to write to these directories
- For production deployments, use an external PostgreSQL database instead of SQLite
- Regular backups of your volumes are recommended
Environment Variables
Configure your Airflow deployment using environment variables. Add these in the Klutch.sh app configuration:
Essential Environment Variables
```bash
# Core Configuration
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__CORE__LOAD_EXAMPLES=False
AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
AIRFLOW__CORE__FERNET_KEY=<your-generated-fernet-key>

# Database Configuration (for PostgreSQL)
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@your-postgres-host:5432/airflow

# Webserver Configuration
AIRFLOW__WEBSERVER__BASE_URL=https://example-app.klutch.sh
AIRFLOW__WEBSERVER__WEB_SERVER_PORT=8080
AIRFLOW__WEBSERVER__EXPOSE_CONFIG=True

# Security
AIRFLOW__WEBSERVER__SECRET_KEY=<your-secret-key>
AIRFLOW__API__AUTH_BACKEND=airflow.api.auth.backend.basic_auth

# Logging
AIRFLOW__LOGGING__BASE_LOG_FOLDER=/opt/airflow/logs
AIRFLOW__LOGGING__REMOTE_LOGGING=False
```

Generating Required Keys

Generate a Fernet key for encrypting sensitive data:

```bash
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
```

Generate a secret key for the webserver:

```bash
python -c "import secrets; print(secrets.token_hex(32))"
```

Using Celery Executor (for Scaling)
If you need to scale Airflow with multiple workers, use the Celery executor:
```bash
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CELERY__BROKER_URL=redis://your-redis-host:6379/0
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://username:password@your-postgres-host:5432/airflow
```

For Celery deployments, you'll need to deploy separate worker instances. See the Airflow Worker guide for details.
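As a rough sketch of what a separate worker app would run (assuming it uses the same image and environment variables as the webserver and scheduler), Airflow's CLI provides dedicated Celery subcommands:

```bash
# On each worker instance: start a Celery worker that pulls tasks from the broker
airflow celery worker

# Optionally, on another instance, run Flower to monitor the Celery cluster
airflow celery flower
```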
Traffic Type and Port Configuration
When deploying Airflow on Klutch.sh:
- Traffic Type: Select HTTP in the Klutch.sh dashboard
- Internal Port: Set to 8080 (Airflow webserver’s default port)
- External Access: Your app will be accessible via HTTPS at your assigned Klutch.sh URL (e.g., https://example-app.klutch.sh)
Airflow’s webserver listens on port 8080 by default. When you configure the internal port as 8080, Klutch.sh routes external HTTPS traffic to your container’s port 8080.
Using a Production Database
For production deployments, replace SQLite with PostgreSQL:
1. Deploy a PostgreSQL database:

- Create a separate PostgreSQL app on Klutch.sh or use an external managed database
- Note the connection details (host, port, database name, username, password)

2. Update your environment variables:

```bash
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@postgres-host:5432/airflow
```

3. Update your Dockerfile to include PostgreSQL support:

The requirements.txt above already includes psycopg2-binary for PostgreSQL support, so no Dockerfile changes are needed.

4. Initialize the database:

On first deployment, run:

```bash
airflow db init
```

This is typically handled by your Dockerfile's CMD, but you can also run it manually if needed.
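If you manage the PostgreSQL instance yourself, Airflow needs a dedicated database and a user that owns it. Here is a minimal sketch using the standard PostgreSQL client tools, with placeholder host and role names:

```bash
# Connect as a superuser on your PostgreSQL host (placeholders shown)
createuser --host your-postgres-host --username postgres --pwprompt airflow_user
createdb   --host your-postgres-host --username postgres --owner airflow_user airflow
```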
Monitoring and Logging
Accessing Logs
- Task Logs: Available in the Airflow UI under each task instance
- System Logs: Accessible through Klutch.sh’s logging interface
- Persistent Logs: Stored in the mounted volume at /opt/airflow/logs
Health Checks
Monitor your Airflow deployment:
- Webserver health: https://example-app.klutch.sh/health
- Scheduler status: Check the Airflow UI's Admin → Status page
- Database connectivity: Verify in the Airflow UI
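You can also probe the health endpoint from the command line. The response shape below is what recent Airflow 2.x releases typically return; treat the exact fields as indicative rather than guaranteed:

```bash
# Query the webserver health endpoint
curl -s https://example-app.klutch.sh/health

# A healthy instance typically reports something like:
# {"metadatabase": {"status": "healthy"}, "scheduler": {"status": "healthy", ...}}
```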
Best Practices for Production
1. Use a PostgreSQL Database:
- Never use SQLite in production
- Deploy a managed PostgreSQL instance
- Enable connection pooling

2. Secure Your Deployment:
- Set strong passwords for admin users
- Set AIRFLOW__WEBSERVER__SECRET_KEY and AIRFLOW__CORE__FERNET_KEY
- Enable authentication: AIRFLOW__API__AUTH_BACKEND=airflow.api.auth.backend.basic_auth
- Consider implementing OAuth or LDAP for enterprise environments

3. Scale with Celery:
- For high workloads, use the CeleryExecutor
- Deploy multiple worker instances
- Use Redis or RabbitMQ as the message broker

4. Implement Proper Logging:
- Configure remote logging to cloud storage (S3, GCS)
- Set appropriate log retention policies
- Monitor disk usage in the logs volume

5. Resource Allocation:
- Allocate sufficient CPU and memory based on your DAG complexity
- Monitor resource usage and scale as needed
- Use Klutch.sh's scaling features to adjust compute resources

6. DAG Best Practices:
- Keep DAGs in version control
- Use variables and connections for configuration (see the sketch after this list)
- Implement proper error handling and retries
- Add documentation to your DAGs

7. Regular Maintenance:
- Update Airflow versions regularly
- Clean up old task instances and logs
- Monitor database size and performance
- Back up your database and volumes regularly

8. Testing:
- Test DAGs locally before deploying
- Use Airflow's testing utilities
- Implement CI/CD for DAG validation
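To illustrate the variables-and-connections point from the DAG best practices above, here is a minimal, hypothetical task that reads configuration from an Airflow Variable and a Connection instead of hardcoding values. The variable name api_base_url and connection id my_postgres are placeholders you would define in the Airflow UI or via environment variables.

```python
from airflow.decorators import task
from airflow.hooks.base import BaseHook
from airflow.models import Variable

@task(retries=2)
def fetch_config():
    # Read runtime configuration from an Airflow Variable (placeholder name)
    base_url = Variable.get("api_base_url", default_var="https://example.com")

    # Look up credentials stored as an Airflow Connection (placeholder id)
    conn = BaseHook.get_connection("my_postgres")
    print(f"API base URL: {base_url}")
    print(f"Database host: {conn.host}, schema: {conn.schema}")
```

Inside a DAG definition, calling fetch_config() wires the task in like any other operator, and configuration changes no longer require editing DAG code.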
Customizing Airflow with Nixpacks
While this guide focuses on Dockerfile deployments, Klutch.sh also uses Nixpacks to automatically build applications without a Dockerfile. If you need to customize build or start commands when not using a Dockerfile, you can set these environment variables:
- START_COMMAND: Override the default start command
- BUILD_COMMAND: Override the default build command
However, for Airflow deployments, using a Dockerfile is strongly recommended for full control over the environment.
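As a purely illustrative example (not a recommended production setup), a Nixpacks-based deployment might override these as follows; airflow standalone starts all components in one process and is intended for local or development use:

```bash
BUILD_COMMAND=pip install -r requirements.txt
START_COMMAND=airflow standalone
```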
Troubleshooting
Common Issues
Webserver not starting:
- Check logs in Klutch.sh dashboard
- Verify database connection string
- Ensure all required environment variables are set
DAGs not appearing:
- Verify DAGs are in /opt/airflow/dags
- Check file permissions
- Look for parsing errors in logs
Database initialization fails:
- Verify PostgreSQL connection
- Check database credentials
- Ensure database exists and user has proper permissions
Out of disk space:
- Increase volume sizes for logs
- Implement log rotation
- Clean up old task instances
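For the "DAGs not appearing" case in particular, the Airflow CLI can surface import problems directly (the file path below is an example):

```bash
# Show any DAG files that failed to import, with their tracebacks
airflow dags list-import-errors

# Re-parse a specific DAG file and watch for errors
python /opt/airflow/dags/hello_airflow.py
```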
Resources
- Apache Airflow Documentation
- Airflow Docker Documentation
- Klutch.sh Volumes Guide
- Klutch.sh Deployments
- Klutch.sh Getting Started
Deploying Apache Airflow on Klutch.sh provides a powerful, scalable platform for orchestrating your data workflows. With proper configuration, persistent storage, and production best practices, you can run complex DAGs reliably and efficiently.