
Deploying Airflow

Apache Airflow is a powerful open-source platform for authoring, scheduling, and monitoring workflows programmatically. Originally developed by Airbnb, Airflow enables data engineers and developers to create complex data pipelines as Directed Acyclic Graphs (DAGs) using Python code. With its rich UI, extensive integrations, and robust scheduling capabilities, Airflow has become the industry standard for orchestrating data workflows, ETL processes, and ML pipelines at scale.

This comprehensive guide walks you through deploying Apache Airflow on Klutch.sh using a Dockerfile. You’ll learn how to set up your environment, configure persistent storage for logs and metadata, manage environment variables, and follow best practices for production deployments.

Prerequisites

Before deploying Airflow to Klutch.sh, ensure you have:

  • A Klutch.sh account
  • A GitHub repository for your Airflow project
  • Basic knowledge of Docker and Python
  • Understanding of workflow orchestration concepts
  • Python 3.8 or higher installed locally for development

Getting Started: Install Airflow Locally

Before deploying to Klutch.sh, it’s recommended to test Airflow locally to understand its structure and create your DAGs.

    1. Create a project directory and set up a virtual environment:

      Terminal window
      mkdir my-airflow-project
      cd my-airflow-project
      python3 -m venv venv
      source venv/bin/activate # On Windows: venv\Scripts\activate
    2. Install Apache Airflow:

      Terminal window
      pip install "apache-airflow==2.7.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.8.txt"
    3. Initialize the Airflow database:

      Terminal window
      export AIRFLOW_HOME=$(pwd)/airflow
      airflow db init
    4. Create an admin user:

      Terminal window
      airflow users create \
      --username admin \
      --firstname Admin \
      --lastname User \
      --role Admin \
      --email admin@example.com \
      --password admin
    5. Start the Airflow webserver and scheduler:

      In separate terminal windows (both with your virtual environment activated):

      Terminal window
      # Terminal 1: Start the webserver
      airflow webserver --port 8080
      # Terminal 2: Start the scheduler
      airflow scheduler
    6. Access the Airflow UI:

      Open your browser and navigate to http://localhost:8080. Log in with the credentials you created (username: admin, password: admin).
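
You can also confirm the installation from the command line; both commands below are standard Airflow CLI calls:

Terminal window
# Print the installed Airflow version
airflow version
# List the DAGs Airflow currently knows about (the bundled examples, at this point)
airflow dags list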

Sample DAG Code

Create a sample DAG to test your Airflow installation. Create a file named hello_airflow.py in your airflow/dags directory:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

# Define default arguments
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
dag = DAG(
    'hello_airflow_dag',
    default_args=default_args,
    description='A simple Hello World DAG',
    schedule_interval=timedelta(days=1),
    catchup=False,
    tags=['example', 'hello'],
)

def print_hello():
    """Simple Python function to print a hello message"""
    print("Hello from Airflow on Klutch.sh!")
    return "Hello from Airflow!"

def print_date():
    """Print the current date"""
    print(f"Current date: {datetime.now()}")
    return datetime.now()

# Define tasks
hello_task = PythonOperator(
    task_id='hello_task',
    python_callable=print_hello,
    dag=dag,
)

date_task = PythonOperator(
    task_id='date_task',
    python_callable=print_date,
    dag=dag,
)

bash_task = BashOperator(
    task_id='bash_task',
    bash_command='echo "Running Bash task on Klutch.sh"',
    dag=dag,
)

# Set task dependencies
hello_task >> date_task >> bash_task

This DAG demonstrates basic Airflow concepts including Python operators, Bash operators, and task dependencies. Once your local installation is working, you’re ready to deploy to Klutch.sh.
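
To confirm the DAG parses and its tasks run end to end without waiting for the scheduler, you can execute it once from the CLI (the date is an arbitrary logical date):

Terminal window
airflow dags test hello_airflow_dag 2024-01-01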


Deploying Airflow with a Dockerfile

Klutch.sh automatically detects a Dockerfile in your repository’s root directory and uses it to build and deploy your application. This method provides full control over your Airflow installation and dependencies.

    1. Create a Dockerfile in your project root:

      # Use the official Apache Airflow image as base
      FROM apache/airflow:2.7.3-python3.11
      # Set working directory
      WORKDIR /opt/airflow
      # Switch to root to install system dependencies if needed
      USER root
      RUN apt-get update && apt-get install -y --no-install-recommends \
      build-essential \
      && apt-get clean \
      && rm -rf /var/lib/apt/lists/*
      # Switch back to airflow user
      USER airflow
      # Copy requirements file
      COPY requirements.txt /opt/airflow/requirements.txt
      # Install Python dependencies
      RUN pip install --no-cache-dir -r requirements.txt
      # Copy DAGs, plugins, and config files
      COPY dags /opt/airflow/dags
      COPY plugins /opt/airflow/plugins
      # Expose the webserver port
      EXPOSE 8080
      # Set environment variables
      ENV AIRFLOW_HOME=/opt/airflow
      ENV AIRFLOW__CORE__LOAD_EXAMPLES=False
      # Initialize the database, create the admin user, then start the scheduler and webserver
      # Note: For production, use a separate database (PostgreSQL)
      CMD ["bash", "-c", "airflow db init && airflow users create --username admin --firstname Admin --lastname User --role Admin --email admin@example.com --password admin || true; airflow scheduler & airflow webserver"]
    2. Create a requirements.txt file:

      apache-airflow==2.7.3
      apache-airflow-providers-http==4.5.0
      apache-airflow-providers-postgres==5.7.1
      psycopg2-binary==2.9.9
      redis==5.0.1
      celery==5.3.4
    3. Create necessary directories:

      Terminal window
      mkdir -p dags plugins logs
      # Git does not track empty directories, so add placeholder files
      # (otherwise the Dockerfile's COPY steps fail when the repo is cloned)
      touch dags/.gitkeep plugins/.gitkeep
    4. Create a .dockerignore file:

      venv/
      __pycache__/
      *.pyc
      *.pyo
      *.pyd
      .git/
      .gitignore
      .env
      .DS_Store
      *.log
      airflow/logs/
    5. Initialize a Git repository and push to GitHub:

      Terminal window
      git init
      git add .
      git commit -m "Initial Airflow setup for Klutch.sh"
      git remote add origin https://github.com/your-username/your-airflow-repo.git
      git push -u origin main
    6. Deploy to Klutch.sh:

      • Log in to Klutch.sh
      • Create a new project or select an existing one
      • Create a new app:
        • Select your GitHub repository
        • Choose the branch you want to deploy (e.g., main)
        • Klutch.sh will automatically detect your Dockerfile
        • Select HTTP as the traffic type
        • Set the internal port to 8080 (Airflow’s default webserver port)
        • Choose your preferred region and compute resources
        • Configure environment variables (see next section)
        • Attach persistent volumes for data persistence
      • Click “Create” to start the deployment

      Klutch.sh will build your Docker image and deploy your Airflow instance. Once deployed, your Airflow webserver will be accessible at a URL like example-app.klutch.sh.
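
If you have Docker installed locally, it is worth verifying that the image builds and starts before pushing; the image tag below is just an example:

Terminal window
# Build the image from the Dockerfile in the current directory
docker build -t airflow-klutch-test .
# Run it locally, mapping the webserver port
docker run --rm -p 8080:8080 airflow-klutch-test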


Persistent Storage

Airflow requires persistent storage for several critical directories to maintain state across deployments and restarts. Klutch.sh provides persistent volumes that you can mount to your container.

Required Volumes

When creating your app on Klutch.sh, attach persistent volumes with the following mount paths:

    1. Logs Directory:

      • Mount Path: /opt/airflow/logs
      • Recommended Size: 10-50 GB (depending on your workload)
      • Purpose: Stores task execution logs
    2. Database Directory (if using SQLite):

      • Mount Path: a data directory such as /opt/airflow/data (mount a directory, not the airflow.db file itself)
      • Recommended Size: 5-10 GB
      • Purpose: Stores the Airflow metadata database; point the database connection at a file inside this directory (see the example after this list)
      • Note: For production, use PostgreSQL instead
    3. DAGs Directory (optional if DAGs are in the image):

      • Mount Path: /opt/airflow/dags
      • Recommended Size: 5 GB
      • Purpose: Store DAG files separately for hot-reloading
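
If you keep SQLite for a small test deployment, point Airflow at a database file inside the mounted directory. The value below is a sketch that assumes a volume mounted at /opt/airflow/data:

Terminal window
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=sqlite:////opt/airflow/data/airflow.db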

Important Notes

  • Volumes in Klutch.sh are created by specifying the mount path and size only
  • Ensure your container has proper permissions to write to these directories
  • For production deployments, use an external PostgreSQL database instead of SQLite
  • Regular backups of your volumes are recommended

Environment Variables

Configure your Airflow deployment using environment variables. Add these in the Klutch.sh app configuration:

Essential Environment Variables

Terminal window
# Core Configuration
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__CORE__LOAD_EXAMPLES=False
AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
AIRFLOW__CORE__FERNET_KEY=<your-generated-fernet-key>
# Database Configuration (for PostgreSQL)
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@your-postgres-host:5432/airflow
# Webserver Configuration
AIRFLOW__WEBSERVER__BASE_URL=https://example-app.klutch.sh
AIRFLOW__WEBSERVER__WEB_SERVER_PORT=8080
AIRFLOW__WEBSERVER__EXPOSE_CONFIG=True
# Security
AIRFLOW__WEBSERVER__SECRET_KEY=<your-secret-key>
AIRFLOW__API__AUTH_BACKENDS=airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session
# Logging
AIRFLOW__LOGGING__BASE_LOG_FOLDER=/opt/airflow/logs
AIRFLOW__LOGGING__REMOTE_LOGGING=False

Generating Required Keys

Generate a Fernet key for encrypting sensitive data:

Terminal window
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

Generate a secret key for the webserver:

Terminal window
python -c "import secrets; print(secrets.token_hex(32))"

Using Celery Executor (for Scaling)

If you need to scale Airflow with multiple workers, use the Celery executor:

Terminal window
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CELERY__BROKER_URL=redis://your-redis-host:6379/0
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://username:password@your-postgres-host:5432/airflow

For Celery deployments, you’ll need to deploy separate worker instances. See the Airflow Worker guide for details.
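
On each worker instance, run the Celery worker process with the same image and environment variables as the webserver and scheduler. A minimal start command, assuming the CeleryExecutor configuration above and the Celery provider that ships with the official Airflow image:

Terminal window
airflow celery worker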


Traffic Type and Port Configuration

When deploying Airflow on Klutch.sh:

  • Traffic Type: Select HTTP in the Klutch.sh dashboard
  • Internal Port: Set to 8080 (Airflow webserver’s default port)
  • External Access: Your app will be accessible via HTTPS at your assigned Klutch.sh URL (e.g., https://example-app.klutch.sh)

Airflow’s webserver listens on port 8080 by default. When you configure the internal port as 8080, Klutch.sh routes external HTTPS traffic to your container’s port 8080.


Using a Production Database

For production deployments, replace SQLite with PostgreSQL:

    1. Deploy a PostgreSQL database:

      • Create a separate PostgreSQL app on Klutch.sh or use an external managed database
      • Note the connection details (host, port, database name, username, password)
    2. Update your environment variables:

      Terminal window
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://username:password@postgres-host:5432/airflow
    3. Update your Dockerfile to include PostgreSQL support:

      The requirements.txt already includes psycopg2-binary for PostgreSQL support.

    4. Initialize the database:

      On first deployment, run:

      Terminal window
      airflow db init

      This is typically handled by your Dockerfile’s CMD, but you can also run it manually if needed.


Monitoring and Logging

Accessing Logs

  • Task Logs: Available in the Airflow UI under each task instance
  • System Logs: Accessible through Klutch.sh’s logging interface
  • Persistent Logs: Stored in the mounted volume at /opt/airflow/logs

Health Checks

Monitor your Airflow deployment:

  • Webserver health: https://example-app.klutch.sh/health
  • Scheduler status: Check the Airflow UI’s Admin → Status page
  • Database connectivity: Verify in the Airflow UI
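
The health endpoint can be probed from any machine and returns a small JSON document reporting metadatabase and scheduler status; the URL below assumes the example app name used throughout this guide:

Terminal window
curl https://example-app.klutch.sh/health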

Best Practices for Production

    1. Use PostgreSQL Database:

      • Never use SQLite in production
      • Deploy a managed PostgreSQL instance
      • Enable connection pooling
    2. Secure Your Deployment:

      • Set strong passwords for admin users
      • Use AIRFLOW__WEBSERVER__SECRET_KEY and AIRFLOW__CORE__FERNET_KEY
      • Enable API authentication: AIRFLOW__API__AUTH_BACKENDS=airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session
      • Consider implementing OAuth or LDAP for enterprise environments
    3. Scale with Celery:

      • For high workloads, use CeleryExecutor
      • Deploy multiple worker instances
      • Use Redis or RabbitMQ as the message broker
    4. Implement Proper Logging:

      • Configure remote logging to cloud storage (S3, GCS)
      • Set appropriate log retention policies
      • Monitor disk usage in the logs volume
    5. Resource Allocation:

      • Allocate sufficient CPU and memory based on your DAG complexity
      • Monitor resource usage and scale as needed
      • Use Klutch.sh’s scaling features to adjust compute resources
    6. DAG Best Practices:

      • Keep DAGs in version control
      • Use variables and connections for configuration
      • Implement proper error handling and retries
      • Add documentation to your DAGs
    7. Regular Maintenance:

      • Update Airflow versions regularly
      • Clean up old task instances and logs
      • Monitor database size and performance
      • Backup your database and volumes regularly
    8. Testing:

      • Test DAGs locally before deploying
      • Use Airflow’s testing utilities
      • Implement CI/CD for DAG validation
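
A lightweight way to validate DAGs in CI is to load them with Airflow's DagBag and fail on import errors. The sketch below assumes pytest and that your DAG files live in the repository's dags/ directory:

# test_dags.py
from airflow.models import DagBag

def test_dags_import_without_errors():
    """Fail the build if any DAG file has an import or parsing error."""
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    assert not dag_bag.import_errors, f"DAG import errors: {dag_bag.import_errors}"

def test_hello_dag_is_loaded():
    """Check that the sample DAG from this guide is registered with its three tasks."""
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    dag = dag_bag.get_dag("hello_airflow_dag")
    assert dag is not None
    assert len(dag.tasks) == 3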

Customizing Airflow with Nixpacks

While this guide focuses on Dockerfile deployments, Klutch.sh also uses Nixpacks to automatically build applications without a Dockerfile. If you need to customize build or start commands when not using a Dockerfile, you can set these environment variables:

  • START_COMMAND: Override the default start command
  • BUILD_COMMAND: Override the default build command

However, for Airflow deployments, using a Dockerfile is strongly recommended for full control over the environment.


Troubleshooting

Common Issues

Webserver not starting:

  • Check logs in Klutch.sh dashboard
  • Verify database connection string
  • Ensure all required environment variables are set
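
If you can open a shell in the container, these commands show the configuration Airflow actually resolved from environment variables and airflow.cfg:

Terminal window
# Print the resolved metadata database connection string
airflow config get-value database sql_alchemy_conn
# Summarize paths, versions, and providers in the running environment
airflow info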

DAGs not appearing:

  • Verify DAGs are in /opt/airflow/dags
  • Check file permissions
  • Look for parsing errors in logs
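
From a shell in the container, the CLI can surface parsing problems directly:

Terminal window
# Show DAG files that failed to parse, with their error messages
airflow dags list-import-errors
# Confirm which DAGs the scheduler has registered
airflow dags list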

Database initialization fails:

  • Verify PostgreSQL connection
  • Check database credentials
  • Ensure database exists and user has proper permissions
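
A quick connectivity test from inside the container:

Terminal window
# Verify that Airflow can reach the metadata database
airflow db check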

Out of disk space:

  • Increase volume sizes for logs
  • Implement log rotation
  • Clean up old task instances
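
Airflow also ships a maintenance command for pruning old metadata such as task instance records; a dry run previews what would be deleted (the cutoff date is just an example):

Terminal window
airflow db clean --clean-before-timestamp "2024-01-01" --dry-run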

Resources

  • Apache Airflow documentation: https://airflow.apache.org/docs/
  • Apache Airflow GitHub repository: https://github.com/apache/airflow
  • Official Airflow Docker image: https://hub.docker.com/r/apache/airflow

Deploying Apache Airflow on Klutch.sh provides a powerful, scalable platform for orchestrating your data workflows. With proper configuration, persistent storage, and production best practices, you can run complex DAGs reliably and efficiently.