
Deploying ClickHouse with S3 Storage

Introduction

ClickHouse is a high-performance, open-source column-oriented database management system designed for analytics. When integrated with S3 storage, ClickHouse enables cost-effective storage of massive datasets while maintaining fast query performance through intelligent caching and tiered storage.

Deploying ClickHouse with S3 on Klutch.sh provides:

  • Scalable analytics infrastructure with S3 object storage
  • Separation of compute and storage for flexible scaling
  • Persistent storage for database files and configuration
  • Automated deployment with Docker

Prerequisites

Before deploying ClickHouse with S3 storage, ensure you have:

  • A Klutch.sh account
  • A GitHub account with a repository
  • Access to an S3-compatible storage service (AWS S3, MinIO, Backblaze B2, etc.)
  • S3 access credentials (access key ID and secret access key)
  • Basic knowledge of Docker and databases

Installation and Setup

Step 1: Create Your Project Directory

Create a new directory for your ClickHouse deployment:

mkdir clickhouse-s3-klutch
cd clickhouse-s3-klutch
git init

Step 2: Create the Dockerfile

Create a Dockerfile in your project root:

FROM clickhouse/clickhouse-server:latest

# Create data, log, and config directories
RUN mkdir -p /var/lib/clickhouse \
    /var/log/clickhouse-server \
    /etc/clickhouse-server/config.d

# Copy S3 configuration
COPY s3_storage.xml /etc/clickhouse-server/config.d/

# Set permissions
RUN chown -R clickhouse:clickhouse /var/lib/clickhouse \
    /var/log/clickhouse-server \
    /etc/clickhouse-server

# Expose the ClickHouse native protocol port
EXPOSE 9000

USER clickhouse
CMD ["/usr/bin/clickhouse-server", "--config-file=/etc/clickhouse-server/config.xml"]

Step 3: Create S3 Storage Configuration

Create s3_storage.xml for S3 integration:

<clickhouse>
    <storage_configuration>
        <disks>
            <default>
                <keep_free_space_bytes>1073741824</keep_free_space_bytes>
            </default>
            <s3>
                <type>s3</type>
                <endpoint>https://s3.amazonaws.com/your-bucket/clickhouse/</endpoint>
                <access_key_id>YOUR_ACCESS_KEY</access_key_id>
                <secret_access_key>YOUR_SECRET_KEY</secret_access_key>
                <region>us-east-1</region>
                <metadata_path>/var/lib/clickhouse/disks/s3/</metadata_path>
                <cache_enabled>true</cache_enabled>
                <cache_path>/var/lib/clickhouse/disks/s3_cache/</cache_path>
                <cache_size>10737418240</cache_size>
            </s3>
        </disks>
        <policies>
            <tiered_storage>
                <volumes>
                    <hot>
                        <disk>default</disk>
                    </hot>
                    <cold>
                        <disk>s3</disk>
                    </cold>
                </volumes>
            </tiered_storage>
            <s3_only>
                <volumes>
                    <main>
                        <disk>s3</disk>
                    </main>
                </volumes>
            </s3_only>
        </policies>
    </storage_configuration>
</clickhouse>

Replace your-bucket, YOUR_ACCESS_KEY, and YOUR_SECRET_KEY with your bucket name and S3 credentials. The tiered_storage policy spreads data across the local disk and S3, while the s3_only policy (used for archive tables later in this guide) writes directly to S3. To keep secrets out of your repository, ClickHouse can also read these values from environment variables via the from_env attribute, e.g. <access_key_id from_env="S3_ACCESS_KEY_ID"/>.

Step 4: Push to GitHub

Commit your files to GitHub:

git add Dockerfile s3_storage.xml
git commit -m "Add ClickHouse with S3 configuration"
git remote add origin https://github.com/yourusername/clickhouse-s3.git
git push -u origin main

Deploying to Klutch.sh

Deploy your ClickHouse instance with S3 storage on Klutch.sh.

Deployment Steps

    1. Log in to Klutch.sh

      Navigate to klutch.sh/app and sign in.

    2. Create a New App

      Click “Create App” and connect your GitHub repository containing the Dockerfile.

    3. Configure Traffic Type

      • Traffic Type: Select TCP
      • Internal Port: Set to 9000 (ClickHouse native protocol port)

      Your application connects to ClickHouse on external port 8000, which routes to internal port 9000.

    4. Set Environment Variables

      Configure these environment variables in the Klutch.sh dashboard:

      • S3_ENDPOINT: Your S3 endpoint (e.g., https://s3.amazonaws.com)
      • S3_BUCKET: S3 bucket name for ClickHouse data
      • S3_ACCESS_KEY_ID: S3 access key
      • S3_SECRET_ACCESS_KEY: S3 secret key
      • S3_REGION: S3 region (e.g., us-east-1)
      • CLICKHOUSE_USER: Admin username (default: default)
      • CLICKHOUSE_PASSWORD: Strong password for admin user
    5. Attach a Persistent Volume

      • Click “Add Volume”
      • Mount Path: /var/lib/clickhouse
      • Size: Choose based on cache needs (minimum 10GB, recommended 50-100GB)

      The volume stores system tables, metadata, and S3 cache for better performance.

    6. Deploy

      Click “Create” to deploy. Klutch.sh will build your Docker image and start ClickHouse with S3 integration.

    7. Connect to ClickHouse

      Once deployed, connect using:

      clickhouse-client --host example-app.klutch.sh --port 8000 --user default --password your_password
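
      Once connected, a quick sanity check confirms the server is responding:

      -- Confirm server version and uptime in seconds
      SELECT version(), uptime();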

Using S3 Storage

Creating Tables with Tiered Storage

Create tables that automatically move data to S3:

CREATE TABLE events
(
    event_time DateTime,
    user_id UInt64,
    event_type String,
    data String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time)
TTL event_time + INTERVAL 30 DAY TO DISK 's3'
SETTINGS storage_policy = 'tiered_storage';

This configuration keeps recent data local for fast access and moves older data to S3 after 30 days.
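
To confirm the policy is applied, you can insert a test row (the values below are placeholders for your own data) and check which disk the resulting data part lands on; new parts should start on the local default disk:

-- Insert a placeholder row matching the events schema
INSERT INTO events VALUES (now(), 42, 'page_view', '{}');

-- Show which disk each active part of the table lives on
SELECT name, disk_name
FROM system.parts
WHERE table = 'events' AND active;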

S3-Only Storage

For archive data, use S3-only storage:

CREATE TABLE archive_data
(
    date Date,
    data String
)
ENGINE = MergeTree()
ORDER BY date
SETTINGS storage_policy = 's3_only';
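
Partitions of an existing table can also be moved to S3 on demand rather than waiting for the TTL to expire; the partition ID below is hypothetical:

-- Move one monthly partition of the events table to the S3 disk
ALTER TABLE events MOVE PARTITION ID '202401' TO DISK 's3';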

Verify S3 Integration

Check that S3 storage is configured correctly:

-- View configured disks
SELECT name, path FROM system.disks;
-- View storage policies
SELECT policy_name, disks FROM system.storage_policies;
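
For an end-to-end write test, a throwaway table pinned to the s3_only policy confirms that ClickHouse can actually reach the bucket (a minimal sketch; the table name is arbitrary):

-- Write a single part through the S3 disk, then clean up
CREATE TABLE s3_smoke_test (x UInt8)
ENGINE = MergeTree()
ORDER BY x
SETTINGS storage_policy = 's3_only';

INSERT INTO s3_smoke_test VALUES (1);

SELECT disk_name FROM system.parts WHERE table = 's3_smoke_test' AND active;

DROP TABLE s3_smoke_test;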

Production Best Practices

Security

  • Use strong passwords for ClickHouse authentication
  • Enable S3 server-side encryption
  • Store credentials as environment variables in Klutch.sh
  • Regularly update ClickHouse to the latest version

Performance

  • Allocate sufficient memory (minimum 2GB, recommended 8GB+)
  • Configure adequate local cache size for S3 data
  • Use appropriate partitioning strategies
  • Monitor S3 API calls to optimize costs (see the query below)
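
ClickHouse exposes cumulative S3 request counters in system.events; exact counter names vary by version, but a query like this shows overall API call volume:

-- Cumulative S3 request counters since server start
SELECT event, value
FROM system.events
WHERE event LIKE 'S3%'
ORDER BY value DESC;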

Monitoring

Monitor key metrics for your deployment:

  • Query execution times
  • Memory and CPU usage
  • S3 API call rates
  • Storage utilization

Check system metrics:

-- View running queries
SELECT query_id, elapsed, query FROM system.processes;
-- Check storage usage
SELECT name, path, free_space, total_space FROM system.disks;
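
For historical analysis, the query log (typically enabled by default) records finished queries; a sketch for finding the slowest recent ones:

-- Ten slowest queries recorded in the query log
SELECT query_duration_ms, query
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 10;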

Troubleshooting

Cannot Connect to ClickHouse

  • Verify app is running in Klutch.sh dashboard
  • Confirm TCP traffic type is selected
  • Check internal port is set to 9000
  • Use port 8000 for external connections

S3 Connection Errors

  • Verify S3 credentials in environment variables
  • Check S3 bucket exists and is accessible
  • Confirm S3 endpoint URL is correct
  • Test S3 connectivity:
SELECT * FROM s3('https://s3.amazonaws.com/your-bucket/test.csv', 'access_key', 'secret_key', 'CSV');

Slow Query Performance

  • Increase local cache size for S3 disk
  • Optimize table ORDER BY and PARTITION BY clauses
  • Check which disk the table's parts currently live on:
SELECT disk_name, count() AS parts
FROM system.parts
WHERE database = 'your_db' AND active
GROUP BY disk_name;
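
To size the local volume and cache, it also helps to see how much data sits on each disk:

-- Total on-disk bytes per disk across all active parts
SELECT disk_name, formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active
GROUP BY disk_name;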

Resources