
Deploying ClickHouse with S3 Storage

Introduction

ClickHouse is a high-performance, open-source column-oriented database management system designed for analytics. When integrated with S3 storage, ClickHouse enables cost-effective storage of massive datasets while maintaining fast query performance through intelligent caching and tiered storage.

Deploying ClickHouse with S3 on Klutch.sh provides:

  • Scalable analytics infrastructure with S3 object storage
  • Separation of compute and storage for flexible scaling
  • Persistent storage for database files and configuration
  • Automated deployment with Docker

Prerequisites

Before deploying ClickHouse with S3 storage, ensure you have:

  • A Klutch.sh account
  • A GitHub account with a repository
  • Access to an S3-compatible storage service (AWS S3, MinIO, Backblaze B2, etc.)
  • S3 access credentials (access key ID and secret access key)
  • Basic knowledge of Docker and databases

Installation and Setup

Step 1: Create Your Project Directory

Create a new directory for your ClickHouse deployment:

mkdir clickhouse-s3-klutch
cd clickhouse-s3-klutch
git init

Step 2: Create the Dockerfile

Create a Dockerfile in your project root:

FROM clickhouse/clickhouse-server:latest

# Create data, log, and config directories
RUN mkdir -p /var/lib/clickhouse \
    /var/log/clickhouse-server \
    /etc/clickhouse-server/config.d

# Copy S3 configuration
COPY s3_storage.xml /etc/clickhouse-server/config.d/

# Set permissions
RUN chown -R clickhouse:clickhouse /var/lib/clickhouse \
    /var/log/clickhouse-server \
    /etc/clickhouse-server

# Expose the ClickHouse native protocol port
EXPOSE 9000

USER clickhouse
CMD ["/usr/bin/clickhouse-server", "--config-file=/etc/clickhouse-server/config.xml"]

Step 3: Create S3 Storage Configuration

Create s3_storage.xml for S3 integration:

<clickhouse>
    <storage_configuration>
        <disks>
            <default>
                <keep_free_space_bytes>1073741824</keep_free_space_bytes>
            </default>
            <s3>
                <type>s3</type>
                <endpoint>https://s3.amazonaws.com/your-bucket/clickhouse/</endpoint>
                <access_key_id>YOUR_ACCESS_KEY</access_key_id>
                <secret_access_key>YOUR_SECRET_KEY</secret_access_key>
                <region>us-east-1</region>
                <metadata_path>/var/lib/clickhouse/disks/s3/</metadata_path>
                <cache_enabled>true</cache_enabled>
                <cache_path>/var/lib/clickhouse/disks/s3_cache/</cache_path>
                <cache_size>10737418240</cache_size>
            </s3>
        </disks>
        <policies>
            <tiered_storage>
                <volumes>
                    <hot>
                        <disk>default</disk>
                    </hot>
                    <cold>
                        <disk>s3</disk>
                    </cold>
                </volumes>
            </tiered_storage>
            <s3_only>
                <volumes>
                    <main>
                        <disk>s3</disk>
                    </main>
                </volumes>
            </s3_only>
        </policies>
    </storage_configuration>
</clickhouse>

Replace your-bucket, YOUR_ACCESS_KEY, and YOUR_SECRET_KEY with your bucket name and S3 credentials. The tiered_storage policy spreads data across the local disk and S3, while the s3_only policy (used for archive tables later in this guide) writes directly to S3. To keep secrets out of your repository, ClickHouse can also read these values from environment variables via the from_env attribute, e.g. <access_key_id from_env="S3_ACCESS_KEY_ID"/>.

Step 4: Push to GitHub

Commit your files to GitHub:

git add Dockerfile s3_storage.xml
git commit -m "Add ClickHouse with S3 configuration"
git remote add origin https://github.com/yourusername/clickhouse-s3.git
git push -u origin main

Deploying to Klutch.sh

Deploy your ClickHouse instance with S3 storage on Klutch.sh.

Deployment Steps

    1. Log in to Klutch.sh

      Navigate to klutch.sh/app and sign in.

    2. Create a New App

      Click “Create App” and connect your GitHub repository containing the Dockerfile.

    3. Configure Traffic Type

      • Traffic Type: Select TCP
      • Internal Port: Set to 9000 (ClickHouse native protocol port)

      Your application connects to ClickHouse on external port 8000, which routes to internal port 9000.

    4. Set Environment Variables

      Configure these environment variables in the Klutch.sh dashboard:

      • S3_ENDPOINT: Your S3 endpoint (e.g., https://s3.amazonaws.com)
      • S3_BUCKET: S3 bucket name for ClickHouse data
      • S3_ACCESS_KEY_ID: S3 access key
      • S3_SECRET_ACCESS_KEY: S3 secret key
      • S3_REGION: S3 region (e.g., us-east-1)
      • CLICKHOUSE_USER: Admin username (default: default)
      • CLICKHOUSE_PASSWORD: Strong password for admin user
    5. Attach a Persistent Volume

      • Click “Add Volume”
      • Mount Path: /var/lib/clickhouse
      • Size: Choose based on cache needs (minimum 10GB, recommended 50-100GB)

      The volume stores system tables, metadata, and S3 cache for better performance.

    6. Deploy

      Click “Create” to deploy. Klutch.sh will build your Docker image and start ClickHouse with S3 integration.

    7. Connect to ClickHouse

      Once deployed, connect using:

      clickhouse-client --host example-app.klutch.sh --port 8000 --user default --password your_password
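
      Once connected, a quick sanity check confirms the server is responding:

      -- Confirm server version and uptime in seconds
      SELECT version(), uptime();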

Using S3 Storage

Creating Tables with Tiered Storage

Create tables that automatically move data to S3:

CREATE TABLE events
(
    event_time DateTime,
    user_id UInt64,
    event_type String,
    data String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time)
TTL event_time + INTERVAL 30 DAY TO DISK 's3'
SETTINGS storage_policy = 'tiered_storage';

This configuration keeps recent data local for fast access and moves older data to S3 after 30 days.
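
To confirm the policy is applied, you can insert a test row (the values below are placeholders for your own data) and check which disk the resulting data part lands on; new parts should start on the local default disk:

-- Insert a placeholder row matching the events schema
INSERT INTO events VALUES (now(), 42, 'page_view', '{}');

-- Show which disk each active part of the table lives on
SELECT name, disk_name
FROM system.parts
WHERE table = 'events' AND active;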

S3-Only Storage

For archive data, use S3-only storage:

CREATE TABLE archive_data
(
    date Date,
    data String
)
ENGINE = MergeTree()
ORDER BY date
SETTINGS storage_policy = 's3_only';
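
Partitions of an existing table can also be moved to S3 on demand rather than waiting for the TTL to expire; the partition ID below is hypothetical:

-- Move one monthly partition of the events table to the S3 disk
ALTER TABLE events MOVE PARTITION ID '202401' TO DISK 's3';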

Verify S3 Integration

Check that S3 storage is configured correctly:

-- View configured disks
SELECT name, path FROM system.disks;
-- View storage policies
SELECT policy_name, disks FROM system.storage_policies;
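
For an end-to-end write test, a throwaway table pinned to the s3_only policy confirms that ClickHouse can actually reach the bucket (a minimal sketch; the table name is arbitrary):

-- Write a single part through the S3 disk, then clean up
CREATE TABLE s3_smoke_test (x UInt8)
ENGINE = MergeTree()
ORDER BY x
SETTINGS storage_policy = 's3_only';

INSERT INTO s3_smoke_test VALUES (1);

SELECT disk_name FROM system.parts WHERE table = 's3_smoke_test' AND active;

DROP TABLE s3_smoke_test;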

Production Best Practices

Security

  • Use strong passwords for ClickHouse authentication
  • Enable S3 server-side encryption
  • Store credentials as environment variables in Klutch.sh
  • Regularly update ClickHouse to the latest version

Performance

  • Allocate sufficient memory (minimum 2GB, recommended 8GB+)
  • Configure adequate local cache size for S3 data
  • Use appropriate partitioning strategies
  • Monitor S3 API calls to optimize costs (see the query below)
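
ClickHouse exposes cumulative S3 request counters in system.events; exact counter names vary by version, but a query like this shows overall API call volume:

-- Cumulative S3 request counters since server start
SELECT event, value
FROM system.events
WHERE event LIKE 'S3%'
ORDER BY value DESC;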

Monitoring

Monitor key metrics for your deployment:

  • Query execution times
  • Memory and CPU usage
  • S3 API call rates
  • Storage utilization

Check system metrics:

-- View running queries
SELECT query_id, elapsed, query FROM system.processes;
-- Check storage usage
SELECT name, path, free_space, total_space FROM system.disks;
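
For historical analysis, the query log (typically enabled by default) records finished queries; a sketch for finding the slowest recent ones:

-- Ten slowest queries recorded in the query log
SELECT query_duration_ms, query
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 10;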

Troubleshooting

Cannot Connect to ClickHouse

  • Verify app is running in Klutch.sh dashboard
  • Confirm TCP traffic type is selected
  • Check internal port is set to 9000
  • Use port 8000 for external connections

S3 Connection Errors

  • Verify S3 credentials in environment variables
  • Check S3 bucket exists and is accessible
  • Confirm S3 endpoint URL is correct
  • Test S3 connectivity:
SELECT * FROM s3('https://s3.amazonaws.com/your-bucket/test.csv', 'access_key', 'secret_key', 'CSV');

Slow Query Performance

  • Increase local cache size for S3 disk
  • Optimize table ORDER BY and PARTITION BY clauses
  • Check which disk the table's parts currently live on:
SELECT disk_name, count() AS parts
FROM system.parts
WHERE database = 'your_db' AND active
GROUP BY disk_name;
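
To size the local volume and cache, it also helps to see how much data sits on each disk:

-- Total on-disk bytes per disk across all active parts
SELECT disk_name, formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active
GROUP BY disk_name;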

Resources