Deploying ClickHouse with S3 Storage
Introduction
ClickHouse is a high-performance, open-source column-oriented database management system designed for analytics. When integrated with S3 storage, ClickHouse enables cost-effective storage of massive datasets while maintaining fast query performance through intelligent caching and tiered storage.
Deploying ClickHouse with S3 on Klutch.sh provides:
- Scalable analytics infrastructure with S3 object storage
- Separation of compute and storage for flexible scaling
- Persistent storage for database files and configuration
- Automated deployment with Docker
Prerequisites
Before deploying ClickHouse with S3 storage, ensure you have:
- A Klutch.sh account
- A GitHub account with a repository
- Access to an S3-compatible storage service (AWS S3, MinIO, Backblaze B2, etc.)
- S3 access credentials (access key ID and secret access key)
- Basic knowledge of Docker and databases
Installation and Setup
Step 1: Create Your Project Directory
Create a new directory for your ClickHouse deployment:
```shell
mkdir clickhouse-s3-klutch
cd clickhouse-s3-klutch
git init
```

Step 2: Create the Dockerfile
Create a Dockerfile in your project root:
```dockerfile
FROM clickhouse/clickhouse-server:latest

# Create directories
RUN mkdir -p /var/lib/clickhouse \
    /var/log/clickhouse-server \
    /etc/clickhouse-server/config.d

# Copy S3 configuration
COPY s3_storage.xml /etc/clickhouse-server/config.d/

# Set permissions
RUN chown -R clickhouse:clickhouse /var/lib/clickhouse \
    /var/log/clickhouse-server \
    /etc/clickhouse-server

# Expose ClickHouse port
EXPOSE 9000

USER clickhouse
CMD ["/usr/bin/clickhouse-server", "--config-file=/etc/clickhouse-server/config.xml"]
```

Step 3: Create S3 Storage Configuration
Create s3_storage.xml for S3 integration:
```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <default>
                <keep_free_space_bytes>1073741824</keep_free_space_bytes>
            </default>
            <s3>
                <type>s3</type>
                <endpoint>https://s3.amazonaws.com/your-bucket/clickhouse/</endpoint>
                <access_key_id>YOUR_ACCESS_KEY</access_key_id>
                <secret_access_key>YOUR_SECRET_KEY</secret_access_key>
                <region>us-east-1</region>
                <metadata_path>/var/lib/clickhouse/disks/s3/</metadata_path>
                <cache_enabled>true</cache_enabled>
                <cache_path>/var/lib/clickhouse/disks/s3_cache/</cache_path>
                <cache_size>10737418240</cache_size>
            </s3>
        </disks>
        <policies>
            <tiered_storage>
                <volumes>
                    <hot>
                        <disk>default</disk>
                    </hot>
                    <cold>
                        <disk>s3</disk>
                    </cold>
                </volumes>
            </tiered_storage>
        </policies>
    </storage_configuration>
</clickhouse>
```

Replace your-bucket, YOUR_ACCESS_KEY, and YOUR_SECRET_KEY with your S3 bucket name and credentials.
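The byte values in this config (1073741824 and 10737418240) are 1 GiB and 10 GiB. A quick sketch for computing such settings, so you can size keep_free_space_bytes and cache_size for your own deployment:

```python
def gib(n: int) -> int:
    """Convert GiB to bytes (1 GiB = 2**30 bytes)."""
    return n * 1024 ** 3

print(gib(1))   # 1073741824 -> keep_free_space_bytes above
print(gib(10))  # 10737418240 -> cache_size above
```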
Step 4: Push to GitHub
Commit your files to GitHub:
```shell
git add Dockerfile s3_storage.xml
git commit -m "Add ClickHouse with S3 configuration"
git remote add origin https://github.com/yourusername/clickhouse-s3.git
git push -u origin main
```

Deploying to Klutch.sh
Deploy your ClickHouse instance with S3 storage on Klutch.sh.
Deployment Steps
1. Log in to Klutch.sh

Navigate to klutch.sh/app and sign in.
2. Create a New App

Click “Create App” and connect the GitHub repository containing your Dockerfile.
3. Configure Traffic Type

- Traffic Type: Select TCP
- Internal Port: Set to 9000 (the ClickHouse native protocol port)

Your application connects to ClickHouse on external port 8000, which Klutch.sh routes to internal port 9000.
4. Set Environment Variables

Configure these environment variables in the Klutch.sh dashboard:

- S3_ENDPOINT: Your S3 endpoint (e.g., https://s3.amazonaws.com)
- S3_BUCKET: S3 bucket name for ClickHouse data
- S3_ACCESS_KEY_ID: S3 access key
- S3_SECRET_ACCESS_KEY: S3 secret key
- S3_REGION: S3 region (e.g., us-east-1)
- CLICKHOUSE_USER: Admin username (default: default)
- CLICKHOUSE_PASSWORD: Strong password for the admin user
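The sample s3_storage.xml hardcodes credentials in the image. ClickHouse configuration files can instead read values from environment variables via the from_env attribute, which pairs well with the variables set above. A sketch, to be verified against your ClickHouse version (note the endpoint still needs the full bucket URL, not just S3_ENDPOINT):

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <s3>
                <type>s3</type>
                <endpoint>https://s3.amazonaws.com/your-bucket/clickhouse/</endpoint>
                <access_key_id from_env="S3_ACCESS_KEY_ID"></access_key_id>
                <secret_access_key from_env="S3_SECRET_ACCESS_KEY"></secret_access_key>
                <region from_env="S3_REGION"></region>
            </s3>
        </disks>
    </storage_configuration>
</clickhouse>
```

This keeps secrets out of your Git history and lets you rotate credentials from the dashboard without rebuilding the image.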
5. Attach a Persistent Volume

- Click “Add Volume”
- Mount Path: /var/lib/clickhouse
- Size: Choose based on cache needs (minimum 10GB, recommended 50-100GB)

The volume stores system tables, metadata, and the S3 cache for better performance.
6. Deploy

Click “Create” to deploy. Klutch.sh will build your Docker image and start ClickHouse with S3 integration.
7. Connect to ClickHouse

Once deployed, connect with clickhouse-client:

```shell
clickhouse-client --host example-app.klutch.sh --port 8000 --user default --password your_password
```
Using S3 Storage
Creating Tables with Tiered Storage
Create tables that automatically move data to S3:
```sql
CREATE TABLE events
(
    event_time DateTime,
    user_id UInt64,
    event_type String,
    data String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time)
TTL event_time + INTERVAL 30 DAY TO DISK 's3'
SETTINGS storage_policy = 'tiered_storage';
```

This configuration keeps recent data on local disk for fast access and moves data older than 30 days to S3.
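The semantics of the table above can be sketched in plain code: toYYYYMM() groups rows into monthly partitions, and the TTL clause marks a row for relocation to the s3 disk once 30 days have passed. This is an illustrative model of the rules, not how ClickHouse implements them internally:

```python
from datetime import date, timedelta

def partition_id(d: date) -> str:
    """Mimic the PARTITION BY toYYYYMM(event_time) key."""
    return f"{d.year}{d.month:02d}"

def moves_to_s3(event_time: date, today: date) -> bool:
    """Mimic TTL event_time + INTERVAL 30 DAY TO DISK 's3'."""
    return today >= event_time + timedelta(days=30)

today = date(2024, 3, 15)
print(partition_id(date(2024, 1, 10)))        # 202401
print(moves_to_s3(date(2024, 1, 10), today))  # True  (older than 30 days)
print(moves_to_s3(date(2024, 3, 1), today))   # False (still hot)
```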
S3-Only Storage
For archive data, use a policy that stores everything on S3. Note that the sample s3_storage.xml above defines only tiered_storage, so you would first add an s3_only policy (a single volume backed by the s3 disk) alongside it, then reference it:

```sql
CREATE TABLE archive_data
(
    date Date,
    data String
)
ENGINE = MergeTree()
ORDER BY date
SETTINGS storage_policy = 's3_only';
```

Verify S3 Integration
Check that S3 storage is configured correctly:
```sql
-- View configured disks
SELECT name, path FROM system.disks;

-- View storage policies
SELECT policy_name, disks FROM system.storage_policies;
```

Environment Variables
Configure these variables in the Klutch.sh dashboard:
- S3_ENDPOINT: S3 endpoint URL (e.g., https://s3.amazonaws.com)
- S3_BUCKET: S3 bucket name
- S3_ACCESS_KEY_ID: S3 access key
- S3_SECRET_ACCESS_KEY: S3 secret key
- S3_REGION: S3 region (e.g., us-east-1)
- CLICKHOUSE_USER: Admin username (default: default)
- CLICKHOUSE_PASSWORD: Strong admin password
Production Best Practices
Security
- Use strong passwords for ClickHouse authentication
- Enable S3 server-side encryption
- Store credentials as environment variables in Klutch.sh
- Regularly update ClickHouse to the latest version
Performance
- Allocate sufficient memory (minimum 2GB, recommended 8GB+)
- Configure adequate local cache size for S3 data
- Use appropriate partitioning strategies
- Monitor S3 API calls to optimize costs
Monitoring
Monitor key metrics for your deployment:
- Query execution times
- Memory and CPU usage
- S3 API call rates
- Storage utilization
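S3 API call rates in particular can be sampled from ClickHouse's internal counters. A minimal sketch, assuming the S3-prefixed events exposed by recent ClickHouse versions (exact event names vary by version):

```sql
-- Cumulative S3 request counters since server start
SELECT event, value
FROM system.events
WHERE event LIKE 'S3%';
```

Sampling this periodically and diffing the values gives a request rate you can compare against your provider's per-request pricing.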
Check system metrics:
```sql
-- View running queries
SELECT query_id, elapsed, query FROM system.processes;

-- Check storage usage
SELECT name, path, free_space, total_space FROM system.disks;
```

Troubleshooting
Cannot Connect to ClickHouse
- Verify app is running in Klutch.sh dashboard
- Confirm TCP traffic type is selected
- Check internal port is set to 9000
- Use port 8000 for external connections
S3 Connection Errors
- Verify S3 credentials in environment variables
- Check S3 bucket exists and is accessible
- Confirm S3 endpoint URL is correct
- Test S3 connectivity:
```sql
SELECT * FROM s3('https://your-bucket/test.csv', 'access_key', 'secret_key', 'CSV');
```

Slow Query Performance
- Increase local cache size for S3 disk
- Optimize table ORDER BY and PARTITION BY clauses
- Check which disk each table's parts currently reside on:

```sql
SELECT disk_name, count() AS parts
FROM system.parts
WHERE database = 'your_db'
GROUP BY disk_name;
```