Deploying a ClickHouse Database
Introduction
ClickHouse is an open-source, column-oriented database management system (DBMS) designed for online analytical processing (OLAP). Originally developed at Yandex and open-sourced in 2016, ClickHouse has rapidly become one of the fastest and most efficient databases for real-time analytics, capable of processing billions of rows and handling petabytes of data with exceptional query performance.
ClickHouse is renowned for its:
- Blazing Fast Query Performance: Processes billions of rows per second with sub-second query response times
- Column-Oriented Storage: Optimized for analytical queries, reading only necessary columns for tremendous speed gains
- Data Compression: Achieves 10x-100x compression ratios, significantly reducing storage requirements
- Distributed Architecture: Native support for sharding and replication across multiple nodes
- SQL Support: Full SQL compatibility with powerful extensions for analytics and time-series data
- Real-Time Ingestion: Handles millions of inserts per second with immediate data availability for queries
- Cost-Effective: Exceptional performance-to-cost ratio compared to traditional data warehouses
Common use cases include web and application analytics, log analysis, time-series data processing, business intelligence dashboards, real-time monitoring, and data warehousing for petabyte-scale datasets.
This comprehensive guide walks you through deploying ClickHouse on Klutch.sh using Docker, including detailed installation steps, sample configurations, and production-ready best practices for persistent storage and optimal performance.
Prerequisites
Before you begin, ensure you have the following:
- A Klutch.sh account
- A GitHub account with a repository for your ClickHouse project
- Docker installed locally for testing (optional but recommended)
- Basic understanding of Docker and database concepts
Installation and Setup
Step 1: Create Your Project Directory
First, create a new directory for your ClickHouse deployment project:
```bash
mkdir clickhouse-klutch
cd clickhouse-klutch
git init
```

Step 2: Create the Dockerfile
Create a Dockerfile in your project root directory. This will define your ClickHouse container configuration:
```dockerfile
FROM clickhouse/clickhouse-server:24-alpine

# Set default environment variables
# These can be overridden in the Klutch.sh dashboard
ENV CLICKHOUSE_DB=default
ENV CLICKHOUSE_USER=default
ENV CLICKHOUSE_PASSWORD=ClickHouse123!

# Expose the HTTP interface port (primary)
EXPOSE 8123

# Expose the native protocol port (for clickhouse-client)
EXPOSE 9000

# Optional: Copy custom configuration files
# Configuration files placed in /etc/clickhouse-server/config.d/
# will be merged with the default configuration
# COPY ./config.xml /etc/clickhouse-server/config.d/custom-config.xml
# COPY ./users.xml /etc/clickhouse-server/users.d/custom-users.xml
```

Note: The ClickHouse Alpine image is lightweight and well suited to production deployments. ClickHouse exposes multiple ports: 8123 for the HTTP interface (REST API), 9000 for the native protocol (CLI and drivers), 9009 for interserver communication, and 8443/9440 for HTTPS/TLS.
Step 3: (Optional) Create Custom Configuration
For production deployments, you can create custom configuration files. Create a file named config.xml:
```xml
<!-- config.xml - Custom ClickHouse Configuration -->
<clickhouse>
    <!-- Listen on all interfaces -->
    <listen_host>::</listen_host>

    <!-- Logging configuration -->
    <logger>
        <level>information</level>
        <console>true</console>
    </logger>

    <!-- Query settings -->
    <max_concurrent_queries>100</max_concurrent_queries>
    <max_server_memory_usage_to_ram_ratio>0.9</max_server_memory_usage_to_ram_ratio>

    <!-- Performance tuning -->
    <mark_cache_size>5368709120</mark_cache_size>
    <max_table_size_to_drop>0</max_table_size_to_drop>

    <!-- Enable query log for monitoring -->
    <query_log>
        <database>system</database>
        <table>query_log</table>
    </query_log>
</clickhouse>
```

You can also create a custom users configuration file named users.xml:

```xml
<!-- users.xml - Custom User Configuration -->
<clickhouse>
    <users>
        <default>
            <password_sha256_hex><!-- SHA256 hash of your password --></password_sha256_hex>
            <networks>
                <ip>::/0</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
        </default>
    </users>
</clickhouse>
```

If you create custom configuration files, uncomment the COPY lines in your Dockerfile to include them.
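The password_sha256_hex element expects the hex-encoded SHA-256 digest of your password, not the password itself. One way to produce it is with ClickHouse's own SHA256 function from any running instance; a minimal sketch (the password shown is a placeholder):

```sql
-- Compute the hex-encoded SHA-256 digest of a password
-- (equivalent to `echo -n 'MyS3curePassword!' | sha256sum` in a shell)
SELECT lower(hex(SHA256('MyS3curePassword!'))) AS password_sha256_hex;
```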
Step 4: (Optional) Create Initialization Scripts
ClickHouse supports initialization scripts that run when the database first starts. Create a file named init.sql:
```sql
-- init.sql - Database initialization script
-- This script runs automatically on first startup

-- Create a database for analytics
CREATE DATABASE IF NOT EXISTS analytics;

-- Create a sample events table with MergeTree engine
CREATE TABLE IF NOT EXISTS analytics.events (
    event_time DateTime,
    event_date Date DEFAULT toDate(event_time),
    user_id UInt32,
    event_type String,
    page_url String,
    country_code FixedString(2),
    device_type LowCardinality(String),
    session_id String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id, event_time)
SETTINGS index_granularity = 8192;

-- Create a materialized view for aggregated statistics
CREATE MATERIALIZED VIEW IF NOT EXISTS analytics.daily_stats
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, event_type)
AS SELECT
    event_date,
    event_type,
    count() AS event_count,
    uniq(user_id) AS unique_users
FROM analytics.events
GROUP BY event_date, event_type;

-- Create a data-skipping index for faster user_id lookups
CREATE INDEX IF NOT EXISTS idx_user_id ON analytics.events (user_id) TYPE minmax GRANULARITY 1;

-- Insert sample data
INSERT INTO analytics.events (event_time, user_id, event_type, page_url, country_code, device_type, session_id) VALUES
    (now(), 1001, 'page_view', '/home', 'US', 'desktop', 'session_abc123'),
    (now(), 1002, 'click', '/products', 'UK', 'mobile', 'session_def456'),
    (now(), 1003, 'page_view', '/about', 'CA', 'tablet', 'session_ghi789');
```

To include initialization scripts in your deployment, update your Dockerfile:
```dockerfile
FROM clickhouse/clickhouse-server:24-alpine

ENV CLICKHOUSE_DB=default
ENV CLICKHOUSE_USER=default
ENV CLICKHOUSE_PASSWORD=ClickHouse123!

EXPOSE 8123
EXPOSE 9000

# Copy initialization scripts
COPY ./init.sql /docker-entrypoint-initdb.d/
```

Step 5: Test Locally (Optional)
Before deploying to Klutch.sh, you can test your ClickHouse setup locally:
```bash
# Build the Docker image
docker build -t my-clickhouse .

# Run the container
docker run -d \
  --name clickhouse-test \
  -p 8123:8123 \
  -p 9000:9000 \
  -e CLICKHOUSE_PASSWORD=mysecretpassword \
  my-clickhouse

# Wait a few seconds for ClickHouse to start
sleep 5

# Test the HTTP interface
curl 'http://localhost:8123/?query=SELECT%20version()'

# Or connect with clickhouse-client (if installed)
clickhouse-client --host localhost --port 9000 --user default --password mysecretpassword --query "SELECT version()"

# Stop and remove the test container when done
docker stop clickhouse-test
docker rm clickhouse-test
```

Step 6: Push to GitHub
Commit your Dockerfile and any configuration files to your GitHub repository:
```bash
git add Dockerfile init.sql config.xml users.xml
git commit -m "Add ClickHouse Dockerfile and configuration files"
git remote add origin https://github.com/yourusername/clickhouse-klutch.git
git push -u origin main
```

Connecting to ClickHouse
Once deployed, you can connect to your ClickHouse database from any application using the HTTP interface or native protocol. Since Klutch.sh routes TCP traffic through port 8000, use the following connection methods:
HTTP Interface (REST API)
The HTTP interface is the simplest way to interact with ClickHouse and works with any HTTP client:
```bash
# Query using curl
curl 'http://example-app.klutch.sh:8000/?query=SELECT%20version()'

# Insert data
curl 'http://example-app.klutch.sh:8000/?query=INSERT%20INTO%20analytics.events%20VALUES' \
  --data-binary @data.csv

# With authentication
curl -u default:ClickHouse123! \
  'http://example-app.klutch.sh:8000/?query=SELECT%20count()%20FROM%20analytics.events'
```

Connection String Format
For TCP connections (native protocol):
```
clickhouse://default:ClickHouse123!@example-app.klutch.sh:8000/default
```

Replace:

- default with your database username
- ClickHouse123! with your database password
- example-app.klutch.sh with your actual Klutch.sh app URL
- default at the end with your database name
Example Connection Code
Node.js (using @clickhouse/client):
```javascript
const { createClient } = require('@clickhouse/client');

const client = createClient({
  host: 'http://example-app.klutch.sh:8000',
  username: 'default',
  password: 'ClickHouse123!',
  database: 'default',
});

async function query() {
  const resultSet = await client.query({
    query: 'SELECT version()',
    format: 'JSONEachRow',
  });

  const data = await resultSet.json();
  console.log('ClickHouse version:', data);
}

query().catch(console.error);
```

Python (using clickhouse-driver):
Note: clickhouse-driver speaks ClickHouse's native TCP protocol, so this example only works if your app's internal port routes to 9000 (native protocol) rather than the HTTP port 8123 used elsewhere in this guide. With the HTTP setup described above, prefer the @clickhouse/client or requests examples.

```python
from datetime import datetime

from clickhouse_driver import Client

client = Client(
    host='example-app.klutch.sh',
    port=8000,
    user='default',
    password='ClickHouse123!',
    database='default'
)

# Execute query
result = client.execute('SELECT version()')
print('ClickHouse version:', result)

# Insert data
client.execute(
    'INSERT INTO analytics.events (event_time, user_id, event_type, page_url, country_code, device_type, session_id) VALUES',
    [(datetime.now(), 1004, 'page_view', '/products', 'US', 'mobile', 'session_xyz')]
)

# Query with parameters
events = client.execute(
    'SELECT * FROM analytics.events WHERE user_id = %(user_id)s',
    {'user_id': 1001}
)
```

Python (using HTTP client - requests):
```python
import json

import requests

# Define connection parameters
base_url = 'http://example-app.klutch.sh:8000'
auth = ('default', 'ClickHouse123!')

# Execute query
response = requests.get(
    f'{base_url}/',
    params={'query': 'SELECT version()'},
    auth=auth
)
print('ClickHouse version:', response.text)

# Insert JSON data
data = [
    {'event_time': '2024-01-15 10:00:00', 'user_id': 1005, 'event_type': 'click',
     'page_url': '/checkout', 'country_code': 'FR', 'device_type': 'desktop',
     'session_id': 'session_123'}
]

response = requests.post(
    f'{base_url}/',
    params={'query': 'INSERT INTO analytics.events FORMAT JSONEachRow'},
    data='\n'.join(json.dumps(row) for row in data),
    auth=auth
)
```

Go (using clickhouse-go):
```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
	conn, err := clickhouse.Open(&clickhouse.Options{
		Addr: []string{"example-app.klutch.sh:8000"},
		Auth: clickhouse.Auth{
			Database: "default",
			Username: "default",
			Password: "ClickHouse123!",
		},
		Protocol: clickhouse.HTTP,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Query data
	var version string
	if err := conn.QueryRow(context.Background(), "SELECT version()").Scan(&version); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("ClickHouse version: %s\n", version)

	// Insert data (event_date is omitted; it is filled by its DEFAULT expression)
	batch, err := conn.PrepareBatch(context.Background(),
		"INSERT INTO analytics.events (event_time, user_id, event_type, page_url, country_code, device_type, session_id)")
	if err != nil {
		log.Fatal(err)
	}

	err = batch.Append(
		time.Now(),
		uint32(1006),
		"page_view",
		"/products",
		"US",
		"mobile",
		"session_abc",
	)
	if err != nil {
		log.Fatal(err)
	}

	if err := batch.Send(); err != nil {
		log.Fatal(err)
	}
}
```

Java (using clickhouse-jdbc):
```java
import com.clickhouse.jdbc.ClickHouseDataSource;
import java.sql.*;
import java.util.Properties;

public class ClickHouseExample {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:clickhouse://example-app.klutch.sh:8000/default";
        Properties properties = new Properties();
        properties.setProperty("user", "default");
        properties.setProperty("password", "ClickHouse123!");

        ClickHouseDataSource dataSource = new ClickHouseDataSource(url, properties);

        try (Connection conn = dataSource.getConnection();
             Statement stmt = conn.createStatement()) {

            // Query data
            ResultSet rs = stmt.executeQuery("SELECT version()");
            if (rs.next()) {
                System.out.println("ClickHouse version: " + rs.getString(1));
            }

            // Insert data
            PreparedStatement pstmt = conn.prepareStatement(
                "INSERT INTO analytics.events (event_time, user_id, event_type, page_url, country_code, device_type, session_id) VALUES (?, ?, ?, ?, ?, ?, ?)"
            );
            pstmt.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
            pstmt.setInt(2, 1007);
            pstmt.setString(3, "click");
            pstmt.setString(4, "/products");
            pstmt.setString(5, "US");
            pstmt.setString(6, "desktop");
            pstmt.setString(7, "session_xyz");
            pstmt.executeUpdate();
        }
    }
}
```

Deploying to Klutch.sh
Now that your ClickHouse project is ready and pushed to GitHub, follow these steps to deploy it on Klutch.sh with persistent storage for optimal performance.
Deployment Steps
1. Log in to Klutch.sh

   Navigate to klutch.sh/app and sign in to your account.

2. Create a New Project

   Go to Create Project and give your project a meaningful name (e.g., “ClickHouse Analytics Database”).

3. Create a New App

   Navigate to Create App and configure the following settings.

4. Select Your Repository

   - Choose GitHub as your Git source
   - Select the repository containing your Dockerfile
   - Choose the branch you want to deploy (usually main or master)

5. Configure Traffic Type

   - Traffic Type: Select TCP (ClickHouse requires TCP traffic for database connections)
   - Internal Port: Set to 8123 (the default ClickHouse HTTP interface port that your container listens on)

   Note: While ClickHouse also uses port 9000 for the native protocol, port 8123 (HTTP interface) is recommended because it’s more universally compatible and easier to work with for most applications.

6. Set Environment Variables

   Add the following environment variables for your ClickHouse configuration:

   - CLICKHOUSE_DB: the name of your database (e.g., default or analytics)
   - CLICKHOUSE_USER: the database username (e.g., default)
   - CLICKHOUSE_PASSWORD: a strong password for your database user (use a secure password generator)
   - CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT: (optional) set to 1 to enable SQL-based user management (see the sketch after this step)

   Security Note: Always use strong, unique passwords for production databases. ClickHouse passwords should be at least 12 characters with mixed case, numbers, and special characters.
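With SQL-based user management enabled, you can create least-privilege users in SQL instead of editing users.xml. A minimal sketch, run as the default user once the server is up (user names, passwords, and grants are placeholders to adapt):

```sql
-- A read-only identity for dashboards and reporting
CREATE USER IF NOT EXISTS app_reader IDENTIFIED WITH sha256_password BY 'UseAStrongPasswordHere!';
GRANT SELECT ON analytics.* TO app_reader;

-- A separate write-only identity for the ingestion pipeline
CREATE USER IF NOT EXISTS app_writer IDENTIFIED WITH sha256_password BY 'AnotherStrongPassword!';
GRANT INSERT ON analytics.events TO app_writer;
```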
7. Attach a Persistent Volume

   This is critical for ensuring your database data persists across deployments and restarts:

   - In the Volumes section, click “Add Volume”
   - Mount Path: Enter /var/lib/clickhouse (this is where ClickHouse stores its data files, tables, and metadata)
   - Size: Choose an appropriate size based on your expected data volume (e.g., 20GB, 50GB, 100GB)

   Important: ClickHouse requires persistent storage to maintain your data between container restarts. Storage requirements depend on your data volume, but plan for roughly 3-10x compression of your raw data; once data is loaded, you can measure the real ratio with the query below.
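Compression ratios vary with your schema and data, so it is worth measuring rather than guessing when sizing the volume. A sketch against the built-in system.columns table, assuming the analytics database from the init script:

```sql
-- Compare on-disk (compressed) size with raw (uncompressed) size per table
SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.columns
WHERE database = 'analytics'
GROUP BY table
ORDER BY sum(data_compressed_bytes) DESC;
```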
8. Configure Additional Settings

   - Region: Select the region closest to your users or data sources for optimal latency
   - Compute Resources: Choose CPU and memory based on your workload
     - Minimum: 1 vCPU, 2GB RAM (for development/testing)
     - Recommended: 2+ vCPUs, 4GB+ RAM (for production workloads)
     - Heavy Analytics: 4+ vCPUs, 8GB+ RAM (for high-throughput analytics)
   - Instances: Start with 1 instance (ClickHouse can be scaled horizontally later with clustering)

9. Deploy Your Database

   Click “Create” to start the deployment. Klutch.sh will:

   - Automatically detect your Dockerfile in the repository root
   - Build the Docker image with ClickHouse
   - Attach the persistent volume
   - Start your ClickHouse container
   - Assign a URL for external connections

10. Access Your Database

    Once deployment is complete, you’ll receive a URL like example-app.klutch.sh. You can connect to your ClickHouse database using this URL on port 8000.

    HTTP Interface:

    ```bash
    curl -u default:ClickHouse123! \
      'http://example-app.klutch.sh:8000/?query=SELECT%20version()'
    ```

    Connection String:

    ```
    clickhouse://default:ClickHouse123!@example-app.klutch.sh:8000/default
    ```
Working with ClickHouse
Creating Tables
ClickHouse offers several table engines optimized for different use cases. The most common is MergeTree:
```sql
-- Time-series events table
CREATE TABLE events (
    timestamp DateTime,
    date Date DEFAULT toDate(timestamp),
    user_id UInt64,
    event_name String,
    properties String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, user_id, timestamp)
SETTINGS index_granularity = 8192;

-- Aggregated metrics table with SummingMergeTree
CREATE TABLE metrics (
    date Date,
    metric_name String,
    value Float64
) ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, metric_name);

-- Replicated table for high availability
CREATE TABLE distributed_events ON CLUSTER my_cluster (
    timestamp DateTime,
    user_id UInt64,
    event_name String
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/distributed_events', '{replica}')
PARTITION BY toYYYYMM(toDate(timestamp))
ORDER BY (toDate(timestamp), user_id, timestamp);
```

Inserting Data
```sql
-- Insert single row
INSERT INTO events (timestamp, user_id, event_name, properties)
VALUES (now(), 12345, 'page_view', '{"page": "/home"}');

-- Insert multiple rows
INSERT INTO events (timestamp, user_id, event_name, properties) VALUES
    (now(), 12345, 'page_view', '{"page": "/home"}'),
    (now(), 67890, 'click', '{"button": "signup"}'),
    (now(), 12345, 'page_view', '{"page": "/products"}');
```

To insert from a file (CSV, TSV, JSON) via HTTP:

```bash
curl -u default:ClickHouse123! \
  'http://example-app.klutch.sh:8000/?query=INSERT%20INTO%20events%20FORMAT%20CSV' \
  --data-binary @data.csv
```

Querying Data
ClickHouse SQL supports advanced analytics functions:
```sql
-- Basic aggregation
SELECT
    toDate(timestamp) AS date,
    event_name,
    count() AS event_count,
    uniq(user_id) AS unique_users
FROM events
WHERE date >= today() - 7
GROUP BY date, event_name
ORDER BY date DESC, event_count DESC;

-- Time-series analysis with window functions
SELECT
    toStartOfInterval(timestamp, INTERVAL 1 HOUR) AS hour,
    event_name,
    count() AS events,
    uniq(user_id) AS unique_users,
    avg(events) OVER (PARTITION BY event_name ORDER BY hour ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS moving_avg
FROM events
WHERE timestamp >= now() - INTERVAL 24 HOUR
GROUP BY hour, event_name
ORDER BY hour DESC;

-- User retention cohort analysis
SELECT
    toStartOfWeek(first_seen) AS cohort_week,
    dateDiff('week', cohort_week, toStartOfWeek(timestamp)) AS weeks_since_signup,
    uniq(user_id) AS users
FROM (
    SELECT
        user_id,
        timestamp,
        min(timestamp) OVER (PARTITION BY user_id) AS first_seen
    FROM events
)
GROUP BY cohort_week, weeks_since_signup
ORDER BY cohort_week DESC, weeks_since_signup;

-- Top N query with LIMIT BY
SELECT
    date,
    user_id,
    count() AS events
FROM events
WHERE date >= today() - 30
GROUP BY date, user_id
ORDER BY date DESC, events DESC
LIMIT 10 BY date;
```

Materialized Views
Materialized views in ClickHouse automatically aggregate data as it’s inserted:
```sql
-- Create a materialized view for hourly statistics
CREATE MATERIALIZED VIEW hourly_stats
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, event_name)
AS SELECT
    toStartOfHour(timestamp) AS hour,
    event_name,
    count() AS event_count,
    uniq(user_id) AS unique_users
FROM events
GROUP BY hour, event_name;

-- Query the materialized view (much faster than the base table)
SELECT * FROM hourly_stats
WHERE hour >= now() - INTERVAL 7 DAY
ORDER BY hour DESC;
```

Production Best Practices
Security Recommendations
- Use Strong Passwords: Never use default passwords in production. Generate strong, random passwords with at least 12 characters.
- Environment Variables: Store sensitive credentials as environment variables in Klutch.sh, never in your Dockerfile or configuration files.
- User Management: Create separate users with limited permissions for different applications and use cases.
- Network Security: Consider using HTTPS/TLS for encrypted connections (requires custom configuration).
- Access Control: Implement row-level security and column-level permissions for sensitive data (see the sketch after this list).
- Regular Security Updates: Keep your ClickHouse version updated with the latest security patches.
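Row-level and column-level restrictions can be expressed directly in SQL. A minimal sketch, assuming the analytics.events table and an app_reader user like the ones shown earlier:

```sql
-- Row policy: app_reader only sees rows with country_code = 'US'
CREATE ROW POLICY us_only ON analytics.events
FOR SELECT USING country_code = 'US' TO app_reader;

-- Column-level grant: expose only non-sensitive columns
GRANT SELECT(event_time, event_type, page_url) ON analytics.events TO app_reader;
```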
Performance Optimization
- Proper Table Design: Choose the right table engine (MergeTree, SummingMergeTree, AggregatingMergeTree) based on your use case.
- Partitioning Strategy: Partition tables by time periods (daily, monthly) for efficient data management and faster queries.
- Sorting Key: Choose an optimal ORDER BY clause based on your most common query patterns.
- Compression: ClickHouse automatically compresses data, but you can tune compression codecs for specific columns (see the sketch after this list).
- Materialized Views: Pre-aggregate frequently queried data with materialized views to reduce query latency.
- Sampling: Use sampling for approximate queries on large datasets (SAMPLE 0.1 for a 10% sample; see the sketch after this list).
- Dictionaries: Use dictionaries for dimension tables and lookups to improve join performance.
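To make the codec and sampling points concrete, here is a sketch with an illustrative schema: per-column codecs are declared in the table definition, and SAMPLE queries require a SAMPLE BY clause whose expression appears in the sorting key.

```sql
-- Per-column codecs plus a sampling key (illustrative schema)
CREATE TABLE events_sampled (
    timestamp DateTime CODEC(Delta, ZSTD),   -- delta-encode timestamps, then ZSTD
    user_id UInt64,
    event_name LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (toDate(timestamp), intHash32(user_id), timestamp)
SAMPLE BY intHash32(user_id);

-- Approximate counts from a 10% sample; scale the result accordingly
SELECT event_name, count() * 10 AS estimated_events
FROM events_sampled SAMPLE 0.1
GROUP BY event_name;
```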
Data Management
- Partitioning: Use time-based partitioning to efficiently manage the data lifecycle:

```sql
-- Drop old partitions to free space
ALTER TABLE events DROP PARTITION '202301';

-- Optimize tables regularly to merge parts
OPTIMIZE TABLE events FINAL;
```

- TTL (Time To Live): Automatically delete old data:

```sql
CREATE TABLE events (
    timestamp DateTime,
    event_name String,
    user_id UInt64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id)
TTL timestamp + INTERVAL 90 DAY; -- Delete data older than 90 days
```

- Backups: Implement a regular backup strategy (note that a disk named 'backups' must be declared in the server configuration for this to work):

```sql
-- Backup table
BACKUP TABLE events TO Disk('backups', 'events_backup.zip');

-- Restore table
RESTORE TABLE events FROM Disk('backups', 'events_backup.zip');
```
Resource Allocation
- Memory: Allocate sufficient memory for your workload. ClickHouse can use most available RAM for caching.
- CPU: More CPU cores improve query parallelization and overall throughput.
- Storage: Use fast SSD storage for best performance. Plan for compression (typically 3-10x).
- Connection Pooling: Use connection pooling in your applications to manage database connections efficiently.
Monitoring
Monitor your ClickHouse database for:
- Query Performance: Track slow queries using the system.query_log table
- Resource Usage: Monitor CPU, memory, disk I/O, and network usage
- Insert Rate: Monitor data ingestion rate and batch sizes
- Merge Activity: Track background merge operations
- Replication Lag: If using replication, monitor lag between replicas
- Disk Space: Monitor partition sizes and overall storage usage
```sql
-- View recent slow queries
SELECT
    query_duration_ms,
    query,
    user,
    read_rows,
    read_bytes
FROM system.query_log
WHERE event_time >= now() - INTERVAL 1 HOUR
  AND query_duration_ms > 1000
ORDER BY query_duration_ms DESC
LIMIT 10;

-- Check table sizes
SELECT
    database,
    table,
    formatReadableSize(sum(bytes)) AS size,
    sum(rows) AS rows,
    max(modification_time) AS latest_modification
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes) DESC;
```

Troubleshooting
Cannot Connect to Database
- Verify that you’re using the correct connection string with port 8000
- Ensure your environment variables (username/password) are set correctly in Klutch.sh
- Check that the internal port is set to 8123 in your app configuration
- Verify the ClickHouse container is running and healthy in the Klutch.sh dashboard
Database Not Persisting Data
- Verify that the persistent volume is correctly attached at /var/lib/clickhouse
- Check that the volume has sufficient space allocated (see the disk query below)
- Ensure the ClickHouse container has proper permissions to write to the volume
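To confirm from inside ClickHouse that the volume is mounted and has free space, query the built-in system.disks table:

```sql
-- Show each disk ClickHouse knows about, with free and total space
SELECT
    name,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total
FROM system.disks;
```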
Performance Issues
- Review your table schema and ensure a proper ORDER BY clause for your query patterns
- Consider creating materialized views for frequently aggregated queries
- Check if queries are using indexes effectively (use the EXPLAIN statement)
- Monitor resource usage and consider increasing compute resources (CPU/memory)
- Optimize partition size (avoid too many small partitions or too few large ones); the query below shows parts per partition
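A quick way to spot partitioning problems is to count active parts per partition; many small parts usually mean inserts are too granular or background merges are falling behind. A sketch against system.parts, assuming the events table from earlier:

```sql
-- Active parts and on-disk size per partition of the events table
SELECT
    partition,
    count() AS parts,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'events' AND active
GROUP BY partition
ORDER BY partition DESC;
```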
Out of Memory Errors
- Increase the memory allocation for your ClickHouse container
- Tune the max_memory_usage and max_bytes_before_external_group_by settings (a lower spill threshold makes large GROUP BYs use disk instead of RAM; see the sketch below)
- Optimize queries to process less data (use better filters, sampling, or materialized views)
- Add LIMIT clauses to queries that return too many rows
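Both limits can be applied per query with a SETTINGS clause, so a single heavy query cannot exhaust the server. A sketch with illustrative values:

```sql
-- Cap this query at ~4 GB of RAM and spill the GROUP BY to disk past ~2 GB
SELECT event_name, uniq(user_id) AS unique_users
FROM events
GROUP BY event_name
SETTINGS max_memory_usage = 4000000000,
         max_bytes_before_external_group_by = 2000000000;
```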
Slow Queries
- Use EXPLAIN to analyze the query execution plan
- Check if proper indexes and sorting keys are being used
- Consider creating materialized views for complex aggregations
- Optimize WHERE clause conditions to filter data early
- Use the PREWHERE clause to filter before reading all columns (see the sketch after this list)
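A sketch of both techniques against the events table from earlier: EXPLAIN with indexes = 1 shows whether partition pruning and the primary key are used, while PREWHERE reads only the filter column first and fetches the remaining columns for matching rows.

```sql
-- Inspect index usage for a query
EXPLAIN indexes = 1
SELECT count() FROM events WHERE date >= today() - 7;

-- PREWHERE: evaluate the cheap, selective filter before reading other columns
SELECT user_id, event_name, properties
FROM events
PREWHERE event_name = 'page_view'
WHERE date >= today() - 7;
```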
Disk Space Issues
- Drop old partitions that are no longer needed
- Implement TTL policies to automatically delete old data (a TTL can also be added to an existing table; see the sketch below)
- Run OPTIMIZE TABLE ... FINAL to merge small parts and reclaim space
- Increase the persistent volume size in Klutch.sh if needed
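A TTL does not have to be part of the original CREATE TABLE; it can be added or changed later and applied to data already on disk. A sketch, assuming the events table from earlier:

```sql
-- Add or change a TTL on an existing table
ALTER TABLE events MODIFY TTL timestamp + INTERVAL 30 DAY;

-- Apply the new TTL to data that is already on disk
ALTER TABLE events MATERIALIZE TTL;
```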
Additional Resources
- Klutch.sh Documentation
- Official ClickHouse Documentation
- ClickHouse Docker Image Documentation
- Klutch.sh Volumes Guide
- Klutch.sh Networking Guide
- ClickHouse Interfaces Overview
- ClickHouse Performance Guide
Conclusion
Deploying ClickHouse to Klutch.sh with Docker provides a powerful, scalable analytics database solution with exceptional query performance and persistent storage. By following this guide, you’ve set up a production-ready ClickHouse database optimized for real-time analytics, with proper data persistence, security configurations, and connection capabilities. Your database is now ready to handle billions of rows and support high-performance analytical workloads for your applications, from real-time dashboards to complex data warehousing tasks.