Deploying Dremio

Introduction

Dremio is a powerful, open-source data lakehouse platform that transforms the way organizations query and analyze data. Built for modern data architectures, Dremio provides blazing-fast SQL queries directly on data lake storage (like S3, ADLS, HDFS) without the need for time-consuming ETL processes or data movement. It combines the flexibility of data lakes with the performance and simplicity of data warehouses.

Dremio is renowned for its:

  • Lightning-Fast Query Performance: Apache Arrow-based query engine delivers sub-second performance on massive datasets
  • Zero-Copy Architecture: Query data directly in place without moving or copying it to a proprietary format
  • Self-Service Data Access: Business users can explore and analyze data using familiar SQL and BI tools
  • Data Virtualization: Create a unified semantic layer across multiple data sources (data lakes, databases, warehouses)
  • Advanced Optimizations: Columnar Cloud Cache (C3), data reflections, and query acceleration for optimal performance
  • Open Standards: Built on Apache Arrow, Apache Iceberg, and open-source technologies for maximum interoperability
  • Enterprise Security: Fine-grained access controls, row-level security, and comprehensive audit logging

Common use cases include self-service analytics, data warehouse modernization, federated queries across multiple data sources, interactive BI dashboards, data science workloads, and building modern data lakehouse architectures.

This comprehensive guide walks you through deploying Dremio on Klutch.sh using Docker, including detailed installation steps, sample configurations, and production-ready best practices for persistent storage and optimal performance.

Prerequisites

Before you begin, ensure you have the following:

  • A Klutch.sh account
  • A GitHub account with a repository for your Dremio project
  • Docker installed locally for testing (optional but recommended)
  • Basic understanding of Docker and SQL concepts
  • (Optional) Access to data sources like S3, Azure Data Lake, or other databases you want to connect

Installation and Setup

Step 1: Create Your Project Directory

First, create a new directory for your Dremio deployment project:

Terminal window
mkdir dremio-klutch
cd dremio-klutch
git init

Step 2: Create the Dockerfile

Create a Dockerfile in your project root directory. This will define your Dremio container configuration:

FROM dremio/dremio-oss:latest
# Expose the web UI port (9047)
EXPOSE 9047
# Set memory allocation
# These can be adjusted based on your workload requirements
ENV DREMIO_MAX_HEAP_MEMORY_SIZE_MB=4096
ENV DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=8192
# Optional: Set additional Java options for performance tuning
ENV DREMIO_JAVA_SERVER_EXTRA_OPTS="-XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:MaxGCPauseMillis=500"
# The data directory where Dremio stores its metadata and local cache
# This should be mounted as a persistent volume
VOLUME ["/opt/dremio/data"]

Note: The dremio/dremio-oss image provides the open-source version of Dremio with all core features for data lakehouse analytics.

Step 3: (Optional) Create Custom Configuration

For advanced deployments, you can create a custom dremio.conf configuration file. Create a file named dremio.conf:

# dremio.conf - Custom Dremio Configuration
paths: {
  # Local path for Dremio to store internal data and metadata
  local: "/opt/dremio/data"
  # Distributed path for storing job results and downloads
  dist: "pdfs:///opt/dremio/data/pdfs"
  # Accelerator path for storing reflections (data acceleration structures)
  accelerator: "pdfs:///opt/dremio/data/accelerator"
  # Results path for query results
  results: "pdfs:///opt/dremio/data/results"
  # Scratch path for temporary files
  scratch: "pdfs:///opt/dremio/data/scratch"
}
services: {
  # Coordinator service configuration
  coordinator: {
    enabled: true,
    master: {
      enabled: true
    },
    # Web server configuration
    web: {
      port: 9047,
      ssl: {
        enabled: false
      }
    }
  },
  # Executor service configuration (for distributed deployments)
  executor: {
    enabled: true
  }
}
# Note: heap and direct memory are set via the DREMIO_MAX_HEAP_MEMORY_SIZE_MB
# and DREMIO_MAX_DIRECT_MEMORY_SIZE_MB environment variables (see the
# Dockerfile), and support keys such as reflection and cache options
# (e.g. "accelerator.enable", "dremio.cache.enabled") are managed in the
# Dremio UI, not in dremio.conf.

If you create a custom configuration file, update your Dockerfile to include it:

FROM dremio/dremio-oss:latest
# Copy custom configuration
COPY ./dremio.conf /opt/dremio/conf/dremio.conf
EXPOSE 9047
ENV DREMIO_MAX_HEAP_MEMORY_SIZE_MB=4096
ENV DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=8192
VOLUME ["/opt/dremio/data"]

Step 4: Test Locally (Optional)

Before deploying to Klutch.sh, you can test your Dremio setup locally:

Terminal window
# Build the Docker image
docker build -t my-dremio .
# Run the container with port mapping
docker run -d \
--name dremio-test \
-p 9047:9047 \
-v dremio-data:/opt/dremio/data \
my-dremio
# Wait for Dremio to start (usually takes 30-60 seconds)
echo "Waiting for Dremio to start..."
sleep 45
# Check if Dremio is running
docker logs dremio-test
# Access the web UI at http://localhost:9047
echo "Dremio should now be accessible at http://localhost:9047"
# Stop and remove the test container when done
docker stop dremio-test
docker rm dremio-test
docker volume rm dremio-data

On first startup, you’ll be prompted to create an admin user account through the web interface at http://localhost:9047.
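
If you prefer to script this first-run step (useful for disposable test environments), Dremio’s bootstrap endpoint can create the first user programmatically. A minimal sketch in Python; apiv2/bootstrap/firstuser is an internal endpoint commonly used for automation, so treat it as version-dependent:

import requests

# Create the first admin user on a fresh instance (works only before
# any user exists). Endpoint and auth header are internal/version-dependent.
resp = requests.put(
    'http://localhost:9047/apiv2/bootstrap/firstuser',
    headers={'Authorization': '_dremionull', 'Content-Type': 'application/json'},
    json={
        'userName': 'admin',
        'firstName': 'Admin',
        'lastName': 'User',
        'email': 'admin@example.com',
        'password': 'a-strong-password-123',
    },
)
resp.raise_for_status()
print('Admin user created')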

Step 5: Push to GitHub

Commit your Dockerfile and any configuration files to your GitHub repository:

Terminal window
git add Dockerfile
# If you created custom configuration files, add them too:
# git add dremio.conf
git commit -m "Add Dremio Dockerfile and configuration"
git remote add origin https://github.com/yourusername/dremio-klutch.git
git push -u origin main

Deploying to Klutch.sh

Now that your Dremio project is ready and pushed to GitHub, follow these steps to deploy it on Klutch.sh with persistent storage.

Deployment Steps

    1. Log in to Klutch.sh

      Navigate to klutch.sh/app and sign in to your account.

    2. Create a New Project

      Go to Create Project and give your project a meaningful name (e.g., “Dremio Data Lakehouse”).

    3. Create a New App

      Navigate to Create App and configure the following settings:

    4. Select Your Repository

      • Choose GitHub as your Git source
      • Select the repository containing your Dockerfile
      • Choose the branch you want to deploy (usually main or master)
    5. Configure Traffic Type

      • Traffic Type: Select HTTP (Dremio’s web UI uses HTTP for browser access)
      • Internal Port: Set to 9047 (the default Dremio web interface port that your container listens on)
    6. Set Environment Variables

      Add the following environment variables to optimize Dremio performance:

      • DREMIO_MAX_HEAP_MEMORY_SIZE_MB: Set to 4096 or higher based on your workload (e.g., 8192 for heavy usage)
      • DREMIO_MAX_DIRECT_MEMORY_SIZE_MB: Set to 8192 or higher (should be 2x heap memory)
      • DREMIO_JAVA_SERVER_EXTRA_OPTS: (Optional) Set to -XX:+UseG1GC -XX:G1HeapRegionSize=32M for G1 garbage collector

      Note: Dremio is memory-intensive. Ensure you allocate sufficient memory based on your expected data volume and query complexity.

    7. Attach a Persistent Volume

      This is critical for ensuring your Dremio metadata, reflections, and cache persist across deployments:

      • In the Volumes section, click “Add Volume”
      • Mount Path: Enter /opt/dremio/data (this is where Dremio stores all persistent data)
      • Size: Choose an appropriate size based on your expected data volume and reflections
        • Minimum: 20GB (for development/testing)
        • Recommended: 50GB-100GB (for production workloads)
        • Large Deployments: 200GB+ (for extensive reflections and caching)

      Important: Dremio stores metadata, query results, reflections (accelerated data structures), and Columnar Cloud Cache in this directory. Without persistent storage, you’ll lose all configuration and acceleration data on restart.

    8. Configure Additional Settings

      • Region: Select the region closest to your data sources or users for optimal latency
      • Compute Resources: Choose CPU and memory based on your workload
        • Minimum: 2 vCPUs, 8GB RAM (for development/testing)
        • Recommended: 4 vCPUs, 16GB RAM (for production workloads)
        • Heavy Analytics: 8+ vCPUs, 32GB+ RAM (for high-concurrency or complex queries)
      • Instances: Start with 1 instance (Dremio can be scaled with multiple executors in distributed mode)
    9. Deploy Dremio

      Click “Create” to start the deployment. Klutch.sh will:

      • Automatically detect your Dockerfile in the repository root
      • Build the Docker image with Dremio
      • Attach the persistent volume
      • Start your Dremio container
      • Assign a URL for accessing the web interface
    10. Initial Setup and Admin Account

      Once deployment is complete, you’ll receive a URL like example-app.klutch.sh. Navigate to this URL in your browser:

      https://example-app.klutch.sh

      On first access, you’ll be prompted to create an admin account:

      • Username: Choose an admin username
      • First Name: Your first name
      • Last Name: Your last name
      • Email: Your email address
      • Password: Create a strong password

      After creating the admin account, you’ll be logged into the Dremio web UI where you can start connecting data sources and running queries.
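
Dremio can take a minute or more to come up after the container starts. If you want to script a readiness check against the new deployment, a simple poll of the web UI is enough; a sketch, assuming the placeholder URL from the step above:

import time
import requests

base_url = 'https://example-app.klutch.sh'  # your Klutch.sh app URL

# Poll the web UI until Dremio responds; first boot can take 30-90 seconds
for _ in range(30):
    try:
        if requests.get(base_url, timeout=5).status_code == 200:
            print('Dremio is up')
            break
    except requests.RequestException:
        pass  # not reachable yet
    time.sleep(10)
else:
    raise SystemExit('Dremio did not become ready in time')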


Connecting Data Sources

One of Dremio’s most powerful features is its ability to connect to multiple data sources and provide a unified query interface. Here’s how to add common data sources:

Amazon S3 / S3-Compatible Storage

  1. In the Dremio UI, click “Add Source” in the left sidebar

  2. Select “Amazon S3”

  3. Configure the connection:

    • Name: Give your source a meaningful name (e.g., “Production S3 Bucket”)
    • AWS Access Key: Your AWS access key ID
    • AWS Secret Key: Your AWS secret access key
    • Encrypt connection: Enable for secure connections
    • Root Path: The S3 bucket path (e.g., my-bucket/data/)
    • Enable compatibility mode: Check if using S3-compatible storage (MinIO, Wasabi, etc.)
  4. Click “Save” to add the data source
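
Sources can also be registered programmatically through the catalog REST API, which is handy for repeatable environments. A hedged sketch in Python: the config field names (accessKey, accessSecret, secure, rootPath) mirror the S3 settings above but may differ between Dremio versions, so treat this as a starting point:

import requests

base_url = 'https://example-app.klutch.sh'

# Log in and build the Dremio auth header
token = requests.post(
    f'{base_url}/apiv2/login',
    json={'userName': 'admin', 'password': 'your-password'},
).json()['token']
headers = {'Authorization': f'_dremio{token}'}

# Register an S3 source via the catalog API
# (config field names are assumptions; verify against your Dremio version)
source = {
    'entityType': 'source',
    'name': 'production_s3',
    'type': 'S3',
    'config': {
        'accessKey': 'YOUR_AWS_ACCESS_KEY',
        'accessSecret': 'YOUR_AWS_SECRET_KEY',
        'secure': True,
        'rootPath': '/my-bucket/data',
    },
}
resp = requests.post(f'{base_url}/api/v3/catalog', headers=headers, json=source)
resp.raise_for_status()
print('Source created:', resp.json()['id'])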

PostgreSQL

  1. Click “Add Source” → “PostgreSQL”

  2. Configure:

    • Name: Source name (e.g., “Production PostgreSQL”)
    • Host: Database hostname
    • Port: 5432 (or your custom port)
    • Database: Database name
    • Username: Database username
    • Password: Database password
    • Encrypt connection: Enable for SSL connections
  3. Click “Save”

MySQL / MariaDB

  1. Click “Add Source” → “MySQL”

  2. Configure similar to PostgreSQL:

    • Name: Source name
    • Host: Database hostname
    • Port: 3306 (default MySQL port)
    • Database: Database name
    • Username: Database username
    • Password: Database password
  3. Click “Save”

MongoDB

  1. Click “Add Source” → “MongoDB”

  2. Configure:

    • Name: Source name
    • Connection String: MongoDB connection URI (e.g., mongodb://user:pass@host:27017)
    • Authentication Database: Usually “admin”
    • Use SSL: Enable for encrypted connections
  3. Click “Save”

Azure Data Lake Storage (ADLS)

  1. Click “Add Source” → “Azure Storage”

  2. Configure:

    • Name: Source name
    • Account Name: Azure storage account name
    • Access Key or OAuth 2.0: Choose authentication method
    • Container: Container name
    • Enable secure connection: Enable for HTTPS
  3. Click “Save”


Working with Dremio

Running SQL Queries

After connecting data sources, you can start querying data:

  1. Navigate to Data Sources: In the left sidebar, expand your connected data sources
  2. Preview Data: Click on any table to see a sample of the data
  3. Create a New Query: Click the “New Query” button in the top toolbar
  4. Write SQL: Use standard SQL syntax to query your data

Example Queries:

-- Query data from an S3 data lake
SELECT
  customer_id,
  product_name,
  SUM(order_amount) AS total_spent,
  COUNT(*) AS order_count
FROM S3.sales_data.orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id, product_name
ORDER BY total_spent DESC
LIMIT 100;

-- Join data across different sources (S3 + PostgreSQL)
SELECT
  o.order_id,
  o.order_date,
  o.total_amount,
  c.customer_name,
  c.email,
  c.region
FROM S3.sales_data.orders o
JOIN PostgreSQL.public.customers c ON o.customer_id = c.id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30' DAY
ORDER BY o.order_date DESC;

-- Aggregate analytics with window functions
SELECT
  date_trunc('day', event_timestamp) AS date,
  user_id,
  event_type,
  COUNT(*) AS event_count,
  SUM(COUNT(*)) OVER (PARTITION BY user_id ORDER BY date_trunc('day', event_timestamp)) AS cumulative_events
FROM S3.analytics.user_events
WHERE event_timestamp >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY date_trunc('day', event_timestamp), user_id, event_type
ORDER BY date, user_id;

Creating Virtual Datasets (Views)

Virtual datasets are saved views that simplify complex queries and provide reusable data models:

  1. Write a query in the SQL editor
  2. Click “Save As” → “Virtual Dataset”
  3. Choose a space (folder) and name for your dataset
  4. Click “Save”

Now you can query this virtual dataset like a table:

SELECT * FROM MySpace.customer_orders_summary
WHERE total_spent > 1000;
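
Virtual datasets can likewise be created over the catalog REST API by posting the defining SQL. A sketch, assuming the admin credentials and placeholder URL from earlier; the target space must already exist:

import requests

base_url = 'https://example-app.klutch.sh'
token = requests.post(
    f'{base_url}/apiv2/login',
    json={'userName': 'admin', 'password': 'your-password'},
).json()['token']

# Save a query as a virtual dataset in the MySpace space
vds = {
    'entityType': 'dataset',
    'type': 'VIRTUAL_DATASET',
    'path': ['MySpace', 'customer_orders_summary'],
    'sql': 'SELECT customer_id, SUM(order_amount) AS total_spent '
           'FROM S3.sales_data.orders GROUP BY customer_id',
}
resp = requests.post(
    f'{base_url}/api/v3/catalog',
    headers={'Authorization': f'_dremio{token}'},
    json=vds,
)
resp.raise_for_status()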

Data Reflections (Query Acceleration)

Reflections are Dremio’s automatic query acceleration feature. They create optimized, materialized aggregates and raw data copies:

  1. Navigate to a dataset you want to accelerate

  2. Click the “Reflections” tab

  3. Enable Raw Reflections (for fast access to raw data) or Aggregation Reflections (for pre-aggregated analytics)

  4. Configure the reflection:

    • Display Fields: Columns to include
    • Partition By: Columns to partition data (improves query pruning)
    • Sort By: Columns to sort data (improves query performance)
    • Distribution: How to distribute data across nodes
  5. Click “Save”

Dremio will automatically build and maintain these reflections, transparently accelerating matching queries by orders of magnitude.
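
Recent Dremio releases also expose reflection management as SQL DDL (ALTER DATASET ... CREATE AGGREGATE REFLECTION ...); the exact grammar varies by version, so check the SQL reference for your release. A hedged sketch that submits such DDL through the REST API:

import requests

base_url = 'https://example-app.klutch.sh'
token = requests.post(
    f'{base_url}/apiv2/login',
    json={'userName': 'admin', 'password': 'your-password'},
).json()['token']

# Reflection DDL (syntax is version-dependent; adjust per your release)
ddl = """
ALTER DATASET S3.sales_data.orders
CREATE AGGREGATE REFLECTION orders_by_customer
USING
DIMENSIONS (customer_id, product_name)
MEASURES (order_amount (SUM), order_id (COUNT))
"""
resp = requests.post(
    f'{base_url}/api/v3/sql',
    headers={'Authorization': f'_dremio{token}'},
    json={'sql': ddl},
)
resp.raise_for_status()
print('Reflection DDL submitted, job:', resp.json()['id'])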

Creating Spaces and Organizing Data

Spaces are folders for organizing your virtual datasets, views, and curated data:

  1. Click ”+” next to “Spaces” in the left sidebar
  2. Name your space (e.g., “Analytics”, “Sales Reports”, “Data Science”)
  3. Click “Save”
  4. Create folders within spaces for better organization
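
Spaces can be created through the catalog REST API as well; a short sketch under the same assumptions as the earlier API examples:

import requests

base_url = 'https://example-app.klutch.sh'
token = requests.post(
    f'{base_url}/apiv2/login',
    json={'userName': 'admin', 'password': 'your-password'},
).json()['token']

# Create a space named "Analytics" through the catalog API
resp = requests.post(
    f'{base_url}/api/v3/catalog',
    headers={'Authorization': f'_dremio{token}'},
    json={'entityType': 'space', 'name': 'Analytics'},
)
resp.raise_for_status()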

Connecting Applications to Dremio

Dremio supports multiple connection methods for integrating with applications and BI tools. When deployed on Klutch.sh, the primary access method is through the web UI, which Klutch.sh automatically routes to Dremio’s internal port 9047.

Web UI Access

The Dremio web interface is the primary way to interact with your deployment:

https://example-app.klutch.sh

Through the web UI, you can:

  • Execute SQL queries in the query editor
  • Connect and manage data sources
  • Create virtual datasets and views
  • Configure reflections for query acceleration
  • Manage users and permissions
  • Monitor query performance and system health

REST API

For programmatic access, use Dremio’s REST API through the same web UI endpoint:

Python Example:

import requests
import time
import json

# Base URL (same as web UI)
base_url = 'https://example-app.klutch.sh'

# Authenticate and get a token
auth_response = requests.post(
    f'{base_url}/apiv2/login',
    json={'userName': 'admin', 'password': 'your-password'}
)
token = auth_response.json()['token']
headers = {'Authorization': f'_dremio{token}'}

# Submit a SQL query; the response contains a job ID, not the rows
query_response = requests.post(
    f'{base_url}/api/v3/sql',
    headers=headers,
    json={'sql': 'SELECT * FROM S3.sales_data.orders LIMIT 10'}
)
job_id = query_response.json()['id']

# Wait for the job to complete
while True:
    state = requests.get(f'{base_url}/api/v3/job/{job_id}', headers=headers).json()['jobState']
    if state in ('COMPLETED', 'FAILED', 'CANCELED'):
        break
    time.sleep(1)

# Fetch the query results
results = requests.get(
    f'{base_url}/api/v3/job/{job_id}/results',
    headers=headers
)
print(json.dumps(results.json(), indent=2))

# Get catalog information
catalog_response = requests.get(
    f'{base_url}/api/v3/catalog',
    headers=headers
)
print(json.dumps(catalog_response.json(), indent=2))

JavaScript/Node.js Example:

const axios = require('axios');

const baseURL = 'https://example-app.klutch.sh';

async function queryDremio() {
  // Authenticate
  const authResponse = await axios.post(`${baseURL}/apiv2/login`, {
    userName: 'admin',
    password: 'your-password'
  });
  const token = authResponse.data.token;
  const headers = { Authorization: `_dremio${token}` };

  // Submit the query; the response contains a job ID, not the rows
  const queryResponse = await axios.post(
    `${baseURL}/api/v3/sql`,
    { sql: 'SELECT * FROM S3.sales_data.orders LIMIT 10' },
    { headers }
  );
  const jobId = queryResponse.data.id;

  // Wait for the job to complete
  let state;
  do {
    await new Promise(r => setTimeout(r, 1000));
    state = (await axios.get(`${baseURL}/api/v3/job/${jobId}`, { headers })).data.jobState;
  } while (state !== 'COMPLETED' && state !== 'FAILED' && state !== 'CANCELED');

  // Fetch the results
  const results = await axios.get(`${baseURL}/api/v3/job/${jobId}/results`, { headers });
  console.log(results.data);
}

queryDremio().catch(console.error);

JDBC Connection (Advanced)

For applications requiring JDBC connectivity, you can use the Dremio JDBC driver to connect to your deployment:

Connection String:

jdbc:dremio:direct=example-app.klutch.sh:31010;ssl=true

Note: JDBC clients connect to Dremio’s dedicated client port (31010 by default), not the 9047 web UI port. Make sure your deployment exposes this port for JDBC traffic; if only the HTTP web port is routed, use the REST API instead.

Java Example:

import java.sql.*;

public class DremioExample {
    public static void main(String[] args) throws SQLException {
        // Connect to Dremio's JDBC client port (31010 by default)
        String connectionUrl = "jdbc:dremio:direct=example-app.klutch.sh:31010;ssl=true";
        String username = "admin";
        String password = "your-password";
        try (Connection conn = DriverManager.getConnection(connectionUrl, username, password)) {
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT * FROM S3.sales_data.orders LIMIT 10");
            while (rs.next()) {
                System.out.println("Order ID: " + rs.getInt("order_id"));
                System.out.println("Amount: " + rs.getDouble("total_amount"));
            }
        }
    }
}

ODBC Connection

For ODBC connections:

  1. Download the Dremio ODBC driver from Dremio’s website
  2. Install the driver on your machine
  3. Configure a DSN (Data Source Name) with:
    • Host: example-app.klutch.sh
    • Port: 31010 (Dremio’s default client port; make sure your deployment exposes it)
    • Schema: Your default space/source
    • Authentication: Username/Password
    • Use SSL: Yes (recommended for production)
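
Once the DSN is configured, any ODBC-capable application can use it. For a quick smoke test from Python (assuming the pyodbc package is installed and the DSN is named Dremio):

import pyodbc

# Connect through the DSN configured above ('Dremio' is an example name)
conn = pyodbc.connect('DSN=Dremio;UID=admin;PWD=your-password', autocommit=True)
cursor = conn.cursor()
cursor.execute('SELECT * FROM S3.sales_data.orders LIMIT 5')
for row in cursor.fetchall():
    print(row)
conn.close()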

BI Tool Connections

Dremio integrates seamlessly with popular BI tools through ODBC or REST API connections:

Tableau:

  • Download the Dremio connector from Tableau’s connector gallery
  • Connect using hostname example-app.klutch.sh
  • Use HTTPS connection for secure access

Power BI:

  • Use the Dremio ODBC driver configured as described above
  • Configure as an ODBC data source in Power BI
  • Select the DSN you created for Dremio

Apache Superset:

  • Use the Dremio database connector in Superset
  • Connection string format (using the sqlalchemy-dremio Flight dialect): dremio+flight://admin:password@example-app.klutch.sh:32010/dremio

Looker:

  • Use Looker’s Dremio connection type
  • Configure with your deployment URL and credentials

Metabase:

  • Add Dremio as a database using the Dremio JDBC driver
  • Use the JDBC connection string with HTTPS enabled

Production Best Practices

Security Recommendations

  • Strong Passwords: Use complex passwords with at least 12 characters including uppercase, lowercase, numbers, and symbols
  • HTTPS/SSL: For production, configure SSL certificates for encrypted connections
  • Role-Based Access Control (RBAC): Create roles with specific permissions and assign users appropriately
  • Row-Level Security: Implement row-level security policies to restrict data access based on user attributes
  • Column-Level Security: Hide sensitive columns from unauthorized users
  • OAuth Integration: Integrate with OAuth 2.0 providers (Okta, Azure AD, etc.) for centralized authentication
  • Regular Security Audits: Monitor the audit log for suspicious activities

Performance Optimization

  • Reflections Strategy: Create reflections for frequently queried datasets, especially large aggregations
  • Partition Pruning: Use partitioned data sources and leverage partition columns in WHERE clauses
  • C3 (Columnar Cloud Cache): Enable C3 for frequently accessed cloud data to reduce latency
  • Query Profiles: Use Dremio’s query profiler to identify bottlenecks and optimize slow queries
  • Memory Allocation: Allocate sufficient heap and direct memory based on workload (minimum 8GB heap, 16GB direct)
  • Data Format: Use columnar formats (Parquet, ORC) for optimal query performance
  • Statistics: Ensure statistics are up-to-date on your data sources for better query planning

Data Management

  • Organize with Spaces: Create logical spaces for different teams, projects, or use cases
  • Virtual Datasets: Build reusable semantic layers with business-friendly column names and calculations
  • Data Governance: Implement tagging, descriptions, and wiki pages for datasets
  • Version Control: Use Git integration to version control your virtual datasets and views
  • Lifecycle Management: Implement data retention policies and archive old data

Resource Allocation

  • Memory Guidelines:
    • Development: 8GB heap, 16GB direct memory
    • Production: 16GB+ heap, 32GB+ direct memory
    • Heavy Workloads: 32GB+ heap, 64GB+ direct memory
  • CPU: More CPU cores improve parallelization (minimum 4 cores, recommended 8-16 cores)
  • Storage: Fast SSD storage for best performance with C3 and reflections
  • Network: Low-latency network connections to data sources

Monitoring

Monitor your Dremio deployment for:

  • Query Performance: Track slow queries and identify optimization opportunities
  • Reflection Coverage: Ensure reflections are accelerating your most important queries
  • Resource Usage: Monitor CPU, memory, and disk usage
  • User Activity: Track active users and query patterns
  • Data Source Connections: Monitor connection health to all data sources
  • Cache Hit Rates: Track C3 cache effectiveness

Access Metrics:

  1. Navigate to Admin → Job History for query logs
  2. Check Reflections tab for reflection build status and acceleration metrics
  3. Use Support → Download Support Key for detailed diagnostics
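
For programmatic monitoring, Dremio exposes system tables such as sys.reflections (names vary slightly across versions) that you can query like any other dataset. A sketch that pulls reflection status over the REST API used earlier:

import time
import requests

base_url = 'https://example-app.klutch.sh'
token = requests.post(
    f'{base_url}/apiv2/login',
    json={'userName': 'admin', 'password': 'your-password'},
).json()['token']
headers = {'Authorization': f'_dremio{token}'}

# Submit a query against the sys.reflections system table
job_id = requests.post(
    f'{base_url}/api/v3/sql',
    headers=headers,
    json={'sql': 'SELECT * FROM sys.reflections'},
).json()['id']

# Wait for the job, then print the reflection status rows
while requests.get(f'{base_url}/api/v3/job/{job_id}', headers=headers).json()['jobState'] not in ('COMPLETED', 'FAILED', 'CANCELED'):
    time.sleep(1)
rows = requests.get(f'{base_url}/api/v3/job/{job_id}/results', headers=headers).json()
print(rows)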

Troubleshooting

Cannot Access Web Interface

  • Verify the deployment is running in the Klutch.sh dashboard
  • Check that the internal port is set to 9047
  • Ensure your browser is not blocking the connection
  • Check container logs for startup errors

Dremio Startup is Slow

  • Dremio can take 30-90 seconds to start, especially on first launch
  • Check container logs: Look for “Dremio is ready” message
  • Increase memory allocation if you see OutOfMemory errors
  • Ensure persistent volume has sufficient space

Query Performance Issues

  • Check Reflections: Verify reflections are enabled and covering your queries
  • Review Query Profile: Use the query profile to identify bottlenecks (full table scans, large joins)
  • Optimize Data Format: Convert data to Parquet or ORC format
  • Partition Data: Use partitioned datasets for better pruning
  • Increase Memory: Allocate more heap and direct memory if queries are running out of memory
  • Check Data Source Performance: Slow queries might indicate issues with the underlying data source

Connection Timeouts to Data Sources

  • Verify network connectivity to data sources
  • Check firewall rules and security groups
  • Validate credentials and permissions
  • Test connection using native tools (aws s3 ls, psql, etc.)
  • Increase timeout settings in data source configuration

Reflection Failures

  • Check available disk space in persistent volume
  • Verify sufficient memory allocation
  • Review reflection settings (ensure columns exist)
  • Check data source for schema changes
  • Look at reflection job errors in Admin → Jobs

Memory Issues

  • Increase DREMIO_MAX_HEAP_MEMORY_SIZE_MB and DREMIO_MAX_DIRECT_MEMORY_SIZE_MB
  • Upgrade to a plan with more RAM
  • Reduce query complexity or result set size
  • Enable spilling to disk for large sorts and aggregations
  • Optimize reflections to reduce memory footprint

Lost Configuration After Restart

  • Verify persistent volume is correctly attached at /opt/dremio/data
  • Check volume has sufficient space
  • Ensure the container has write permissions to the volume
  • Review container logs for volume mount errors

Conclusion

Deploying Dremio to Klutch.sh with Docker provides a powerful, scalable data lakehouse platform for modern analytics. By following this guide, you’ve set up a production-ready Dremio instance with persistent storage, optimized for high-performance SQL queries across multiple data sources. Your data lakehouse is now ready to support self-service analytics, unified data access, and lightning-fast queries without the complexity of traditional data warehouses or time-consuming ETL processes. With Dremio’s zero-copy architecture and advanced optimization features, you can democratize data access across your organization while maintaining enterprise-grade performance and security.