Deploying Dremio

Introduction

Dremio is a powerful, open-source data lakehouse platform that transforms the way organizations query and analyze data. Built for modern data architectures, Dremio provides blazing-fast SQL queries directly on data lake storage (like S3, ADLS, HDFS) without the need for time-consuming ETL processes or data movement. It combines the flexibility of data lakes with the performance and simplicity of data warehouses.

Dremio is renowned for its:

  • Lightning-Fast Query Performance: Apache Arrow-based query engine delivers sub-second performance on massive datasets
  • Zero-Copy Architecture: Query data directly in place without moving or copying it to a proprietary format
  • Self-Service Data Access: Business users can explore and analyze data using familiar SQL and BI tools
  • Data Virtualization: Create a unified semantic layer across multiple data sources (data lakes, databases, warehouses)
  • Advanced Optimizations: Columnar Cloud Cache (C3), data reflections, and query acceleration for optimal performance
  • Open Standards: Built on Apache Arrow, Apache Iceberg, and open-source technologies for maximum interoperability
  • Enterprise Security: Fine-grained access controls, row-level security, and comprehensive audit logging

Common use cases include self-service analytics, data warehouse modernization, federated queries across multiple data sources, interactive BI dashboards, data science workloads, and building modern data lakehouse architectures.

This comprehensive guide walks you through deploying Dremio on Klutch.sh using Docker, including detailed installation steps, sample configurations, and production-ready best practices for persistent storage and optimal performance.

Prerequisites

Before you begin, ensure you have the following:

  • A Klutch.sh account
  • A GitHub account with a repository for your Dremio project
  • Docker installed locally for testing (optional but recommended)
  • Basic understanding of Docker and SQL concepts
  • (Optional) Access to data sources like S3, Azure Data Lake, or other databases you want to connect

Installation and Setup

Step 1: Create Your Project Directory

First, create a new directory for your Dremio deployment project:

Terminal window
mkdir dremio-klutch
cd dremio-klutch
git init

Step 2: Create the Dockerfile

Create a Dockerfile in your project root directory. This will define your Dremio container configuration:

FROM dremio/dremio-oss:latest
# Expose the web UI port (9047)
EXPOSE 9047
# Set memory allocation
# These can be adjusted based on your workload requirements
ENV DREMIO_MAX_HEAP_MEMORY_SIZE_MB=4096
ENV DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=8192
# Optional: Set additional Java options for performance tuning
ENV DREMIO_JAVA_SERVER_EXTRA_OPTS="-XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:MaxGCPauseMillis=500"
# The data directory where Dremio stores its metadata and local cache
# This should be mounted as a persistent volume
VOLUME ["/opt/dremio/data"]

Note: The dremio/dremio-oss image provides the open-source version of Dremio with all core features for data lakehouse analytics.

Step 3: (Optional) Create Custom Configuration

For advanced deployments, you can create a custom dremio.conf configuration file. Create a file named dremio.conf:

# dremio.conf - Custom Dremio Configuration
paths: {
  # Local path for Dremio to store internal data and metadata
  local: "/opt/dremio/data"
  # Distributed path for storing job results and downloads
  dist: "pdfs:///opt/dremio/data/pdfs"
  # Accelerator path for storing reflections (data acceleration structures)
  accelerator: "pdfs:///opt/dremio/data/accelerator"
  # Results path for query results
  results: "pdfs:///opt/dremio/data/results"
  # Scratch path for temporary files
  scratch: "pdfs:///opt/dremio/data/scratch"
}
services: {
  # Coordinator service configuration
  coordinator: {
    enabled: true,
    master: {
      enabled: true
    },
    # Web server configuration
    web: {
      port: 9047,
      ssl: {
        enabled: false
      }
    }
  },
  # Executor service configuration (for distributed deployments)
  executor: {
    enabled: true
  }
}
# Note: heap and direct memory are set via the DREMIO_MAX_HEAP_MEMORY_SIZE_MB
# and DREMIO_MAX_DIRECT_MEMORY_SIZE_MB environment variables (see the
# Dockerfile), and support keys such as reflection and cache options
# (e.g. "accelerator.enable", "dremio.cache.enabled") are managed in the
# Dremio UI, not in dremio.conf.

If you create a custom configuration file, update your Dockerfile to include it:

FROM dremio/dremio-oss:latest
# Copy custom configuration
COPY ./dremio.conf /opt/dremio/conf/dremio.conf
EXPOSE 9047
ENV DREMIO_MAX_HEAP_MEMORY_SIZE_MB=4096
ENV DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=8192
VOLUME ["/opt/dremio/data"]

Step 4: Test Locally (Optional)

Before deploying to Klutch.sh, you can test your Dremio setup locally:

Terminal window
# Build the Docker image
docker build -t my-dremio .
# Run the container with port mapping
docker run -d \
--name dremio-test \
-p 9047:9047 \
-v dremio-data:/opt/dremio/data \
my-dremio
# Wait for Dremio to start (usually takes 30-60 seconds)
echo "Waiting for Dremio to start..."
sleep 45
# Check if Dremio is running
docker logs dremio-test
# Access the web UI at http://localhost:9047
echo "Dremio should now be accessible at http://localhost:9047"
# Stop and remove the test container when done
docker stop dremio-test
docker rm dremio-test
docker volume rm dremio-data

On first startup, you’ll be prompted to create an admin user account through the web interface at http://localhost:9047.
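
If you prefer to script this first-run step (useful for disposable test environments), Dremio’s bootstrap endpoint can create the first user programmatically. A minimal sketch in Python; apiv2/bootstrap/firstuser is an internal endpoint commonly used for automation, so treat it as version-dependent:

import requests

# Create the first admin user on a fresh instance (works only before
# any user exists). Endpoint and auth header are internal/version-dependent.
resp = requests.put(
    'http://localhost:9047/apiv2/bootstrap/firstuser',
    headers={'Authorization': '_dremionull', 'Content-Type': 'application/json'},
    json={
        'userName': 'admin',
        'firstName': 'Admin',
        'lastName': 'User',
        'email': 'admin@example.com',
        'password': 'a-strong-password-123',
    },
)
resp.raise_for_status()
print('Admin user created')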

Step 5: Push to GitHub

Commit your Dockerfile and any configuration files to your GitHub repository:

Terminal window
git add Dockerfile
# If you created custom configuration files, add them too:
# git add dremio.conf
git commit -m "Add Dremio Dockerfile and configuration"
git remote add origin https://github.com/yourusername/dremio-klutch.git
git push -u origin main

Deploying to Klutch.sh

Now that your Dremio project is ready and pushed to GitHub, follow these steps to deploy it on Klutch.sh with persistent storage.

Deployment Steps

    1. Log in to Klutch.sh

      Navigate to klutch.sh/app and sign in to your account.

    2. Create a New Project

      Go to Create Project and give your project a meaningful name (e.g., “Dremio Data Lakehouse”).

    3. Create a New App

      Navigate to Create App and configure the following settings:

    4. Select Your Repository

      • Choose GitHub as your Git source
      • Select the repository containing your Dockerfile
      • Choose the branch you want to deploy (usually main or master)
    5. Configure Traffic Type

      • Traffic Type: Select HTTP (Dremio’s web UI uses HTTP for browser access)
      • Internal Port: Set to 9047 (the default Dremio web interface port that your container listens on)
    6. Set Environment Variables

      Add the following environment variables to optimize Dremio performance:

      • DREMIO_MAX_HEAP_MEMORY_SIZE_MB: Set to 4096 or higher based on your workload (e.g., 8192 for heavy usage)
      • DREMIO_MAX_DIRECT_MEMORY_SIZE_MB: Set to 8192 or higher (should be 2x heap memory)
      • DREMIO_JAVA_SERVER_EXTRA_OPTS: (Optional) Set to -XX:+UseG1GC -XX:G1HeapRegionSize=32M for G1 garbage collector

      Note: Dremio is memory-intensive. Ensure you allocate sufficient memory based on your expected data volume and query complexity.

    7. Attach a Persistent Volume

      This is critical for ensuring your Dremio metadata, reflections, and cache persist across deployments:

      • In the Volumes section, click “Add Volume”
      • Mount Path: Enter /opt/dremio/data (this is where Dremio stores all persistent data)
      • Size: Choose an appropriate size based on your expected data volume and reflections
        • Minimum: 20GB (for development/testing)
        • Recommended: 50GB-100GB (for production workloads)
        • Large Deployments: 200GB+ (for extensive reflections and caching)

      Important: Dremio stores metadata, query results, reflections (accelerated data structures), and Columnar Cloud Cache in this directory. Without persistent storage, you’ll lose all configuration and acceleration data on restart.

    8. Configure Additional Settings

      • Region: Select the region closest to your data sources or users for optimal latency
      • Compute Resources: Choose CPU and memory based on your workload
        • Minimum: 2 vCPUs, 8GB RAM (for development/testing)
        • Recommended: 4 vCPUs, 16GB RAM (for production workloads)
        • Heavy Analytics: 8+ vCPUs, 32GB+ RAM (for high-concurrency or complex queries)
      • Instances: Start with 1 instance (Dremio can be scaled with multiple executors in distributed mode)
    9. Deploy Dremio

      Click “Create” to start the deployment. Klutch.sh will:

      • Automatically detect your Dockerfile in the repository root
      • Build the Docker image with Dremio
      • Attach the persistent volume
      • Start your Dremio container
      • Assign a URL for accessing the web interface
    10. Initial Setup and Admin Account

      Once deployment is complete, you’ll receive a URL like example-app.klutch.sh. Navigate to this URL in your browser:

      https://example-app.klutch.sh

      On first access, you’ll be prompted to create an admin account:

      • Username: Choose an admin username
      • First Name: Your first name
      • Last Name: Your last name
      • Email: Your email address
      • Password: Create a strong password

      After creating the admin account, you’ll be logged into the Dremio web UI where you can start connecting data sources and running queries.
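
Dremio can take a minute or more to come up after the container starts. If you want to script a readiness check against the new deployment, a simple poll of the web UI is enough; a sketch, assuming the placeholder URL from the step above:

import time
import requests

base_url = 'https://example-app.klutch.sh'  # your Klutch.sh app URL

# Poll the web UI until Dremio responds; first boot can take 30-90 seconds
for _ in range(30):
    try:
        if requests.get(base_url, timeout=5).status_code == 200:
            print('Dremio is up')
            break
    except requests.RequestException:
        pass  # not reachable yet
    time.sleep(10)
else:
    raise SystemExit('Dremio did not become ready in time')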


Connecting Data Sources

One of Dremio’s most powerful features is its ability to connect to multiple data sources and provide a unified query interface. Here’s how to add common data sources:

Amazon S3 / S3-Compatible Storage

  1. In the Dremio UI, click “Add Source” in the left sidebar

  2. Select “Amazon S3”

  3. Configure the connection:

    • Name: Give your source a meaningful name (e.g., “Production S3 Bucket”)
    • AWS Access Key: Your AWS access key ID
    • AWS Secret Key: Your AWS secret access key
    • Encrypt connection: Enable for secure connections
    • Root Path: The S3 bucket path (e.g., my-bucket/data/)
    • Enable compatibility mode: Check if using S3-compatible storage (MinIO, Wasabi, etc.)
  4. Click “Save” to add the data source
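
Sources can also be registered programmatically through the catalog REST API, which is handy for repeatable environments. A hedged sketch in Python: the config field names (accessKey, accessSecret, secure, rootPath) mirror the S3 settings above but may differ between Dremio versions, so treat this as a starting point:

import requests

base_url = 'https://example-app.klutch.sh'

# Log in and build the Dremio auth header
token = requests.post(
    f'{base_url}/apiv2/login',
    json={'userName': 'admin', 'password': 'your-password'},
).json()['token']
headers = {'Authorization': f'_dremio{token}'}

# Register an S3 source via the catalog API
# (config field names are assumptions; verify against your Dremio version)
source = {
    'entityType': 'source',
    'name': 'production_s3',
    'type': 'S3',
    'config': {
        'accessKey': 'YOUR_AWS_ACCESS_KEY',
        'accessSecret': 'YOUR_AWS_SECRET_KEY',
        'secure': True,
        'rootPath': '/my-bucket/data',
    },
}
resp = requests.post(f'{base_url}/api/v3/catalog', headers=headers, json=source)
resp.raise_for_status()
print('Source created:', resp.json()['id'])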

PostgreSQL

  1. Click “Add Source” → “PostgreSQL”

  2. Configure:

    • Name: Source name (e.g., “Production PostgreSQL”)
    • Host: Database hostname
    • Port: 5432 (or your custom port)
    • Database: Database name
    • Username: Database username
    • Password: Database password
    • Encrypt connection: Enable for SSL connections
  3. Click “Save”

MySQL / MariaDB

  1. Click “Add Source” → “MySQL”

  2. Configure similar to PostgreSQL:

    • Name: Source name
    • Host: Database hostname
    • Port: 3306 (default MySQL port)
    • Database: Database name
    • Username: Database username
    • Password: Database password
  3. Click “Save”

MongoDB

  1. Click “Add Source” → “MongoDB”

  2. Configure:

    • Name: Source name
    • Connection String: MongoDB connection URI (e.g., mongodb://user:pass@host:27017)
    • Authentication Database: Usually “admin”
    • Use SSL: Enable for encrypted connections
  3. Click “Save”

Azure Data Lake Storage (ADLS)

  1. Click “Add Source” → “Azure Storage”

  2. Configure:

    • Name: Source name
    • Account Name: Azure storage account name
    • Access Key or OAuth 2.0: Choose authentication method
    • Container: Container name
    • Enable secure connection: Enable for HTTPS
  3. Click “Save”


Working with Dremio

Running SQL Queries

After connecting data sources, you can start querying data:

  1. Navigate to Data Sources: In the left sidebar, expand your connected data sources
  2. Preview Data: Click on any table to see a sample of the data
  3. Create a New Query: Click the “New Query” button in the top toolbar
  4. Write SQL: Use standard SQL syntax to query your data

Example Queries:

-- Query data from an S3 data lake
SELECT
  customer_id,
  product_name,
  SUM(order_amount) AS total_spent,
  COUNT(*) AS order_count
FROM S3.sales_data.orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id, product_name
ORDER BY total_spent DESC
LIMIT 100;

-- Join data across different sources (S3 + PostgreSQL)
SELECT
  o.order_id,
  o.order_date,
  o.total_amount,
  c.customer_name,
  c.email,
  c.region
FROM S3.sales_data.orders o
JOIN PostgreSQL.public.customers c ON o.customer_id = c.id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30' DAY
ORDER BY o.order_date DESC;

-- Aggregate analytics with window functions
SELECT
  date_trunc('day', event_timestamp) AS date,
  user_id,
  event_type,
  COUNT(*) AS event_count,
  SUM(COUNT(*)) OVER (PARTITION BY user_id ORDER BY date_trunc('day', event_timestamp)) AS cumulative_events
FROM S3.analytics.user_events
WHERE event_timestamp >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY date_trunc('day', event_timestamp), user_id, event_type
ORDER BY date, user_id;

Creating Virtual Datasets (Views)

Virtual datasets are saved views that simplify complex queries and provide reusable data models:

  1. Write a query in the SQL editor
  2. Click “Save As” → “Virtual Dataset”
  3. Choose a space (folder) and name for your dataset
  4. Click “Save”

Now you can query this virtual dataset like a table:

SELECT * FROM MySpace.customer_orders_summary
WHERE total_spent > 1000;
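
Virtual datasets can likewise be created over the catalog REST API by posting the defining SQL. A sketch, assuming the admin credentials and placeholder URL from earlier; the target space must already exist:

import requests

base_url = 'https://example-app.klutch.sh'
token = requests.post(
    f'{base_url}/apiv2/login',
    json={'userName': 'admin', 'password': 'your-password'},
).json()['token']

# Save a query as a virtual dataset in the MySpace space
vds = {
    'entityType': 'dataset',
    'type': 'VIRTUAL_DATASET',
    'path': ['MySpace', 'customer_orders_summary'],
    'sql': 'SELECT customer_id, SUM(order_amount) AS total_spent '
           'FROM S3.sales_data.orders GROUP BY customer_id',
}
resp = requests.post(
    f'{base_url}/api/v3/catalog',
    headers={'Authorization': f'_dremio{token}'},
    json=vds,
)
resp.raise_for_status()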

Data Reflections (Query Acceleration)

Reflections are Dremio’s automatic query acceleration feature. They create optimized, materialized aggregates and raw data copies:

  1. Navigate to a dataset you want to accelerate

  2. Click the “Reflections” tab

  3. Enable Raw Reflections (for fast access to raw data) or Aggregation Reflections (for pre-aggregated analytics)

  4. Configure the reflection:

    • Display Fields: Columns to include
    • Partition By: Columns to partition data (improves query pruning)
    • Sort By: Columns to sort data (improves query performance)
    • Distribution: How to distribute data across nodes
  5. Click “Save”

Dremio will automatically build and maintain these reflections, transparently accelerating matching queries by orders of magnitude.
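
Recent Dremio releases also expose reflection management as SQL DDL (ALTER DATASET ... CREATE AGGREGATE REFLECTION ...); the exact grammar varies by version, so check the SQL reference for your release. A hedged sketch that submits such DDL through the REST API:

import requests

base_url = 'https://example-app.klutch.sh'
token = requests.post(
    f'{base_url}/apiv2/login',
    json={'userName': 'admin', 'password': 'your-password'},
).json()['token']

# Reflection DDL (syntax is version-dependent; adjust per your release)
ddl = """
ALTER DATASET S3.sales_data.orders
CREATE AGGREGATE REFLECTION orders_by_customer
USING
DIMENSIONS (customer_id, product_name)
MEASURES (order_amount (SUM), order_id (COUNT))
"""
resp = requests.post(
    f'{base_url}/api/v3/sql',
    headers={'Authorization': f'_dremio{token}'},
    json={'sql': ddl},
)
resp.raise_for_status()
print('Reflection DDL submitted, job:', resp.json()['id'])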

Creating Spaces and Organizing Data

Spaces are folders for organizing your virtual datasets, views, and curated data:

  1. Click ”+” next to “Spaces” in the left sidebar
  2. Name your space (e.g., “Analytics”, “Sales Reports”, “Data Science”)
  3. Click “Save”
  4. Create folders within spaces for better organization
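
Spaces can be created through the catalog REST API as well; a short sketch under the same assumptions as the earlier API examples:

import requests

base_url = 'https://example-app.klutch.sh'
token = requests.post(
    f'{base_url}/apiv2/login',
    json={'userName': 'admin', 'password': 'your-password'},
).json()['token']

# Create a space named "Analytics" through the catalog API
resp = requests.post(
    f'{base_url}/api/v3/catalog',
    headers={'Authorization': f'_dremio{token}'},
    json={'entityType': 'space', 'name': 'Analytics'},
)
resp.raise_for_status()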

Connecting Applications to Dremio

Dremio supports multiple connection methods for integrating with applications and BI tools. When deployed on Klutch.sh, the primary access method is through the web UI, which Klutch.sh automatically routes to Dremio’s internal port 9047.

Web UI Access

The Dremio web interface is the primary way to interact with your deployment:

https://example-app.klutch.sh

Through the web UI, you can:

  • Execute SQL queries in the query editor
  • Connect and manage data sources
  • Create virtual datasets and views
  • Configure reflections for query acceleration
  • Manage users and permissions
  • Monitor query performance and system health

REST API

For programmatic access, use Dremio’s REST API through the same web UI endpoint:

Python Example:

import requests
import time
import json

# Base URL (same as web UI)
base_url = 'https://example-app.klutch.sh'

# Authenticate and get a token
auth_response = requests.post(
    f'{base_url}/apiv2/login',
    json={'userName': 'admin', 'password': 'your-password'}
)
token = auth_response.json()['token']
headers = {'Authorization': f'_dremio{token}'}

# Submit a SQL query; the response contains a job ID, not the rows
query_response = requests.post(
    f'{base_url}/api/v3/sql',
    headers=headers,
    json={'sql': 'SELECT * FROM S3.sales_data.orders LIMIT 10'}
)
job_id = query_response.json()['id']

# Wait for the job to complete
while True:
    state = requests.get(f'{base_url}/api/v3/job/{job_id}', headers=headers).json()['jobState']
    if state in ('COMPLETED', 'FAILED', 'CANCELED'):
        break
    time.sleep(1)

# Fetch the query results
results = requests.get(
    f'{base_url}/api/v3/job/{job_id}/results',
    headers=headers
)
print(json.dumps(results.json(), indent=2))

# Get catalog information
catalog_response = requests.get(
    f'{base_url}/api/v3/catalog',
    headers=headers
)
print(json.dumps(catalog_response.json(), indent=2))

JavaScript/Node.js Example:

const axios = require('axios');

const baseURL = 'https://example-app.klutch.sh';

async function queryDremio() {
  // Authenticate
  const authResponse = await axios.post(`${baseURL}/apiv2/login`, {
    userName: 'admin',
    password: 'your-password'
  });
  const token = authResponse.data.token;
  const headers = { Authorization: `_dremio${token}` };

  // Submit the query; the response contains a job ID, not the rows
  const queryResponse = await axios.post(
    `${baseURL}/api/v3/sql`,
    { sql: 'SELECT * FROM S3.sales_data.orders LIMIT 10' },
    { headers }
  );
  const jobId = queryResponse.data.id;

  // Wait for the job to complete
  let state;
  do {
    await new Promise(r => setTimeout(r, 1000));
    state = (await axios.get(`${baseURL}/api/v3/job/${jobId}`, { headers })).data.jobState;
  } while (state !== 'COMPLETED' && state !== 'FAILED' && state !== 'CANCELED');

  // Fetch the results
  const results = await axios.get(`${baseURL}/api/v3/job/${jobId}/results`, { headers });
  console.log(results.data);
}

queryDremio().catch(console.error);

JDBC Connection (Advanced)

For applications requiring JDBC connectivity, you can use the Dremio JDBC driver to connect to your deployment:

Connection String:

jdbc:dremio:direct=example-app.klutch.sh:31010;ssl=true

Note: JDBC clients connect to Dremio’s dedicated client port (31010 by default), not the 9047 web UI port. Make sure your deployment exposes this port for JDBC traffic; if only the HTTP web port is routed, use the REST API instead.

Java Example:

import java.sql.*;

public class DremioExample {
    public static void main(String[] args) throws SQLException {
        // Connect to Dremio's JDBC client port (31010 by default)
        String connectionUrl = "jdbc:dremio:direct=example-app.klutch.sh:31010;ssl=true";
        String username = "admin";
        String password = "your-password";
        try (Connection conn = DriverManager.getConnection(connectionUrl, username, password)) {
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT * FROM S3.sales_data.orders LIMIT 10");
            while (rs.next()) {
                System.out.println("Order ID: " + rs.getInt("order_id"));
                System.out.println("Amount: " + rs.getDouble("total_amount"));
            }
        }
    }
}

ODBC Connection

For ODBC connections:

  1. Download the Dremio ODBC driver from Dremio’s website
  2. Install the driver on your machine
  3. Configure a DSN (Data Source Name) with:
    • Host: example-app.klutch.sh
    • Port: 31010 (Dremio’s default client port; make sure your deployment exposes it)
    • Schema: Your default space/source
    • Authentication: Username/Password
    • Use SSL: Yes (recommended for production)
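
Once the DSN is configured, any ODBC-capable application can use it. For a quick smoke test from Python (assuming the pyodbc package is installed and the DSN is named Dremio):

import pyodbc

# Connect through the DSN configured above ('Dremio' is an example name)
conn = pyodbc.connect('DSN=Dremio;UID=admin;PWD=your-password', autocommit=True)
cursor = conn.cursor()
cursor.execute('SELECT * FROM S3.sales_data.orders LIMIT 5')
for row in cursor.fetchall():
    print(row)
conn.close()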

BI Tool Connections

Dremio integrates seamlessly with popular BI tools through ODBC or REST API connections:

Tableau:

  • Download the Dremio connector from Tableau’s connector gallery
  • Connect using hostname example-app.klutch.sh
  • Use HTTPS connection for secure access

Power BI:

  • Use the Dremio ODBC driver configured as described above
  • Configure as an ODBC data source in Power BI
  • Select the DSN you created for Dremio

Apache Superset:

  • Use the Dremio database connector in Superset
  • Connection string format (using the sqlalchemy-dremio Flight dialect): dremio+flight://admin:password@example-app.klutch.sh:32010/dremio

Looker:

  • Use Looker’s Dremio connection type
  • Configure with your deployment URL and credentials

Metabase:

  • Add Dremio as a database using the Dremio JDBC driver
  • Use the JDBC connection string with HTTPS enabled

Production Best Practices

Security Recommendations

  • Strong Passwords: Use complex passwords with at least 12 characters including uppercase, lowercase, numbers, and symbols
  • HTTPS/SSL: For production, configure SSL certificates for encrypted connections
  • Role-Based Access Control (RBAC): Create roles with specific permissions and assign users appropriately
  • Row-Level Security: Implement row-level security policies to restrict data access based on user attributes
  • Column-Level Security: Hide sensitive columns from unauthorized users
  • OAuth Integration: Integrate with OAuth 2.0 providers (Okta, Azure AD, etc.) for centralized authentication
  • Regular Security Audits: Monitor the audit log for suspicious activities

Performance Optimization

  • Reflections Strategy: Create reflections for frequently queried datasets, especially large aggregations
  • Partition Pruning: Use partitioned data sources and leverage partition columns in WHERE clauses
  • C3 (Columnar Cloud Cache): Enable C3 for frequently accessed cloud data to reduce latency
  • Query Profiles: Use Dremio’s query profiler to identify bottlenecks and optimize slow queries
  • Memory Allocation: Allocate sufficient heap and direct memory based on workload (minimum 8GB heap, 16GB direct)
  • Data Format: Use columnar formats (Parquet, ORC) for optimal query performance
  • Statistics: Ensure statistics are up-to-date on your data sources for better query planning

Data Management

  • Organize with Spaces: Create logical spaces for different teams, projects, or use cases
  • Virtual Datasets: Build reusable semantic layers with business-friendly column names and calculations
  • Data Governance: Implement tagging, descriptions, and wiki pages for datasets
  • Version Control: Use Git integration to version control your virtual datasets and views
  • Lifecycle Management: Implement data retention policies and archive old data

Resource Allocation

  • Memory Guidelines:
    • Development: 8GB heap, 16GB direct memory
    • Production: 16GB+ heap, 32GB+ direct memory
    • Heavy Workloads: 32GB+ heap, 64GB+ direct memory
  • CPU: More CPU cores improve parallelization (minimum 4 cores, recommended 8-16 cores)
  • Storage: Fast SSD storage for best performance with C3 and reflections
  • Network: Low-latency network connections to data sources

Monitoring

Monitor your Dremio deployment for:

  • Query Performance: Track slow queries and identify optimization opportunities
  • Reflection Coverage: Ensure reflections are accelerating your most important queries
  • Resource Usage: Monitor CPU, memory, and disk usage
  • User Activity: Track active users and query patterns
  • Data Source Connections: Monitor connection health to all data sources
  • Cache Hit Rates: Track C3 cache effectiveness

Access Metrics:

  1. Navigate to Admin → Job History for query logs
  2. Check Reflections tab for reflection build status and acceleration metrics
  3. Use Support → Download Support Key for detailed diagnostics
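
For programmatic monitoring, Dremio exposes system tables such as sys.reflections (names vary slightly across versions) that you can query like any other dataset. A sketch that pulls reflection status over the REST API used earlier:

import time
import requests

base_url = 'https://example-app.klutch.sh'
token = requests.post(
    f'{base_url}/apiv2/login',
    json={'userName': 'admin', 'password': 'your-password'},
).json()['token']
headers = {'Authorization': f'_dremio{token}'}

# Submit a query against the sys.reflections system table
job_id = requests.post(
    f'{base_url}/api/v3/sql',
    headers=headers,
    json={'sql': 'SELECT * FROM sys.reflections'},
).json()['id']

# Wait for the job, then print the reflection status rows
while requests.get(f'{base_url}/api/v3/job/{job_id}', headers=headers).json()['jobState'] not in ('COMPLETED', 'FAILED', 'CANCELED'):
    time.sleep(1)
rows = requests.get(f'{base_url}/api/v3/job/{job_id}/results', headers=headers).json()
print(rows)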

Troubleshooting

Cannot Access Web Interface

  • Verify the deployment is running in the Klutch.sh dashboard
  • Check that the internal port is set to 9047
  • Ensure your browser is not blocking the connection
  • Check container logs for startup errors

Dremio Startup is Slow

  • Dremio can take 30-90 seconds to start, especially on first launch
  • Check container logs: Look for “Dremio is ready” message
  • Increase memory allocation if you see OutOfMemory errors
  • Ensure persistent volume has sufficient space

Query Performance Issues

  • Check Reflections: Verify reflections are enabled and covering your queries
  • Review Query Profile: Use the query profile to identify bottlenecks (full table scans, large joins)
  • Optimize Data Format: Convert data to Parquet or ORC format
  • Partition Data: Use partitioned datasets for better pruning
  • Increase Memory: Allocate more heap and direct memory if queries are running out of memory
  • Check Data Source Performance: Slow queries might indicate issues with the underlying data source

Connection Timeouts to Data Sources

  • Verify network connectivity to data sources
  • Check firewall rules and security groups
  • Validate credentials and permissions
  • Test connection using native tools (aws s3 ls, psql, etc.)
  • Increase timeout settings in data source configuration

Reflection Failures

  • Check available disk space in persistent volume
  • Verify sufficient memory allocation
  • Review reflection settings (ensure columns exist)
  • Check data source for schema changes
  • Look at reflection job errors in Admin → Jobs

Memory Issues

  • Increase DREMIO_MAX_HEAP_MEMORY_SIZE_MB and DREMIO_MAX_DIRECT_MEMORY_SIZE_MB
  • Upgrade to a plan with more RAM
  • Reduce query complexity or result set size
  • Enable spilling to disk for large sorts and aggregations
  • Optimize reflections to reduce memory footprint

Lost Configuration After Restart

  • Verify persistent volume is correctly attached at /opt/dremio/data
  • Check volume has sufficient space
  • Ensure the container has write permissions to the volume
  • Review container logs for volume mount errors

Conclusion

Deploying Dremio to Klutch.sh with Docker provides a powerful, scalable data lakehouse platform for modern analytics. By following this guide, you’ve set up a production-ready Dremio instance with persistent storage, optimized for high-performance SQL queries across multiple data sources. Your data lakehouse is now ready to support self-service analytics, unified data access, and lightning-fast queries without the complexity of traditional data warehouses or time-consuming ETL processes. With Dremio’s zero-copy architecture and advanced optimization features, you can democratize data access across your organization while maintaining enterprise-grade performance and security.