Deploying an Aleph App
Introduction
Aleph is an open-source document management and investigation platform designed for journalists, researchers, and investigators to collect, analyze, and share large collections of documents. Built in Python, Aleph provides powerful search capabilities, entity extraction, document processing, and collaborative investigation tools.
Aleph is renowned for its:
- Document Processing: Advanced document parsing and text extraction from various file formats
- Entity Extraction: Automatic extraction of people, organizations, locations, and other entities from documents
- Full-Text Search: Powerful search engine with faceted search and filtering capabilities
- Investigation Tools: Collaborative features for teams working on investigations
- Data Visualization: Interactive visualizations of relationships between entities
- Multi-Format Support: Support for PDFs, images, emails, spreadsheets, and more
- Access Control: Fine-grained permissions and access control for sensitive investigations
- API Access: RESTful API for programmatic access and integration
- Scalability: Designed to handle large document collections efficiently
- Open Source: Fully open-source with active community development
Common use cases include investigative journalism, research projects, document archives, compliance investigations, legal case management, and collaborative data analysis.
This comprehensive guide walks you through deploying Aleph on Klutch.sh using a Dockerfile, including detailed installation steps, PostgreSQL database configuration, persistent storage setup, and production-ready best practices for hosting a document management and investigation platform.
Prerequisites
Before you begin, ensure you have the following:
- A Klutch.sh account
- A GitHub account with a repository for your Aleph project
- A PostgreSQL database (can be deployed separately on Klutch.sh or use an external database)
- Docker installed locally for testing (optional but recommended)
- Basic understanding of Python, Django, and document management systems
- Sufficient storage capacity for document collections (plan accordingly)
Installation and Setup
Step 1: Create Your Project Directory
First, create a new directory for your Aleph deployment project:
```shell
mkdir aleph-klutch
cd aleph-klutch
git init
```

Step 2: Clone or Prepare Aleph Source
You can either clone the official Aleph repository or prepare your own Aleph-based application:
```shell
# Option 1: Clone the official Aleph repository
git clone https://github.com/alephdata/aleph.git
cd aleph

# Option 2: If you have your own Aleph fork or custom instance,
# copy your Aleph source code to the project directory
```

Step 3: Create the Dockerfile
Create a Dockerfile in your project root directory. This will define your Aleph container configuration:
```dockerfile
FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    libpq-dev \
    libffi-dev \
    libssl-dev \
    libmagic1 \
    poppler-utils \
    tesseract-ocr \
    libreoffice \
    imagemagick \
    ffmpeg \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Set environment variables
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1

# Copy requirements file
COPY requirements.txt* ./

# Install Python dependencies
RUN pip install --upgrade pip && \
    if [ -f requirements.txt ]; then pip install -r requirements.txt; fi

# Copy application code
COPY . .

# Create directories for persistent data
RUN mkdir -p /var/lib/aleph/data \
    /var/lib/aleph/archive \
    /var/lib/aleph/temp \
    /var/lib/aleph/logs && \
    chmod -R 755 /var/lib/aleph

# Collect static files (if using Django)
RUN python manage.py collectstatic --noinput || true

# Expose port
EXPOSE 8080

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8080/api/2/status || exit 1

# Start the application
CMD ["python", "manage.py", "runserver", "0.0.0.0:8080"]
```

Note: This Dockerfile uses Python 3.11 with the system dependencies Aleph needs for document processing, including Tesseract OCR, LibreOffice, and ImageMagick. Aleph listens on port 8080 by default, which will be your internal port in Klutch.sh. The `runserver` command is convenient for getting started, but it is not intended for production traffic; once things work, consider switching the CMD to Gunicorn (already included in requirements.txt), pointing it at your project's WSGI module.
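Because document processing fails in confusing ways when a system tool is missing, a small startup check can confirm the binaries installed by the Dockerfile are actually on the PATH. This is an illustrative sketch, not part of Aleph itself; the binary names correspond to the packages installed above.

```python
import shutil

# Binaries used by Aleph-style document processing (installed by the Dockerfile above).
# "soffice" is LibreOffice, "convert" is ImageMagick, "pdftotext" comes from poppler-utils.
REQUIRED_BINARIES = ["tesseract", "soffice", "convert", "ffmpeg", "pdftotext"]

def missing_binaries(names):
    """Return the subset of `names` not found on the container's PATH."""
    return [name for name in names if shutil.which(name) is None]

# Example: log a warning at startup instead of failing later mid-processing.
missing = missing_binaries(REQUIRED_BINARIES)
if missing:
    print(f"warning: document-processing tools not found: {', '.join(missing)}")
```

Running this early in your container entrypoint surfaces a broken image immediately rather than on the first document upload.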
Step 4: Create Requirements File
Create a requirements.txt file with Aleph’s Python dependencies:
```text
# Core dependencies
Django>=4.2,<5.0
django-environ>=0.11.0
psycopg2-binary>=2.9.0
celery>=5.3.0
redis>=5.0.0

# Document processing
python-magic>=0.4.27
pdfminer.six>=20221105
Pillow>=10.0.0
openpyxl>=3.1.0
python-docx>=1.0.0

# Search and indexing
elasticsearch>=8.0.0
whoosh>=2.7.4

# Entity extraction
spacy>=3.6.0
nltk>=3.8.0

# API and web
djangorestframework>=3.14.0
django-cors-headers>=4.2.0
gunicorn>=21.2.0

# Utilities
python-dateutil>=2.8.2
requests>=2.31.0
```

Step 5: Create Configuration Files
Create a production configuration file. Here’s a basic example:
aleph/settings/production.py.example:
```python
# Copy this to aleph/settings/production.py and fill in your values
# DO NOT commit production.py to version control

import os
import environ

env = environ.Env(
    DEBUG=(bool, False)
)

# Build paths
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

# Security settings
SECRET_KEY = env('SECRET_KEY', default='change-this-in-production')
DEBUG = env('DEBUG', default=False)
ALLOWED_HOSTS = env.list('ALLOWED_HOSTS', default=['example-app.klutch.sh'])

# Database configuration
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': env('DB_NAME', default='aleph'),
        'USER': env('DB_USER', default='aleph'),
        'PASSWORD': env('DB_PASSWORD', default=''),
        'HOST': env('DB_HOST', default='localhost'),
        'PORT': env('DB_PORT', default='5432'),
        'OPTIONS': {
            'sslmode': env('DB_SSL', default='prefer'),
        },
    }
}

# Static files
STATIC_URL = '/static/'
STATIC_ROOT = '/var/lib/aleph/static'

# Media files
MEDIA_URL = '/media/'
MEDIA_ROOT = '/var/lib/aleph/data'

# Archive storage
ARCHIVE_ROOT = '/var/lib/aleph/archive'
ARCHIVE_TYPE = 'file'

# Celery configuration
CELERY_BROKER_URL = env('CELERY_BROKER_URL', default='redis://localhost:6379/0')
CELERY_RESULT_BACKEND = env('CELERY_RESULT_BACKEND', default='redis://localhost:6379/0')

# Elasticsearch configuration
ELASTICSEARCH_URL = env('ELASTICSEARCH_URL', default='http://localhost:9200')
ELASTICSEARCH_INDEX = env('ELASTICSEARCH_INDEX', default='aleph')

# Application settings
ALEPH_APP_TITLE = env('ALEPH_APP_TITLE', default='Aleph')
ALEPH_APP_LOGO = env('ALEPH_APP_LOGO', default='')
ALEPH_UI_URL = env('ALEPH_UI_URL', default='https://example-app.klutch.sh')

# Email configuration
EMAIL_BACKEND = 'django.core.mail.backends.smtp.EmailBackend'
EMAIL_HOST = env('SMTP_HOST', default='')
EMAIL_PORT = env.int('SMTP_PORT', default=587)
EMAIL_USE_TLS = env.bool('SMTP_USE_TLS', default=True)
EMAIL_HOST_USER = env('SMTP_USER', default='')
EMAIL_HOST_PASSWORD = env('SMTP_PASSWORD', default='')
DEFAULT_FROM_EMAIL = env('DEFAULT_FROM_EMAIL', default='noreply@example.com')

# Logging
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'file': {
            'level': 'INFO',
            'class': 'logging.FileHandler',
            'filename': '/var/lib/aleph/logs/aleph.log',
        },
    },
    'root': {
        'handlers': ['file'],
        'level': 'INFO',
    },
}
```

Step 6: Create Environment Configuration Template
Create a .env.example file with required environment variables:
```shell
# Security
SECRET_KEY=your-secret-key-here
DEBUG=false
ALLOWED_HOSTS=example-app.klutch.sh

# Database Configuration
DB_HOST=your-postgresql-host
DB_PORT=5432
DB_NAME=aleph
DB_USER=aleph
DB_PASSWORD=your-secure-password
DB_SSL=prefer

# Application Configuration
ALEPH_APP_TITLE=Aleph
ALEPH_UI_URL=https://example-app.klutch.sh

# Celery Configuration (Redis)
CELERY_BROKER_URL=redis://your-redis-host:6379/0
CELERY_RESULT_BACKEND=redis://your-redis-host:6379/0

# Elasticsearch Configuration (Optional)
ELASTICSEARCH_URL=http://your-elasticsearch-host:9200
ELASTICSEARCH_INDEX=aleph

# Email Configuration (Optional)
SMTP_HOST=smtp.example.com
SMTP_PORT=587
SMTP_USE_TLS=true
SMTP_USER=your-smtp-username
SMTP_PASSWORD=your-smtp-password
DEFAULT_FROM_EMAIL=noreply@example.com

# Storage Paths
MEDIA_ROOT=/var/lib/aleph/data
ARCHIVE_ROOT=/var/lib/aleph/archive
STATIC_ROOT=/var/lib/aleph/static

# Timezone
TZ=UTC
```

Step 7: Create Database Initialization Script
Create a script to initialize the database schema:
scripts/init_db.sh:
```shell
#!/bin/bash
set -e

echo "Initializing Aleph database..."

# Wait for PostgreSQL to be ready
until PGPASSWORD=$DB_PASSWORD psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d postgres -c '\q'; do
  >&2 echo "PostgreSQL is unavailable - sleeping"
  sleep 1
done

echo "PostgreSQL is ready"

# Create database if it doesn't exist
PGPASSWORD=$DB_PASSWORD psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d postgres <<-EOSQL
SELECT 'CREATE DATABASE $DB_NAME'
WHERE NOT EXISTS (SELECT FROM pg_database WHERE datname = '$DB_NAME')\gexec
EOSQL

# Run migrations
echo "Running database migrations..."
python manage.py migrate

# Create superuser (optional, can be done via web interface)
if [ -n "$ALEPH_SUPERUSER_EMAIL" ] && [ -n "$ALEPH_SUPERUSER_PASSWORD" ]; then
  echo "Creating superuser..."
  python manage.py createsuperuser --noinput --email "$ALEPH_SUPERUSER_EMAIL" || true
fi

echo "Database initialization complete"
```

Step 8: Create .dockerignore File
Create a .dockerignore file to exclude unnecessary files from the Docker build:
```text
.git
.gitignore
.dockerignore
.env
.env.local
*.md
docker-compose.yml
docker-compose.*.yml
Dockerfile
__pycache__
*.pyc
*.pyo
*.pyd
.Python
venv/
env/
.venv
```

Step 9: Test Locally (Optional)
Before deploying to Klutch.sh, you can test your Aleph setup locally:
```shell
# Build the Docker image
docker build -t my-aleph .

# Run the container (assuming you have PostgreSQL and Redis running)
docker run -d \
  --name aleph-test \
  -p 8080:8080 \
  -e DB_HOST=host.docker.internal \
  -e DB_PORT=5432 \
  -e DB_NAME=aleph \
  -e DB_USER=aleph \
  -e DB_PASSWORD=password \
  -e SECRET_KEY=$(openssl rand -base64 32) \
  -e CELERY_BROKER_URL=redis://host.docker.internal:6379/0 \
  -e CELERY_RESULT_BACKEND=redis://host.docker.internal:6379/0 \
  -v $(pwd)/data:/var/lib/aleph/data \
  -v $(pwd)/archive:/var/lib/aleph/archive \
  my-aleph

# Check if the application is running
curl http://localhost:8080/api/2/status
```

Note: For local development with a database and Redis, you can use Docker Compose to run all services together. Docker Compose is only for local development; Klutch.sh does not support Docker Compose for deployment.
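Rather than eyeballing the curl output, you can poll the status endpoint until the container reports healthy. Below is a minimal retry sketch; the probe is an injectable callable, so the logic works with any health check (the requests-based probe in the comment is one assumed way to wire it up).

```python
import time

def wait_for_service(probe, retries=30, delay=1.0):
    """Call `probe` (a no-arg callable returning True when the service is up)
    until it succeeds or `retries` attempts are exhausted."""
    for _ in range(retries):
        if probe():
            return True
        time.sleep(delay)
    return False

# Usage against a running container (requires the `requests` package):
#   import requests
#   def probe():
#       try:
#           return requests.get("http://localhost:8080/api/2/status", timeout=2).ok
#       except requests.ConnectionError:
#           return False
#   wait_for_service(probe)
```

The same helper is handy in CI, where the container needs a moment to finish migrations before tests run.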
Step 10: Push to GitHub
Commit your Aleph project files to your GitHub repository:
```shell
git add .
git commit -m "Initial Aleph Docker setup for Klutch.sh"
git branch -M main
git remote add origin https://github.com/yourusername/aleph-klutch.git
git push -u origin main
```

Deploying to Klutch.sh
Now that your Aleph project is ready and pushed to GitHub, follow these steps to deploy it on Klutch.sh with persistent storage.
Deployment Steps
1. Log in to Klutch.sh

   Navigate to klutch.sh/app and sign in to your account.

2. Create a New Project

   Go to Create Project and give your project a meaningful name (e.g., “Aleph Document Management”).

3. Create a New App

   Navigate to Create App and configure the following settings.

4. Select Your Repository

   - Choose GitHub as your Git source
   - Select the repository containing your Dockerfile
   - Choose the branch you want to deploy (usually main or master)

   Klutch.sh will automatically detect the Dockerfile in your repository root and use it for deployment.
5. Configure Traffic Type

   - Traffic Type: Select HTTP (Aleph is a web application)
   - Internal Port: Set to 8080 (the port your Aleph container listens on, as defined in your Dockerfile)
6. Set Environment Variables

   Add the following environment variables for your Aleph configuration:

   Security Configuration:

   - SECRET_KEY: Generate using `openssl rand -base64 32` (a long random string)
   - DEBUG: Set to false for production
   - ALLOWED_HOSTS: Your Klutch.sh app URL (e.g., example-app.klutch.sh)

   Database Configuration:

   - DB_HOST: Your database host (if using a Klutch.sh PostgreSQL app, use the app URL like example-db.klutch.sh)
   - DB_PORT: Database port (for Klutch.sh TCP apps, use 8000 externally, but the internal port in your database app should be 5432 for PostgreSQL)
   - DB_NAME: Your database name (e.g., aleph)
   - DB_USER: Database username
   - DB_PASSWORD: Database password
   - DB_SSL: Set to prefer or require for secure connections

   Application Configuration:

   - ALEPH_APP_TITLE: Your application title (e.g., My Aleph Instance)
   - ALEPH_UI_URL: Your Klutch.sh app URL (e.g., https://example-app.klutch.sh)

   Celery Configuration (if using Redis):

   - CELERY_BROKER_URL: Redis connection URL (e.g., redis://your-redis-host:6379/0)
   - CELERY_RESULT_BACKEND: Redis connection URL for results

   Optional - Elasticsearch Configuration:

   - ELASTICSEARCH_URL: Elasticsearch server URL
   - ELASTICSEARCH_INDEX: Index name (default: aleph)

   Optional - Email Configuration:

   - SMTP_HOST: Your SMTP server hostname
   - SMTP_PORT: SMTP port (typically 587)
   - SMTP_USE_TLS: Set to true
   - SMTP_USER: SMTP username
   - SMTP_PASSWORD: SMTP password
   - DEFAULT_FROM_EMAIL: Default sender email address

   Storage Paths:

   - MEDIA_ROOT: Set to /var/lib/aleph/data
   - ARCHIVE_ROOT: Set to /var/lib/aleph/archive
   - STATIC_ROOT: Set to /var/lib/aleph/static

   Timezone:

   - TZ: Your timezone (e.g., UTC or America/New_York)
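If you prefer not to use openssl, Python's secrets module generates an equally strong SECRET_KEY. The sketch below also shows a simple pre-flight check that the required variables are set; the variable names follow this guide's .env.example, and the helper itself is illustrative, not part of Aleph.

```python
import os
import secrets

def generate_secret_key(num_bytes=32):
    """Return a URL-safe random string with `num_bytes` of entropy,
    suitable for the SECRET_KEY environment variable."""
    return secrets.token_urlsafe(num_bytes)

# Variables this guide treats as mandatory (see .env.example above).
REQUIRED_VARS = ["SECRET_KEY", "DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD"]

def missing_env(environ=os.environ):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

# Print a fresh key to paste into the Klutch.sh environment settings.
print(generate_secret_key())
```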
7. Attach Persistent Volumes

   Aleph requires persistent storage for several directories to ensure data persists across deployments:

   Data Volume:

   - Mount Path: /var/lib/aleph/data
   - Size: Start with 50GB minimum (100GB+ recommended for production document collections)

   This volume stores:

   - Uploaded documents and files
   - Processed document data
   - User uploads and media files

   Archive Volume:

   - Mount Path: /var/lib/aleph/archive
   - Size: Start with 50GB minimum (200GB+ recommended for large document archives)

   This volume stores:

   - Archived documents
   - Backup files
   - Long-term storage

   Static Files Volume (Optional):

   - Mount Path: /var/lib/aleph/static
   - Size: 5GB (for static assets and compiled frontend files)

   Logs Volume (Optional):

   - Mount Path: /var/lib/aleph/logs
   - Size: 10GB (for application logs)

   Note: For production instances with large document collections, allocate sufficient storage. Document processing can generate significant data, so plan storage capacity accordingly.
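Once the volumes are attached, a few lines of standard-library Python can report how full each mount is, so you can grow storage before document processing fills a volume. This is a sketch; the paths match the mount paths listed above.

```python
import shutil

# Mount paths from the volume configuration above.
MOUNT_PATHS = [
    "/var/lib/aleph/data",
    "/var/lib/aleph/archive",
    "/var/lib/aleph/static",
    "/var/lib/aleph/logs",
]

def usage_percent(path):
    """Return used space for `path` as a percentage of its capacity."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

# Example: warn when any volume passes 80% full.
for path in MOUNT_PATHS:
    try:
        pct = usage_percent(path)
    except FileNotFoundError:
        continue  # volume not mounted in this environment
    if pct >= 80.0:
        print(f"warning: {path} is {pct:.0f}% full")
```

Run it from cron or a scheduled Celery task and route the warnings to your logs.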
8. Configure Additional Settings

   - Region: Select the region closest to your users for optimal performance
   - Compute Resources: Aleph can be resource-intensive, especially during document processing. Allocate at least:
     - CPU: 4+ cores recommended for document processing
     - Memory: 4GB minimum (8GB+ recommended for production workloads with document processing)
   - Instances: Start with 1 instance (you can scale horizontally later if needed)
9. Deploy Your Application

   Click “Create” to start the deployment. Klutch.sh will:

   - Automatically detect your Dockerfile in the repository root
   - Build the Docker image
   - Attach the persistent volume(s)
   - Start your Aleph container
   - Assign a URL for external access

   Note: The first deployment may take several minutes as it builds the Docker image, installs dependencies, and sets up the application.
10. Initialize Database

    After deployment, initialize your Aleph database: connect to your PostgreSQL database and run the migrations, either by executing commands inside your container or with a database migration tool.

11. Create Admin User

    Once the database is initialized, create your first admin user, either through the web interface after accessing your instance or with Django management commands if available in your deployment.

12. Access Your Application

    Once deployment is complete, you’ll receive a URL like example-app.klutch.sh. Visit this URL to access your Aleph instance and complete the setup.
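The database-initialization step needs a working connection string. The helper below assembles a PostgreSQL URL from the same DB_* variables used throughout this guide, percent-encoding credentials so passwords with special characters survive; it is a convenience sketch, not something Aleph requires.

```python
from urllib.parse import quote

def build_postgres_url(env):
    """Assemble a PostgreSQL connection URL from the DB_* variables
    used throughout this guide (`env` is dict-like, e.g. os.environ)."""
    user = quote(env["DB_USER"], safe="")
    password = quote(env["DB_PASSWORD"], safe="")
    host = env["DB_HOST"]
    port = env.get("DB_PORT", "5432")
    name = env["DB_NAME"]
    sslmode = env.get("DB_SSL", "prefer")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}?sslmode={sslmode}"

# Example:
print(build_postgres_url({
    "DB_USER": "aleph",
    "DB_PASSWORD": "p@ss word",
    "DB_HOST": "example-db.klutch.sh",
    "DB_NAME": "aleph",
}))
# postgresql://aleph:p%40ss%20word@example-db.klutch.sh:5432/aleph?sslmode=prefer
```

The resulting URL can be passed to psql or any migration tool you run inside the container.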
Sample Code: Getting Started with Aleph
Here are some examples to help you interact with your Aleph instance:
Example 1: JavaScript Client - Fetching API Status
```javascript
// Frontend JavaScript example for Aleph API
async function getApiStatus() {
  try {
    const response = await fetch('https://example-app.klutch.sh/api/2/status', {
      method: 'GET',
      headers: {
        'Content-Type': 'application/json',
        'Accept': 'application/json'
      }
    });

    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }

    const status = await response.json();
    console.log('API Status:', status);
    return status;
  } catch (error) {
    console.error('Error fetching API status:', error);
    throw error;
  }
}
```

Example 2: Searching Documents
```javascript
async function searchDocuments(query, filters = {}) {
  try {
    const params = new URLSearchParams({
      q: query,
      ...filters
    });

    const response = await fetch(
      `https://example-app.klutch.sh/api/2/entities?${params}`,
      {
        method: 'GET',
        headers: {
          'Content-Type': 'application/json',
          'Accept': 'application/json'
        }
      }
    );

    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }

    const results = await response.json();
    console.log('Search Results:', results);
    return results;
  } catch (error) {
    console.error('Error searching documents:', error);
    throw error;
  }
}

// Example usage
searchDocuments('corruption', { limit: 20, offset: 0 });
```

Example 3: Authenticated API Request
```javascript
async function getCollections(apiKey) {
  try {
    const response = await fetch('https://example-app.klutch.sh/api/2/collections', {
      method: 'GET',
      headers: {
        'Content-Type': 'application/json',
        'Accept': 'application/json',
        'Authorization': `ApiKey ${apiKey}`
      }
    });

    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }

    const collections = await response.json();
    return collections;
  } catch (error) {
    console.error('Error fetching collections:', error);
    throw error;
  }
}
```

Example 4: Python Client Example
```python
import requests


class AlephClient:
    def __init__(self, base_url, api_key=None):
        self.base_url = base_url
        self.api_key = api_key
        self.headers = {
            'Content-Type': 'application/json',
            'Accept': 'application/json'
        }
        if api_key:
            self.headers['Authorization'] = f'ApiKey {api_key}'

    def get_status(self):
        """Get API status"""
        response = requests.get(
            f'{self.base_url}/api/2/status',
            headers=self.headers
        )
        response.raise_for_status()
        return response.json()

    def search_entities(self, query, limit=20, offset=0):
        """Search for entities"""
        params = {'q': query, 'limit': limit, 'offset': offset}
        response = requests.get(
            f'{self.base_url}/api/2/entities',
            headers=self.headers,
            params=params
        )
        response.raise_for_status()
        return response.json()

    def get_collections(self):
        """Get all collections"""
        response = requests.get(
            f'{self.base_url}/api/2/collections',
            headers=self.headers
        )
        response.raise_for_status()
        return response.json()

    def upload_document(self, collection_id, file_path, metadata=None):
        """Upload a document to a collection"""
        data = {'collection_id': collection_id}
        if metadata:
            data.update(metadata)
        # Use a context manager so the file handle is always closed
        with open(file_path, 'rb') as fh:
            response = requests.post(
                f'{self.base_url}/api/2/documents',
                headers={'Authorization': self.headers['Authorization']},
                files={'file': fh},
                data=data
            )
        response.raise_for_status()
        return response.json()


# Example usage
client = AlephClient('https://example-app.klutch.sh', api_key='your-api-key')

# Get status
status = client.get_status()
print(f"API Version: {status.get('version')}")

# Search for entities
results = client.search_entities('corruption', limit=10)
print(f"Found {len(results.get('results', []))} entities")
```

Production Best Practices
Security Recommendations
- Enable HTTPS: Always use HTTPS in production (Klutch.sh provides TLS certificates)
- Secure Environment Variables: Store all sensitive credentials as environment variables in Klutch.sh
- Strong Secrets: Generate strong SECRET_KEY values using secure random generators
- Database Security: Use strong database passwords and enable SSL connections
- Access Control: Implement proper access control and permissions for sensitive documents
- API Security: Use API keys for programmatic access and rotate them regularly
- Input Validation: Always validate and sanitize user input
- File Upload Security: Implement file type validation and virus scanning for uploads
- Regular Updates: Keep Aleph and dependencies updated with security patches
- Backup Strategy: Regularly backup your database and document archives
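On the file-upload point: even a simple extension allow-list catches obvious mistakes before content-based checks (such as python-magic) run. The list below is an illustrative policy for this deployment, not Aleph's own rules.

```python
from pathlib import Path

# File types this deployment chooses to accept (illustrative, adjust to taste).
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".csv", ".txt", ".eml", ".png", ".jpg"}

def is_allowed_upload(filename):
    """First-pass check on the file extension; pair this with a
    content-based check (e.g. python-magic) before processing."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS
```

An extension check alone is not security, since extensions can lie; it simply rejects the clearly wrong uploads cheaply.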
Performance Optimization
- Database Optimization: Regularly optimize PostgreSQL database with VACUUM and ANALYZE
- Document Processing: Configure appropriate worker processes for document processing
- Caching: Implement caching strategies for frequently accessed data
- CDN Integration: Consider using a CDN for static assets
- Connection Pooling: Configure appropriate database connection pool sizes
- Resource Monitoring: Monitor CPU, memory, and storage usage
- Elasticsearch: Use Elasticsearch for better search performance with large document collections
- Background Processing: Use Celery workers for asynchronous document processing
Document Management Best Practices
- Storage Planning: Plan storage capacity based on expected document volume
- Backup Strategy: Implement regular backups of document archives
- Access Control: Establish clear access control policies
- Metadata Management: Maintain proper metadata for all documents
- Version Control: Consider version control for important documents
- Retention Policies: Implement document retention and archival policies
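A retention policy ultimately reduces to a date comparison. The sketch below flags documents past a configurable retention window; the field names are illustrative rather than Aleph's schema.

```python
from datetime import datetime, timedelta

def past_retention(uploaded_at, retention_days, now=None):
    """True if a document uploaded at `uploaded_at` has exceeded
    the retention window of `retention_days`."""
    now = now or datetime.utcnow()
    return now - uploaded_at > timedelta(days=retention_days)

# Example: roughly 7-year retention for an investigation archive
uploaded = datetime(2015, 6, 1)
expired = past_retention(uploaded, retention_days=7 * 365)
```

A periodic task can then move expired documents to the archive volume or delete them, per your policy.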
Monitoring and Maintenance
Monitor your Aleph application for:
- Application Logs: Check logs in Klutch.sh dashboard for errors
- Database Performance: Monitor query performance and slow queries
- Storage Usage: Monitor persistent volume usage and plan for growth
- Response Times: Track API response times
- Error Rates: Monitor 4xx and 5xx error rates
- Resource Usage: Track CPU and memory usage in Klutch.sh dashboard
- Document Processing: Monitor document processing queue and worker status
Regular maintenance tasks:
- Backup Database: Regularly backup your PostgreSQL database
- Backup Documents: Backup document archives from persistent volumes
- Update Dependencies: Keep Python dependencies updated
- Review Logs: Review application and error logs regularly
- Security Audits: Perform regular security audits
- Database Maintenance: Regularly run database maintenance tasks
- Storage Cleanup: Clean up temporary files and old processing artifacts
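The storage-cleanup task can be a few lines of standard-library Python, run periodically via cron or a Celery beat schedule. The path in the usage comment matches the temp directory created in the Dockerfile; the helper itself is a sketch.

```python
import time
from pathlib import Path

def cleanup_old_files(directory, max_age_days=7, now=None):
    """Delete regular files in `directory` older than `max_age_days`
    (by modification time); returns the paths that were removed."""
    now = now if now is not None else time.time()
    cutoff = now - max_age_days * 86400
    removed = []
    root = Path(directory)
    if not root.is_dir():
        return removed
    for path in root.iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(str(path))
    return removed

# Example (inside the container):
#   cleanup_old_files("/var/lib/aleph/temp", max_age_days=7)
```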
Troubleshooting
Application Not Loading
- Verify the app’s Traffic Type is HTTP
- Check that the internal port is set to 8080 and matches your Dockerfile
- Review build and runtime logs in the Klutch.sh dashboard
- Ensure the Python application starts correctly (check the CMD in Dockerfile)
- Verify all required environment variables are set
Database Connection Issues
- Verify database environment variables are set correctly
- For Klutch.sh PostgreSQL apps, use the app URL as the host and port 8000 externally
- Check that the database is accessible from your Aleph app
- Verify database credentials and permissions
- Ensure the database schema has been initialized with migrations
Document Processing Issues
- Ensure all required system dependencies are installed (Tesseract, LibreOffice, etc.)
- Check file permissions on data and archive directories
- Verify sufficient disk space in persistent volumes
- Review document processing logs for errors
- Check Celery worker configuration if using background processing
Search Issues
- Verify Elasticsearch configuration if using Elasticsearch
- Check search index status
- Rebuild search indexes if necessary
- Review search query logs
Performance Issues
- Review database query performance and add indexes if needed
- Check resource allocation in Klutch.sh (CPU and memory)
- Monitor document processing queue
- Review application logs for slow operations
- Consider implementing caching for frequently accessed data
- Optimize document processing settings
Data Not Persisting
- Ensure persistent volumes are correctly mounted
- Check file permissions on persistent volumes
- Verify the application is writing to the correct directories
- Ensure sufficient disk space in persistent volumes
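To confirm the application can actually write to its directories, a direct write test on each mount is more reliable than `os.access`, which can disagree with mount options such as read-only volumes. A small sketch:

```python
import uuid
from pathlib import Path

def is_writable(directory):
    """Try to create and delete a marker file in `directory`."""
    probe = Path(directory) / f".write-test-{uuid.uuid4().hex}"
    try:
        probe.write_text("ok")
        probe.unlink()
        return True
    except OSError:
        return False

# Example (inside the container):
#   for path in ["/var/lib/aleph/data", "/var/lib/aleph/archive"]:
#       print(path, "writable" if is_writable(path) else "NOT writable")
```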
Related Documentation
- Learn more about deploying applications on Klutch.sh in Deployments
- Understand traffic types, ports, and routing in Networking
- Explore how to work with storage in Volumes
- Browse the full platform documentation at Klutch.sh Documentation
- For Aleph-specific details, see the official Aleph GitHub Repository
- Learn about document management and investigation workflows
Conclusion
Deploying Aleph to Klutch.sh with a Dockerfile provides a scalable, reliable document management and investigation platform with persistent storage, automatic deployments, and production-ready configuration. By following this guide, you’ve set up a high-performance Aleph instance with proper data persistence, security configurations, and the ability to handle large document collections.
Aleph’s powerful document processing capabilities, entity extraction, and collaborative investigation tools make it an excellent choice for journalists, researchers, and investigators. Your application is now ready to collect, analyze, and share large collections of documents while maintaining security and performance.
Remember to follow the production best practices outlined in this guide, regularly monitor your application performance, and adjust resources as your document collection grows. With proper configuration, monitoring, and maintenance, Aleph on Klutch.sh will provide a reliable, secure foundation for your document management and investigation needs.