ArchiveBox is a powerful, open-source self-hosted web archiving solution that lets you preserve content from websites in multiple formats. Whether you’re saving bookmarks, preserving evidence for legal cases, backing up social media content, or archiving research papers, ArchiveBox captures web pages in durable, long-term formats like HTML, PDF, PNG, WARC, and more.
Unlike centralized archiving services, ArchiveBox gives you complete control over your data while supporting imports from browser history, bookmarks, RSS feeds, Pocket, Pinboard, and numerous other sources. With built-in support for Chrome/Chromium, wget, yt-dlp, and readability extractors, ArchiveBox ensures high-fidelity archives that remain accessible for decades.
This guide walks you through deploying ArchiveBox on Klutch.sh using Docker, complete with persistent storage for your archive data.
Multi-Format Archiving
Saves pages as HTML, PDF, PNG screenshots, WARC, singlefile HTML, and extracts article text automatically.
Privacy-First
Self-hosted solution means you own your data. Archive both public and private content while maintaining control.
Extensive Input Sources
Import from browser bookmarks, history, RSS feeds, Pocket, Pinboard, Reddit saved posts, and many more.
100% Open Source
MIT-licensed with active development, comprehensive documentation, and a rich community of contributors.
Before deploying ArchiveBox on Klutch.sh, ensure you have the following:
ArchiveBox uses a Django-powered web interface with SQLite for metadata storage. All archived content is saved to the filesystem in organized folders:
    ┌─────────────────────────────────────────────────────────────┐
    │                         ArchiveBox                          │
    │                         (Port 8000)                         │
    ├─────────────────────────────────────────────────────────────┤
    │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
    │  │   Django    │  │   SQLite    │  │   Archive Storage   │  │
    │  │   Web UI    │  │  Database   │  │   /data/archive/    │  │
    │  └─────────────┘  └─────────────┘  └─────────────────────┘  │
    ├─────────────────────────────────────────────────────────────┤
    │  Extractors: Chrome, wget, yt-dlp, readability, singlefile  │
    └─────────────────────────────────────────────────────────────┘

For each URL archived, ArchiveBox creates a snapshot folder containing:
| Format | Description |
|---|---|
| HTML | Original HTML with CSS/JS, plus singlefile HTML |
| PDF | Full-page PDF rendering |
| PNG | Full-page screenshot |
| WARC | Web ARChive format for archival purposes |
| TXT | Extracted article text via readability |
| JSON | Metadata and page information |
| Media | Videos, audio, and subtitles via yt-dlp |
| Git | Cloned repositories from GitHub/GitLab |
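To see which of these outputs a given snapshot actually contains, a small shell helper can check the folder for the expected entries. This is only a sketch: `list_snapshot_outputs` is a hypothetical name, and the file and folder names are assumptions taken from the snapshot layout shown later in this guide, so they may differ between ArchiveBox versions.

```shell
#!/bin/sh
# Hypothetical helper: report which expected outputs exist in a snapshot folder.
# The entry names are assumptions based on the snapshot layout in this guide.
list_snapshot_outputs() {
    snapshot_dir="$1"
    for name in index.html index.json screenshot.png output.pdf \
                singlefile.html readability warc wget media; do
        if [ -e "$snapshot_dir/$name" ]; then
            echo "present: $name"
        else
            echo "missing: $name"
        fi
    done
}
```

For example, `list_snapshot_outputs ./data/archive/1699876543` prints one `present:` or `missing:` line per expected output, which makes it easy to spot an extractor that was disabled or failed.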
Create a Dockerfile in your project root that uses the official ArchiveBox image:
    FROM archivebox/archivebox:latest

    # Set environment variables for initial setup
    ENV DATA_DIR=/data
    ENV ALLOWED_HOSTS=*

    # Expose the web interface port
    EXPOSE 8000

    # The base image handles initialization and server startup
    # Data is stored in /data which should be a persistent volume

Configure ArchiveBox behavior through environment variables in the Klutch.sh dashboard:
| Variable | Description | Default |
|---|---|---|
| ADMIN_USERNAME | Username for the admin account (set on first run) | None |
| ADMIN_PASSWORD | Password for the admin account (set on first run) | None |
| PUBLIC_INDEX | Allow viewing snapshot list without login | True |
| PUBLIC_SNAPSHOTS | Allow viewing snapshot content without login | True |
| PUBLIC_ADD_VIEW | Allow submitting URLs without login | False |
| Variable | Description | Default |
|---|---|---|
| TIMEOUT | Max download time per method (seconds) | 60 |
| MEDIA_TIMEOUT | Max download time for media files (seconds) | 3600 |
| CHECK_SSL_VALIDITY | Enforce HTTPS certificate validation | True |
| RESOLUTION | Screenshot resolution (width,height) | 1440,2000 |
| Variable | Description | Default |
|---|---|---|
| SAVE_TITLE | Extract page title | True |
| SAVE_FAVICON | Save site favicon | True |
| SAVE_WGET | Archive with wget | True |
| SAVE_WARC | Save WARC archive | True |
| SAVE_PDF | Generate PDF | True |
| SAVE_SCREENSHOT | Capture screenshot | True |
| SAVE_DOM | Save DOM dump | True |
| SAVE_SINGLEFILE | Create singlefile HTML | True |
| SAVE_READABILITY | Extract article text | True |
| SAVE_MEDIA | Download media with yt-dlp | True |
| SAVE_GIT | Clone git repositories | True |
| SAVE_ARCHIVE_DOT_ORG | Submit to Archive.org | True |
| Variable | Description | Default |
|---|---|---|
| PUID | User ID for file ownership | 911 |
| PGID | Group ID for file ownership | 911 |
| OUTPUT_PERMISSIONS | File permissions for output | 755 |
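Because a missing admin credential or a mistyped flag is easy to overlook in a dashboard, it can help to sanity-check the values in a local shell before deploying. The sketch below is illustrative and not part of ArchiveBox: `check_bool_value` and `check_archivebox_env` are hypothetical helpers, only a handful of the variables above are checked, and the check assumes the capitalized `True`/`False` spelling used in these tables.

```shell
#!/bin/sh
# Hypothetical pre-deploy check; not part of ArchiveBox itself.

# Succeeds when the value is empty (unset) or spelled True/False.
check_bool_value() {
    case "$1" in
        ""|True|False) return 0 ;;
        *) return 1 ;;
    esac
}

# Checks the admin credentials and a few boolean flags from the tables above.
check_archivebox_env() {
    status=0
    if [ -z "${ADMIN_USERNAME:-}" ] || [ -z "${ADMIN_PASSWORD:-}" ]; then
        echo "ADMIN_USERNAME and ADMIN_PASSWORD should both be set" >&2
        status=1
    fi
    for name in PUBLIC_INDEX PUBLIC_SNAPSHOTS PUBLIC_ADD_VIEW SAVE_MEDIA; do
        # Indirect lookup; $name only ever comes from the fixed list above.
        eval "value=\${$name:-}"
        if ! check_bool_value "$value"; then
            echo "$name should be True or False, got: $value" >&2
            status=1
        fi
    done
    return "$status"
}
```

Running `check_archivebox_env` after exporting the intended values returns a non-zero status and prints one message per problem, so it can gate a deploy script.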
Your ArchiveBox deployment repository should have the following structure:
    archivebox-deploy/
    ├── Dockerfile
    └── README.md

Push your repository to GitHub
Create a new repository on GitHub and push your Dockerfile:
    git init
    git add .
    git commit -m "Initial ArchiveBox deployment configuration"
    git branch -M main
    git remote add origin https://github.com/yourusername/archivebox-deploy.git
    git push -u origin main

Connect to Klutch.sh
Navigate to klutch.sh/app and sign in with your GitHub account. Click New Project to begin the deployment process.
Select your repository
Choose the GitHub repository containing your ArchiveBox Dockerfile. Klutch.sh will automatically detect the Dockerfile in your project root.
Configure environment variables
Add the following environment variables in the Klutch.sh dashboard:
- ADMIN_USERNAME - Your admin username (e.g., admin)
- ADMIN_PASSWORD - A strong password for the admin account
- PUBLIC_ADD_VIEW - Set to False to require login for adding URLs
- ALLOWED_HOSTS - Set to * or your specific domain

Set the internal port
Configure the internal port to 8000 where ArchiveBox serves its web interface.
Add persistent storage
ArchiveBox requires persistent storage for your archive data. Add a persistent volume:
| Mount Path | Size |
|---|---|
| /data | 10 GB+ |
Deploy your application
Click Deploy to start the deployment process. Klutch.sh will build the Docker image and deploy your ArchiveBox instance.
After deployment, you’ll need to initialize your archive and access the admin interface:
Access your ArchiveBox instance
Navigate to your deployed ArchiveBox URL (e.g., https://example-app.klutch.sh). You should see the ArchiveBox web interface.
Log in to the Admin Panel
Click Admin in the navigation and log in with the ADMIN_USERNAME and ADMIN_PASSWORD you configured in the environment variables.
Add your first URL
From the main interface, enter a URL in the input field and click Add. ArchiveBox will begin archiving the page using all enabled extractors.
The simplest way to add URLs is through the web UI:
Install the ArchiveBox Browser Extension to archive pages directly from Chrome or Firefox:
ArchiveBox can import URLs from many sources:
Use the --depth=1 flag (via the admin interface) to archive not just the URL but all links found within:
    https://example.com/blog/

This will archive the blog index page plus all linked articles.
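The same depth setting is also available from the ArchiveBox command line as `archivebox add --depth=1`. The wrapper below is a minimal sketch for setups with shell access, such as the local Docker Compose environment described in this guide; `archive_with_links` is a hypothetical name and the service name `archivebox` is assumed to match your Compose file.

```shell
#!/bin/sh
# Hypothetical wrapper: archive a page plus every page it links to.
# Assumes a Compose service named "archivebox" in the current directory.
archive_with_links() {
    url="$1"
    docker compose run --rm archivebox add --depth=1 "$url"
}
```

Calling `archive_with_links 'https://example.com/blog/'` would queue the index page and its linked articles in one step. Depth 1 can add many URLs at once, so it is worth trying on a small page first.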
ArchiveBox supports scheduled imports from RSS feeds and other sources. Configure scheduled tasks through the admin interface under Scheduled Tasks.
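Where shell access is available, for instance in the local Docker Compose environment described in this guide, a recurring import can also be registered with the `archivebox schedule` command. The wrapper below is a sketch under assumptions: `schedule_feed` is a hypothetical name, the feed URL is a placeholder you supply, and the jobs only fire if a scheduler process is running in your deployment, which this guide does not set up.

```shell
#!/bin/sh
# Hypothetical wrapper: register a recurring import of a feed URL.
# Assumes a Compose service named "archivebox" in the current directory.
schedule_feed() {
    feed_url="$1"
    interval="${2:-day}"   # e.g. hour, day, week, month
    docker compose run --rm archivebox schedule \
        --every="$interval" --depth=1 "$feed_url"
}
```

For example, `schedule_feed 'https://example.com/feed.rss' week` would ask ArchiveBox to re-import the feed weekly and archive each linked entry.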
ArchiveBox includes full-text search across your archived content using ripgrep. Search from the main interface to find content within your archives.
If certain extractors aren’t needed or are causing issues, disable them via environment variables:
    SAVE_MEDIA=False            # Disable yt-dlp media downloads
    SAVE_ARCHIVE_DOT_ORG=False  # Don't submit to Archive.org
    SAVE_GIT=False              # Don't clone git repositories

For sites that require authentication, you can configure cookies:
- Add a cookies.txt file to your ArchiveBox /data volume
- Set COOKIES_FILE=/data/cookies.txt in environment variables

If sites are blocking ArchiveBox, customize the user agent:
    WGET_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    CHROME_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

To use a custom domain with your ArchiveBox deployment:
Add your domain in Klutch.sh
Navigate to your project settings in the Klutch.sh dashboard and add your custom domain (e.g., archive.yourdomain.com).
Configure DNS records
Add a CNAME record pointing to your Klutch.sh deployment:
| Type | Name | Value |
|---|---|---|
| CNAME | archive | example-app.klutch.sh |
Update ALLOWED_HOSTS
Update the ALLOWED_HOSTS environment variable to include your custom domain:
    ALLOWED_HOSTS=archive.yourdomain.com,example-app.klutch.sh

Wait for SSL provisioning
Klutch.sh automatically provisions SSL certificates for custom domains. This may take a few minutes.
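A common mistake after adding a custom domain is forgetting to list it in ALLOWED_HOSTS, in which case the web interface rejects requests for that hostname. The helper below is a simplified sketch of the check, not Django's actual matching logic: `host_allowed` is a hypothetical name, and it handles only exact entries and a lone `*`, ignoring spaces and leading-dot subdomain patterns.

```shell
#!/bin/sh
# Hypothetical helper: is a hostname covered by a comma-separated
# ALLOWED_HOSTS value? Handles exact entries and "*" only.
host_allowed() {
    host="$1"
    allowed="$2"
    # Wrap the list in commas so every entry is delimited on both sides.
    case ",$allowed," in
        *",*,"*) return 0 ;;        # a "*" entry allows every host
        *",$host,"*) return 0 ;;    # exact match on one entry
        *) return 1 ;;
    esac
}
```

For instance, `host_allowed archive.yourdomain.com "$ALLOWED_HOSTS"` succeeds only if the domain is listed or the value contains `*`.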
For local development and testing, use Docker Compose:
    services:
      archivebox:
        image: archivebox/archivebox:latest
        restart: unless-stopped
        ports:
          - "8000:8000"
        volumes:
          - ./data:/data
        environment:
          - ADMIN_USERNAME=admin
          - ADMIN_PASSWORD=your-secure-password
          - ALLOWED_HOSTS=*
          - PUBLIC_INDEX=True
          - PUBLIC_SNAPSHOTS=True
          - PUBLIC_ADD_VIEW=False
          - PUID=1000
          - PGID=1000

    volumes:
      data:

Start the local environment:
    mkdir -p data
    docker compose up -d

Initialize and create admin user:
    docker compose run archivebox init --setup
    docker compose run archivebox manage createsuperuser

Access ArchiveBox at http://localhost:8000.
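The container can take a little while to start serving requests, so a small polling loop is a convenient way to confirm the web interface is up before opening the browser. This is a sketch rather than an official health check: `wait_for_archivebox` is a hypothetical name, and the attempt count and two-second delay are arbitrary choices.

```shell
#!/bin/sh
# Hypothetical helper: poll the web UI until it answers or attempts run out.
wait_for_archivebox() {
    url="${1:-http://localhost:8000}"
    attempts="${2:-30}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl fail on HTTP errors (status 400 and above).
        if curl -fsS -o /dev/null "$url"; then
            echo "ArchiveBox is responding at $url"
            return 0
        fi
        i=$((i + 1))
        sleep 2
    done
    echo "ArchiveBox did not respond at $url after $attempts attempts" >&2
    return 1
}
```

Run `wait_for_archivebox` with no arguments for the local setup, or pass a deployed URL as the first argument.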
Browse your archived snapshots through the main web interface. Each snapshot shows:
Archives are stored in /data/archive/<timestamp>/ folders. Each snapshot contains:
    /data/archive/1699876543/
    ├── index.html          # Archive index page
    ├── index.json          # Metadata
    ├── screenshot.png      # Full-page screenshot
    ├── output.pdf          # PDF rendering
    ├── singlefile.html     # Self-contained HTML
    ├── readability/        # Extracted article text
    ├── warc/               # WARC archive files
    ├── wget/               # wget mirror
    │   └── example.com/
    └── media/              # Downloaded media files

If you see Chrome-related errors, the container may need more resources. Ensure your Klutch.sh deployment has adequate memory allocated.
If archiving fails with permission errors, verify the PUID and PGID environment variables match the container’s expected user.
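With a bind-mounted data folder, as in the local Docker Compose setup, one way to confirm a mismatch is to compare the folder's numeric owner with the PUID you configured. The helper below is a minimal sketch: `check_data_owner` is a hypothetical name, and it checks only the user ID, although the same approach applies to PGID using the fourth column of the `ls` output.

```shell
#!/bin/sh
# Hypothetical helper: compare a directory's numeric owner with an expected UID.
check_data_owner() {
    data_dir="$1"
    want_uid="$2"
    # ls -ldn prints the numeric owner ID in the third column.
    actual_uid=$(ls -ldn "$data_dir" | awk '{print $3}')
    if [ "$actual_uid" = "$want_uid" ]; then
        echo "ok: $data_dir is owned by UID $actual_uid"
    else
        echo "mismatch: $data_dir is owned by UID $actual_uid, expected $want_uid" >&2
        return 1
    fi
}
```

For the Compose file in this guide, `check_data_owner ./data 1000` should succeed; if it reports a mismatch, either change PUID to the reported value or change the folder's ownership.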
For very large archives, consider:
- Disabling media downloads (SAVE_MEDIA=False) to save space
- /data directory

Many sites actively block bots. Try:
- Setting CHECK_SSL_VALIDITY=False for sites with certificate issues

After deploying ArchiveBox, consider:
/data volume