Deploying ArchiveBox

ArchiveBox is a powerful, open-source self-hosted web archiving solution that lets you preserve content from websites in multiple formats. Whether you’re saving bookmarks, preserving evidence for legal cases, backing up social media content, or archiving research papers, ArchiveBox captures web pages in durable, long-term formats like HTML, PDF, PNG, WARC, and more.

Unlike centralized archiving services, ArchiveBox gives you complete control over your data while supporting imports from browser history, bookmarks, RSS feeds, Pocket, Pinboard, and numerous other sources. It drives Chrome/Chromium, wget, yt-dlp, and readability extractors to produce high-fidelity archives in standard formats that are intended to remain readable for decades.

This guide walks you through deploying ArchiveBox on Klutch.sh using Docker, complete with persistent storage for your archive data.

Why Choose ArchiveBox?

Multi-Format Archiving

Saves pages as HTML, PDF, PNG screenshots, WARC, singlefile HTML, and extracts article text automatically.

Privacy-First

Because ArchiveBox is self-hosted, you own your data. Archive both public and private content while maintaining control.

Extensive Input Sources

Import from browser bookmarks, history, RSS feeds, Pocket, Pinboard, Reddit saved posts, and many more.

100% Open Source

MIT-licensed with active development, comprehensive documentation, and a rich community of contributors.

Prerequisites

Before deploying ArchiveBox on Klutch.sh, ensure you have the following:

A GitHub account for repository hosting
A Klutch.sh account for deployment

Architecture Overview

ArchiveBox uses a Django-powered web interface with SQLite for metadata storage. All archived content is saved to the filesystem in organized folders:

┌─────────────────────────────────────────────────────────────┐
│                      ArchiveBox                              │
│                      (Port 8000)                             │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │   Django    │  │   SQLite    │  │   Archive Storage   │  │
│  │   Web UI    │  │   Database  │  │   /data/archive/    │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
├─────────────────────────────────────────────────────────────┤
│  Extractors: Chrome, wget, yt-dlp, readability, singlefile  │
└─────────────────────────────────────────────────────────────┘

Output Formats

For each URL archived, ArchiveBox creates a snapshot folder containing:

Format	Description
HTML	Original HTML with CSS/JS, plus singlefile HTML
PDF	Full-page PDF rendering
PNG	Full-page screenshot
WARC	Web ARChive file recording the raw HTTP requests and responses
TXT	Extracted article text via readability
JSON	Metadata and page information
Media	Videos, audio, and subtitles via yt-dlp
Git	Cloned repositories from GitHub/GitLab

Creating the Dockerfile

Create a Dockerfile in your project root that uses the official ArchiveBox image:

FROM archivebox/archivebox:latest

# Set environment variables for initial setup
ENV DATA_DIR=/data
ENV ALLOWED_HOSTS=*

# Expose the web interface port
EXPOSE 8000

# The base image handles initialization and server startup
# Data is stored in /data which should be a persistent volume

Environment Variables

Configure ArchiveBox behavior through environment variables in the Klutch.sh dashboard:

Authentication & Access

Variable	Description	Default
`ADMIN_USERNAME`	Username for the admin account (set on first run)	None
`ADMIN_PASSWORD`	Password for the admin account (set on first run)	None
`PUBLIC_INDEX`	Allow viewing snapshot list without login	`True`
`PUBLIC_SNAPSHOTS`	Allow viewing snapshot content without login	`True`
`PUBLIC_ADD_VIEW`	Allow submitting URLs without login	`False`

Archive Settings

Variable	Description	Default
`TIMEOUT`	Max download time per method (seconds)	`60`
`MEDIA_TIMEOUT`	Max download time for media files (seconds)	`3600`
`CHECK_SSL_VALIDITY`	Enforce HTTPS certificate validation	`True`
`RESOLUTION`	Screenshot resolution (width,height)	`1440,2000`

Archive Method Toggles

Variable	Description	Default
`SAVE_TITLE`	Extract page title	`True`
`SAVE_FAVICON`	Save site favicon	`True`
`SAVE_WGET`	Archive with wget	`True`
`SAVE_WARC`	Save WARC archive	`True`
`SAVE_PDF`	Generate PDF	`True`
`SAVE_SCREENSHOT`	Capture screenshot	`True`
`SAVE_DOM`	Save DOM dump	`True`
`SAVE_SINGLEFILE`	Create singlefile HTML	`True`
`SAVE_READABILITY`	Extract article text	`True`
`SAVE_MEDIA`	Download media with yt-dlp	`True`
`SAVE_GIT`	Clone git repositories	`True`
`SAVE_ARCHIVE_DOT_ORG`	Submit to Archive.org	`True`

User & Permissions (Docker)

Variable	Description	Default
`PUID`	User ID for file ownership	`911`
`PGID`	Group ID for file ownership	`911`
`OUTPUT_PERMISSIONS`	File permissions for output	`644`

Project Structure

Your ArchiveBox deployment repository should have the following structure:

archivebox-deploy/
├── Dockerfile
└── README.md

Deploying to Klutch.sh

Push your repository to GitHub

Create a new repository on GitHub and push your Dockerfile:

git init
git add .
git commit -m "Initial ArchiveBox deployment configuration"
git branch -M main
git remote add origin https://github.com/yourusername/archivebox-deploy.git
git push -u origin main

Connect to Klutch.sh

Navigate to klutch.sh/app and sign in with your GitHub account. Click New Project to begin the deployment process.

Select your repository

Choose the GitHub repository containing your ArchiveBox Dockerfile. Klutch.sh will automatically detect the Dockerfile in your project root.

Configure environment variables

Add the following environment variables in the Klutch.sh dashboard:

- ADMIN_USERNAME - Your admin username (e.g., admin)
- ADMIN_PASSWORD - A strong password for the admin account
- PUBLIC_ADD_VIEW - Set to False to require login for adding URLs
- ALLOWED_HOSTS - Set to * or your specific domain

Set the internal port

Configure the internal port to 8000 where ArchiveBox serves its web interface.

Add persistent storage

ArchiveBox requires persistent storage for your archive data. Add a persistent volume:

Mount Path	Size
`/data`	10 GB+

ArchiveBox can use significant disk space depending on what you archive. Media downloads (videos, audio) can quickly consume storage. Start with at least 10 GB and increase as needed.

Deploy your application

Click Deploy to start the deployment process. Klutch.sh will build the Docker image and deploy your ArchiveBox instance.

First-Time Setup

After deployment, the base image initializes the archive on first start. Complete the setup through the web interface:

Access your ArchiveBox instance

Navigate to your deployed ArchiveBox URL (e.g., https://example-app.klutch.sh). You should see the ArchiveBox web interface.

Log in to the Admin Panel

Click Admin in the navigation and log in with the ADMIN_USERNAME and ADMIN_PASSWORD you configured in the environment variables.

Add your first URL

From the main interface, enter a URL in the input field and click Add. ArchiveBox will begin archiving the page using all enabled extractors.

Adding URLs to Your Archive

Via the Web Interface

The simplest way to add URLs is through the web UI:

Navigate to your ArchiveBox instance
Enter a URL in the “Add new URLs” input field
Click Add to start archiving

Via the Browser Extension

Install the ArchiveBox Browser Extension to archive pages directly from Chrome or Firefox:

Install the extension from your browser’s extension store
Configure it to point to your ArchiveBox instance URL
Click the extension icon to archive the current page

Supported Input Sources

ArchiveBox can import URLs from many sources:

Browser Exports: Chrome, Firefox, Safari bookmarks and history
Bookmark Services: Pocket, Pinboard, Instapaper, Wallabag
Social Media: Reddit saved posts, Twitter bookmarks
RSS Feeds: Any RSS/Atom feed URL
Plain Text: Lists of URLs in any text file format
HTML Files: Pages containing links to archive

Archive Features

Recursive Archiving

Set the archive depth to 1 (the --depth=1 option of archivebox add, also selectable on the Add page of the web interface) to archive not just the URL but every page it links to:

https://example.com/blog/

This will archive the blog index page plus all linked articles.
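
The same recursive import can be run from the command line. The sketch below assumes the local Docker Compose setup from this guide, with a compose service named `archivebox`; the wrapper name `archive_with_links` is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper around the ArchiveBox CLI for a local Docker Compose setup.
# Assumes a compose service named "archivebox".
archive_with_links() {
    # --depth=1 archives the given URL plus every page it links to (one hop).
    # --rm removes the one-off container once the command finishes.
    docker compose run --rm archivebox add --depth=1 "$1"
}
```

For example, `archive_with_links 'https://example.com/blog/'` would queue the blog index and its linked articles. Depth 1 can add many snapshots at once, so it is worth checking available disk space first.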

Scheduled Archiving

ArchiveBox supports scheduled imports from RSS feeds and other sources. Configure scheduled tasks through the admin interface under Scheduled Tasks.
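
Scheduled imports can also be registered with the `archivebox schedule` command. The sketch below assumes the local Docker Compose setup with a service named `archivebox`; the wrapper name `schedule_feed` and the feed URL are hypothetical. Scheduled jobs only run while a scheduler process is active, so treat this as a starting point rather than a complete setup.

```shell
#!/bin/sh
# Hypothetical wrapper: register a recurring import of a feed URL.
# $1: feed URL
# $2: interval passed to --every (for example hour or day), defaults to day
schedule_feed() {
    # --depth=1 archives the pages linked from the feed, not just the feed itself.
    docker compose run --rm archivebox schedule --every="${2:-day}" --depth=1 "$1"
}
```

For example, `schedule_feed 'https://example.com/feed.xml' hour` would request an hourly import of that feed.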

Search

ArchiveBox includes full-text search across your archived content using ripgrep. Search from the main interface to find content within your archives.

Advanced Configuration

Disabling Extractors

If certain extractors aren’t needed or are causing issues, disable them via environment variables:

SAVE_MEDIA=False       # Disable yt-dlp media downloads
SAVE_ARCHIVE_DOT_ORG=False  # Don't submit to Archive.org
SAVE_GIT=False         # Don't clone git repositories

Cookies for Authenticated Sites

For sites that require authentication, you can configure cookies:

Export cookies from your browser using a cookie export extension
Upload the cookies.txt file to your ArchiveBox /data volume
Set COOKIES_FILE=/data/cookies.txt in environment variables

Custom User Agent

If sites are blocking ArchiveBox, customize the user agent:

WGET_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
CHROME_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

Custom Domain Setup

To use a custom domain with your ArchiveBox deployment:

Add your domain in Klutch.sh

Navigate to your project settings in the Klutch.sh dashboard and add your custom domain (e.g., archive.yourdomain.com).

Configure DNS records

Add a CNAME record pointing to your Klutch.sh deployment:

Type	Name	Value
CNAME	archive	example-app.klutch.sh

Update ALLOWED_HOSTS

Update the ALLOWED_HOSTS environment variable to include your custom domain:

```
ALLOWED_HOSTS=archive.yourdomain.com,example-app.klutch.sh
```

Wait for SSL provisioning

Klutch.sh automatically provisions SSL certificates for custom domains. This may take a few minutes.
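
Before waiting on certificate provisioning, it can be useful to confirm that the CNAME record has propagated. The following is a minimal sketch using `dig`; the function name `cname_points_to` is hypothetical, and the domain names are the placeholders used in this section.

```shell
#!/bin/sh
# Hypothetical check: does NAME have a CNAME record pointing at TARGET?
# $1: the custom domain, $2: the expected CNAME target
cname_points_to() {
    # dig +short prints only the record value, usually with a trailing dot,
    # which sed strips before the comparison.
    actual=$(dig +short CNAME "$1" | sed 's/\.$//')
    [ "$actual" = "$2" ]
}
```

A possible usage is `cname_points_to archive.yourdomain.com example-app.klutch.sh && echo "DNS is ready"`. DNS changes can take a while to propagate, so a failed check shortly after editing the record is not necessarily an error.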

Local Development with Docker Compose

For local development and testing, use Docker Compose:

services:
  archivebox:
    image: archivebox/archivebox:latest
    restart: unless-stopped
    ports:
      - "8000:8000"
    volumes:
      - ./data:/data
    environment:
      - ADMIN_USERNAME=admin
      - ADMIN_PASSWORD=your-secure-password
      - ALLOWED_HOSTS=*
      - PUBLIC_INDEX=True
      - PUBLIC_SNAPSHOTS=True
      - PUBLIC_ADD_VIEW=False
      - PUID=1000
      - PGID=1000

Start the local environment:

mkdir -p data
docker compose up -d

Initialize the archive. If you did not set ADMIN_USERNAME and ADMIN_PASSWORD, also create an admin user interactively:

docker compose run archivebox init --setup
docker compose run archivebox manage createsuperuser

Access ArchiveBox at http://localhost:8000.
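
The container can take a little while to start serving requests. As a rough sketch, the helper below polls the local URL with `curl` until it responds; the function name `wait_for_archivebox`, the retry count, and the two-second delay are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: wait until the ArchiveBox web interface responds.
# $1: URL to poll (default http://localhost:8000)
# $2: maximum number of attempts (default 30)
wait_for_archivebox() {
    url=${1:-http://localhost:8000}
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        # -f: treat HTTP errors as failures, -s: silent, -o /dev/null: discard the body
        if curl -fs -o /dev/null "$url"; then
            echo "ArchiveBox is responding at $url"
            return 0
        fi
        tries=$((tries - 1))
        sleep 2
    done
    echo "ArchiveBox did not respond at $url" >&2
    return 1
}
```

Running `wait_for_archivebox` right after `docker compose up -d` gives a simple signal that the server is reachable before you open it in a browser.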

Viewing Your Archives

Web Interface

Browse your archived snapshots through the main web interface. Each snapshot shows:

Original URL and timestamp
Available archive formats (HTML, PDF, screenshot, etc.)
Tags and metadata
Links to individual archive files

Direct Filesystem Access

Archives are stored in /data/archive/<timestamp>/ folders. Each snapshot contains:

/data/archive/1699876543/
├── index.html           # Archive index page
├── index.json           # Metadata
├── screenshot.png       # Full-page screenshot
├── output.pdf           # PDF rendering
├── singlefile.html      # Self-contained HTML
├── readability/         # Extracted article text
├── warc/                # WARC archive files
├── wget/                # wget mirror
│   └── example.com/
└── media/               # Downloaded media files

Troubleshooting

Chrome/Chromium Errors

If you see Chrome-related errors, the container may need more resources. Ensure your Klutch.sh deployment has adequate memory allocated.

Permission Issues

If archiving fails with permission errors, verify that the PUID and PGID environment variables match the user and group IDs that own the files on the /data volume.
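
On a host where you can reach the data directory (for example the local Docker Compose setup), the numeric owner can be compared against PUID and PGID. This is a minimal sketch; the function name `owner_ids` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the numeric owner of a directory as "uid:gid".
owner_ids() {
    # ls -ldn lists the directory itself with numeric IDs;
    # fields 3 and 4 are the owning user ID and group ID.
    ls -ldn "$1" | awk '{ print $3 ":" $4 }'
}
```

For example, if `owner_ids ./data` prints something other than `1000:1000` while the Compose file sets PUID=1000 and PGID=1000, either change the variables to match or adjust ownership with `chown -R 1000:1000 ./data`.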

Large Archives

For very large archives, consider:

Disabling media downloads (SAVE_MEDIA=False) to save space
Increasing persistent volume size
Using external storage solutions for the /data directory
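
To decide what to trim, it helps to see which snapshots take the most space. The sketch below assumes shell access to the data directory and the /data/archive layout described earlier; the function name `largest_snapshots` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the largest snapshot folders, biggest first.
# $1: archive directory (default /data/archive)
# $2: number of entries to show (default 10)
largest_snapshots() {
    dir=${1:-/data/archive}
    count=${2:-10}
    # du -sk  : total size of each snapshot folder in KiB
    # sort -rn: largest first
    # head    : keep only the top entries
    du -sk "$dir"/*/ 2>/dev/null | sort -rn | head -n "$count"
}
```

For a local setup, `largest_snapshots ./data/archive 5` would print the five largest snapshot folders with their sizes in KiB. Folders dominated by a media/ directory are a hint that SAVE_MEDIA=False may be worth considering.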

Sites Blocking Archiving

Many sites actively block bots. Try:

Setting custom user agents
Adding cookies for authentication
Setting CHECK_SSL_VALIDITY=False for sites with certificate issues (this turns off certificate verification, so use it only when needed)

Next Steps

After deploying ArchiveBox, consider:

Installing the browser extension for easy archiving
Setting up scheduled imports from RSS feeds
Configuring authentication for private archives
Exploring the REST API for automation
Setting up regular backups of your /data volume
Customizing which extractors are enabled based on your needs
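
For the backup item, one simple approach is a timestamped tarball of the data directory. This is a sketch under assumptions: it presumes shell access to the directory (for example the local ./data folder), and the function name `backup_data_dir` and the backups/ destination are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: create a timestamped .tar.gz of the ArchiveBox data directory.
# $1: data directory (default ./data)
# $2: destination directory for backups (default ./backups)
backup_data_dir() {
    src=${1:-./data}
    dest_dir=${2:-./backups}
    stamp=$(date +%Y%m%d-%H%M%S)
    target="$dest_dir/archivebox-$stamp.tar.gz"
    mkdir -p "$dest_dir"
    # -C switches to the parent directory so the archive stores relative paths.
    tar -czf "$target" -C "$(dirname "$src")" "$(basename "$src")"
    # Print the path of the backup that was written.
    echo "$target"
}
```

Calling `backup_data_dir ./data ./backups` prints the path of the new archive. Stopping the container before the backup avoids copying the SQLite database while it is being written.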