Deploying ArchiveBox

ArchiveBox is a powerful, open-source self-hosted web archiving solution that lets you preserve content from websites in multiple formats. Whether you’re saving bookmarks, preserving evidence for legal cases, backing up social media content, or archiving research papers, ArchiveBox captures web pages in durable, long-term formats like HTML, PDF, PNG, WARC, and more.

Unlike centralized archiving services, ArchiveBox gives you complete control over your data while supporting imports from browser history, bookmarks, RSS feeds, Pocket, Pinboard, and numerous other sources. With built-in support for Chrome/Chromium, wget, yt-dlp, and readability extractors, ArchiveBox produces high-fidelity archives in standard formats that are intended to remain readable for decades.

This guide walks you through deploying ArchiveBox on Klutch.sh using Docker, complete with persistent storage for your archive data.

Why Choose ArchiveBox?

Multi-Format Archiving

Saves pages as HTML, PDF, PNG screenshots, WARC, and singlefile HTML, and extracts article text automatically.

Privacy-First

Because ArchiveBox is self-hosted, you own your data. Archive both public and private content while maintaining control.

Extensive Input Sources

Import from browser bookmarks, history, RSS feeds, Pocket, Pinboard, Reddit saved posts, and many more.

100% Open Source

MIT-licensed with active development, comprehensive documentation, and a rich community of contributors.

Prerequisites

Before deploying ArchiveBox on Klutch.sh, ensure you have the following:

  • A Klutch.sh account
  • A GitHub account and a repository to hold your deployment files

Architecture Overview

ArchiveBox uses a Django-powered web interface with SQLite for metadata storage. All archived content is saved to the filesystem in organized folders:

┌─────────────────────────────────────────────────────────────┐
│                         ArchiveBox                          │
│                         (Port 8000)                         │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │   Django    │  │   SQLite    │  │   Archive Storage   │  │
│  │   Web UI    │  │  Database   │  │   /data/archive/    │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
├─────────────────────────────────────────────────────────────┤
│  Extractors: Chrome, wget, yt-dlp, readability, singlefile  │
└─────────────────────────────────────────────────────────────┘

Output Formats

For each URL archived, ArchiveBox creates a snapshot folder containing:

Format   Description
HTML     Original HTML with CSS/JS, plus singlefile HTML
PDF      Full-page PDF rendering
PNG      Full-page screenshot
WARC     Web ARChive format for archival purposes
TXT      Extracted article text via readability
JSON     Metadata and page information
Media    Videos, audio, and subtitles via yt-dlp
Git      Cloned repositories from GitHub/GitLab

Creating the Dockerfile

Create a Dockerfile in your project root that uses the official ArchiveBox image:

Dockerfile
FROM archivebox/archivebox:latest
# Set environment variables for initial setup
ENV DATA_DIR=/data
ENV ALLOWED_HOSTS=*
# Expose the web interface port
EXPOSE 8000
# The base image handles initialization and server startup
# Data is stored in /data which should be a persistent volume

Environment Variables

Configure ArchiveBox behavior through environment variables in the Klutch.sh dashboard:

Authentication & Access

Variable          Description                                        Default
ADMIN_USERNAME    Username for the admin account (set on first run)  None
ADMIN_PASSWORD    Password for the admin account (set on first run)  None
PUBLIC_INDEX      Allow viewing snapshot list without login          True
PUBLIC_SNAPSHOTS  Allow viewing snapshot content without login       True
PUBLIC_ADD_VIEW   Allow submitting URLs without login                False

Archive Settings

Variable            Description                                  Default
TIMEOUT             Max download time per method (seconds)       60
MEDIA_TIMEOUT       Max download time for media files (seconds)  3600
CHECK_SSL_VALIDITY  Enforce HTTPS certificate validation         True
RESOLUTION          Screenshot resolution (width,height)         1440,2000

Archive Method Toggles

Variable              Description                 Default
SAVE_TITLE            Extract page title          True
SAVE_FAVICON          Save site favicon           True
SAVE_WGET             Archive with wget           True
SAVE_WARC             Save WARC archive           True
SAVE_PDF              Generate PDF                True
SAVE_SCREENSHOT       Capture screenshot          True
SAVE_DOM              Save DOM dump               True
SAVE_SINGLEFILE       Create singlefile HTML      True
SAVE_READABILITY      Extract article text        True
SAVE_MEDIA            Download media with yt-dlp  True
SAVE_GIT              Clone git repositories      True
SAVE_ARCHIVE_DOT_ORG  Submit to Archive.org       True

User & Permissions (Docker)

Variable            Description                  Default
PUID                User ID for file ownership   911
PGID                Group ID for file ownership  911
OUTPUT_PERMISSIONS  File permissions for output  644

Project Structure

Your ArchiveBox deployment repository should have the following structure:

archivebox-deploy/
├── Dockerfile
└── README.md

Deploying to Klutch.sh

  1. Push your repository to GitHub

    Create a new repository on GitHub and push your Dockerfile:

    Terminal window
    git init
    git add .
    git commit -m "Initial ArchiveBox deployment configuration"
    git branch -M main
    git remote add origin https://github.com/yourusername/archivebox-deploy.git
    git push -u origin main
  2. Connect to Klutch.sh

    Navigate to klutch.sh/app and sign in with your GitHub account. Click New Project to begin the deployment process.

  3. Select your repository

    Choose the GitHub repository containing your ArchiveBox Dockerfile. Klutch.sh will automatically detect the Dockerfile in your project root.

  4. Configure environment variables

    Add the following environment variables in the Klutch.sh dashboard:

    • ADMIN_USERNAME - Your admin username (e.g., admin)
    • ADMIN_PASSWORD - A strong password for the admin account
    • PUBLIC_ADD_VIEW - Set to False to require login for adding URLs
    • ALLOWED_HOSTS - Set to * or your specific domain
  5. Set the internal port

    Configure the internal port to 8000 where ArchiveBox serves its web interface.

  6. Add persistent storage

    ArchiveBox requires persistent storage for your archive data. Add a persistent volume:

    Mount Path   Size
    /data        10 GB+
  7. Deploy your application

    Click Deploy to start the deployment process. Klutch.sh will build the Docker image and deploy your ArchiveBox instance.
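
For step 4, the ADMIN_PASSWORD value can be generated rather than invented by hand. The sketch below is one way to do that in a typical Linux or macOS shell. The function name gen_admin_password is hypothetical, and the snippet assumes head, base64, and /dev/urandom are available.

```shell
# Hypothetical helper: print a random 32-character password.
# 24 random bytes encode to exactly 32 base64 characters (no padding).
gen_admin_password() {
  head -c 24 /dev/urandom | base64 | tr -d '\n'
}
```

Running gen_admin_password prints a value that can be pasted into the ADMIN_PASSWORD field in the dashboard.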

First-Time Setup

After deployment, you’ll need to initialize your archive and access the admin interface:

  1. Access your ArchiveBox instance

    Navigate to your deployed ArchiveBox URL (e.g., https://example-app.klutch.sh). You should see the ArchiveBox web interface.

  2. Log in to the Admin Panel

    Click Admin in the navigation and log in with the ADMIN_USERNAME and ADMIN_PASSWORD you configured in the environment variables.

  3. Add your first URL

    From the main interface, enter a URL in the input field and click Add. ArchiveBox will begin archiving the page using all enabled extractors.

Adding URLs to Your Archive

Via the Web Interface

The simplest way to add URLs is through the web UI:

  1. Navigate to your ArchiveBox instance
  2. Enter a URL in the “Add new URLs” input field
  3. Click Add to start archiving

Via the Browser Extension

Install the ArchiveBox Browser Extension to archive pages directly from Chrome or Firefox:

  1. Install the extension from your browser’s extension store
  2. Configure it to point to your ArchiveBox instance URL
  3. Click the extension icon to archive the current page

Supported Input Sources

ArchiveBox can import URLs from many sources:

  • Browser Exports: Chrome, Firefox, Safari bookmarks and history
  • Bookmark Services: Pocket, Pinboard, Instapaper, Wallabag
  • Social Media: Reddit saved posts, Twitter bookmarks
  • RSS Feeds: Any RSS/Atom feed URL
  • Plain Text: Lists of URLs in any text file format
  • HTML Files: Pages containing links to archive
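
For the plain-text case, it can help to pull the URLs out of a file and deduplicate them before importing. A minimal sketch using grep; the function name extract_urls and the file names in the usage note are hypothetical:

```shell
# Hypothetical helper: print each unique http(s) URL found in a text file.
# The pattern is deliberately simple; punctuation directly after a URL
# (such as a trailing period) is kept, so review the list before importing.
extract_urls() {
  grep -Eo 'https?://[^[:space:]"<>]+' "$1" | LC_ALL=C sort -u
}
```

For example, extract_urls notes.txt > urls.txt writes one URL per line, which can then be pasted into the “Add new URLs” field.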

Archive Features

Recursive Archiving

Set the archive depth to 1 (the --depth=1 flag on the command line, or the depth option when adding URLs in the web interface) to archive not just the URL but also every page it links to:

https://example.com/blog/

This will archive the blog index page plus all linked articles.
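
With the local Docker Compose setup described later in this guide, the same thing can be done from the command line. This is a minimal sketch, assuming a Compose service named archivebox and a CLI that accepts only depth 0 or 1 (as in the 0.7 series). The wrapper function archive_with_depth is hypothetical.

```shell
# Hypothetical wrapper around "archivebox add" for the Compose service.
# Usage: archive_with_depth <depth> <url>
archive_with_depth() {
  case "$1" in
    0|1) ;;  # assumed valid values: 0 = just the URL, 1 = URL plus linked pages
    *) echo "depth must be 0 or 1" >&2; return 1 ;;
  esac
  if [ -z "$2" ]; then
    echo "usage: archive_with_depth <depth> <url>" >&2
    return 1
  fi
  docker compose run --rm archivebox add --depth="$1" "$2"
}
```

Calling archive_with_depth 1 'https://example.com/blog/' would then queue the index page and the pages it links to.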

Scheduled Archiving

ArchiveBox supports scheduled imports from RSS feeds and other sources. Configure scheduled tasks through the admin interface under Scheduled Tasks.

Full-Text Search

ArchiveBox includes full-text search across your archived content using ripgrep. Search from the main interface to find content within your archives.

Advanced Configuration

Disabling Extractors

If certain extractors aren’t needed or are causing issues, disable them via environment variables:

SAVE_MEDIA=False # Disable yt-dlp media downloads
SAVE_ARCHIVE_DOT_ORG=False # Don't submit to Archive.org
SAVE_GIT=False # Don't clone git repositories

Archiving Sites Requiring Login

For sites that require authentication, you can configure cookies:

  1. Export cookies from your browser using a cookie export extension
  2. Upload the cookies.txt file to your ArchiveBox /data volume
  3. Set COOKIES_FILE=/data/cookies.txt in environment variables
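
Tools such as wget read cookies in the Netscape cookies.txt format, where each cookie line has seven tab-separated fields. Before uploading, the exported file can be sanity-checked with something like the following. The function name check_cookies_file is hypothetical, and the check covers only the field count, not whether the cookies are still valid.

```shell
# Hypothetical check: succeed only if every line that is not blank and
# does not start with "#" has exactly 7 tab-separated fields (domain,
# subdomain flag, path, secure flag, expiry, name, value).
check_cookies_file() {
  awk -F '\t' '
    /^#/ || /^[[:space:]]*$/ { next }   # skip "#" lines and blank lines
    NF != 7 { bad++ }
    END { exit (bad > 0) }
  ' "$1"
}
```

A typical use is check_cookies_file cookies.txt && echo "looks OK" before copying the file to the /data volume.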

Custom User Agent

If sites are blocking ArchiveBox, customize the user agent:

WGET_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
CHROME_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

Custom Domain Setup

To use a custom domain with your ArchiveBox deployment:

  1. Add your domain in Klutch.sh

    Navigate to your project settings in the Klutch.sh dashboard and add your custom domain (e.g., archive.yourdomain.com).

  2. Configure DNS records

    Add a CNAME record pointing to your Klutch.sh deployment:

    Type    Name      Value
    CNAME   archive   example-app.klutch.sh
  3. Update ALLOWED_HOSTS

    Update the ALLOWED_HOSTS environment variable to include your custom domain:

    ALLOWED_HOSTS=archive.yourdomain.com,example-app.klutch.sh
  4. Wait for SSL provisioning

    Klutch.sh automatically provisions SSL certificates for custom domains. This may take a few minutes.

Local Development with Docker Compose

For local development and testing, use Docker Compose:

docker-compose.yml
services:
  archivebox:
    image: archivebox/archivebox:latest
    restart: unless-stopped
    ports:
      - "8000:8000"
    volumes:
      # Bind-mount ./data on the host to /data in the container
      - ./data:/data
    environment:
      - ADMIN_USERNAME=admin
      - ADMIN_PASSWORD=your-secure-password
      - ALLOWED_HOSTS=*
      - PUBLIC_INDEX=True
      - PUBLIC_SNAPSHOTS=True
      - PUBLIC_ADD_VIEW=False
      - PUID=1000
      - PGID=1000

Start the local environment:

Terminal window
mkdir -p data
docker compose up -d

Initialize and create admin user:

Terminal window
docker compose run archivebox init --setup
docker compose run archivebox manage createsuperuser

Access ArchiveBox at http://localhost:8000.

Viewing Your Archives

Web Interface

Browse your archived snapshots through the main web interface. Each snapshot shows:

  • Original URL and timestamp
  • Available archive formats (HTML, PDF, screenshot, etc.)
  • Tags and metadata
  • Links to individual archive files

Direct Filesystem Access

Archives are stored in /data/archive/<timestamp>/ folders. Each snapshot contains:

/data/archive/1699876543/
├── index.html # Archive index page
├── index.json # Metadata
├── screenshot.png # Full-page screenshot
├── output.pdf # PDF rendering
├── singlefile.html # Self-contained HTML
├── readability/ # Extracted article text
├── warc/ # WARC archive files
├── wget/ # wget mirror
│ └── example.com/
└── media/ # Downloaded media files
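
When an extractor is disabled or fails, the corresponding entry is simply absent from the snapshot folder. A small sketch for checking one snapshot against the layout listed above; the function name snapshot_outputs is hypothetical:

```shell
# Hypothetical helper: report which of the outputs listed above exist
# in one snapshot folder. Usage: snapshot_outputs /data/archive/<timestamp>
snapshot_outputs() {
  for name in index.html index.json screenshot.png output.pdf \
              singlefile.html readability warc wget media; do
    if [ -e "$1/$name" ]; then
      echo "present  $name"
    else
      echo "missing  $name"
    fi
  done
}
```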

Troubleshooting

Chrome/Chromium Errors

If you see Chrome-related errors, the container may need more resources. Ensure your Klutch.sh deployment has adequate memory allocated.

Permission Issues

If archiving fails with permission errors, verify that the PUID and PGID environment variables match the user and group IDs that own the files in the /data volume.
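
One way to check is to print the numeric owner of the data directory and compare it with the PUID and PGID values. A minimal sketch for the local Docker setup, where the data directory is visible on the host; the function name dir_owner is hypothetical:

```shell
# Hypothetical helper: print "uid:gid" for a directory.
# ls -n shows numeric IDs; fields 3 and 4 are the owner and group.
dir_owner() {
  ls -ldn "$1" | awk '{ print $3 ":" $4 }'
}
```

If dir_owner ./data prints 1000:1000, for instance, then PUID=1000 and PGID=1000 would match it.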

Large Archives

For very large archives, consider:

  • Disabling media downloads (SAVE_MEDIA=False) to save space
  • Increasing persistent volume size
  • Using external storage solutions for the /data directory
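
Before resizing the volume, it can be useful to see which snapshots take the most space. A minimal sketch using du, assuming shell access to the archive directory; the function name largest_snapshots is hypothetical:

```shell
# Hypothetical helper: list the N largest snapshot folders (size in KB).
# Usage: largest_snapshots /data/archive 10   (N defaults to 10)
largest_snapshots() {
  du -sk "$1"/*/ | sort -rn | head -n "${2:-10}"
}
```

Snapshots dominated by a media/ folder are the usual candidates for SAVE_MEDIA=False.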

Sites Blocking Archiving

Many sites actively block bots. Try:

  • Setting custom user agents
  • Adding cookies for authentication
  • Setting CHECK_SSL_VALIDITY=False for sites with certificate issues

Next Steps

After deploying ArchiveBox, consider:

  • Installing the browser extension for easy archiving
  • Setting up scheduled imports from RSS feeds
  • Configuring authentication for private archives
  • Exploring the REST API for automation
  • Setting up regular backups of your /data volume
  • Customizing which extractors are enabled based on your needs
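
For the backup item, a plain tar archive of the data directory is often enough. The sketch below targets the local Docker Compose setup, where ./data is a host directory. The function name backup_data and the backups path are hypothetical. Because the SQLite database lives alongside the archive files, it is probably safest to stop the container before copying.

```shell
# Hypothetical helper: write a timestamped .tar.gz of a data directory
# and print the archive path. Usage: backup_data ./data ./backups
# Keep the destination outside the data directory itself.
backup_data() {
  mkdir -p "$2" || return 1
  out="$2/archivebox-$(date +%Y%m%d-%H%M%S).tar.gz"
  # -C switches to the parent directory so the archive holds relative paths
  tar -czf "$out" -C "$(dirname "$1")" "$(basename "$1")" || return 1
  echo "$out"
}
```

The archive can be inspected afterwards with tar -tzf before it is moved to off-site storage.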