Chrome / Chromium Setup
By default, ArchiveBox looks for any existing installed version of Chrome/Chromium and uses it if found. You can optionally install a specific version and set the environment variable CHROME_BINARY to force ArchiveBox to use that one, e.g.:
CHROME_BINARY=google-chrome-beta
CHROME_BINARY=/usr/bin/chromium-browser
CHROME_BINARY='/Applications/Chromium.app/Contents/MacOS/Chromium'
CHROME_BINARY='~/Library/Caches/ms-playwright/chromium-857950/chrome-mac/Chromium.app/Contents/MacOS/Chromium'
If you don't already have Chrome installed, I recommend installing Chromium instead of Google Chrome: it's the open-source project that Chrome is built on, and it doesn't send as much tracking data to Google.
Check for existing Chrome/Chromium install:
google-chrome --version || chromium-browser --version
Google Chrome 122.0.6261.49 beta # should be >v111
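The version check above can also be scripted. The following is a minimal sketch, not part of ArchiveBox: the function names `chrome_major_version`, `chrome_version_ok`, and `first_usable_chrome` are hypothetical helpers, the minimum of 111 is taken from the note above, and the parsing assumes `--version` prints something of the form `<name> <major>.<minor>...`.

```shell
# Extract the major version from a `--version` string,
# e.g. "Google Chrome 122.0.6261.49 beta" -> 122
chrome_major_version() {
    printf '%s\n' "$1" | sed -n 's/^[^0-9]*\([0-9][0-9]*\)\..*/\1/p'
}

# Succeed if the major version in $1 is at least $2 (default: 111).
chrome_version_ok() {
    major=$(chrome_major_version "$1")
    [ -n "$major" ] && [ "$major" -ge "${2:-111}" ]
}

# Print the first binary from the argument list that is on PATH and new enough.
first_usable_chrome() {
    for bin in "$@"; do
        if command -v "$bin" >/dev/null 2>&1 &&
           chrome_version_ok "$("$bin" --version 2>/dev/null)"; then
            printf '%s\n' "$bin"
            return 0
        fi
    done
    return 1
}

# Example (the candidate names are just the ones mentioned in this guide):
# first_usable_chrome chromium-browser chromium google-chrome google-chrome-beta
```

If the function prints a name, that value is a reasonable candidate for CHROME_BINARY; if it prints nothing, one of the install options below is probably needed.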
Installing Chromium
⭐️ Any OS (recommended)
playwright (by the Microsoft team) and puppeteer (by the Google team) are two options to get stable, repeatable Chromium distributions on many OSs.
pip install --upgrade --ignore-installed playwright
playwright install --with-deps chromium
# alternatively use puppeteer to get Chromium instead of playwright:
npm install puppeteer
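After a Playwright install, the Chromium binary sits inside a versioned cache directory, so its path changes between releases. The sketch below looks it up instead of hard-coding the build number. It is an illustration under assumptions: `find_playwright_chromium` is a hypothetical helper, and the `chrome-linux/chrome` and `chrome-mac/Chromium.app/...` layouts are copied from the example paths in this guide and may differ in other Playwright versions.

```shell
# Print the path of the most recently modified Playwright Chromium build.
# $1: Playwright cache directory, e.g. "$HOME/.cache/ms-playwright" on Linux
#     or "$HOME/Library/Caches/ms-playwright" on macOS
find_playwright_chromium() {
    cache_dir=$1
    # ls -t sorts by modification time, newest first; -d lists the dirs themselves.
    newest=$(ls -td "$cache_dir"/chromium-*/ 2>/dev/null | head -n 1)
    [ -n "$newest" ] || return 1
    # Check the known relative locations of the executable inside the build dir.
    for rel in chrome-linux/chrome chrome-mac/Chromium.app/Contents/MacOS/Chromium; do
        if [ -x "${newest}${rel}" ]; then
            printf '%s\n' "${newest}${rel}"
            return 0
        fi
    done
    return 1
}

# Example: point ArchiveBox at whatever build Playwright installed
# archivebox config --set CHROME_BINARY="$(find_playwright_chromium "$HOME/.cache/ms-playwright")"
```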
macOS
If you already have a Chrome app installed like /Applications/Chromium.app, you don't need to run this.
brew install --cask chromium
Ubuntu/Debian
If you already have chromium-browser >= v111 installed (run chromium-browser --version to check), you don't need to run this.
sudo apt update
sudo apt install chromium-browser
# or on some systems:
sudo apt install chromium
Installing Google Chrome
macOS
If you already have /Applications/Google Chrome.app, you don't need to run this.
brew install --cask google-chrome
Ubuntu/Debian
If you already have google-chrome >= v111 installed (run google-chrome --version to check), you don't need to run this.
wget -q -O - 'https://dl-ssl.google.com/linux/linux_signing_key.pub' | sudo apt-key add -
echo 'deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main' | sudo tee /etc/apt/sources.list.d/google-chrome.list
sudo apt update
sudo apt install -y google-chrome-stable
Troubleshooting Chromium Install
If you encounter problems setting up Google Chrome or Chromium, see the Troubleshooting page.
Setting Up a Chromium User Profile
You may choose to set up a Chrome/Chromium user profile in order to use your cookies/sessions to log into sites behind authentication/paywall during archiving.
Note: not all extractors use Chrome (e.g. wget, mercury, media), so COOKIES_FILE should be set up as well after this.
[!WARNING] We strongly recommend you use separate burner credentials dedicated to archiving, e.g. don't provide cookies for your normal daily Facebook/Instagram/Google/etc. accounts as server responses and page content will often contain your name/email/PII, session cookies, private tokens, etc. which then get preserved in your snapshots for eternity.
Future viewers of your archive may be able to use any reflected archived session tokens to log in as you, or at the very least, associate the content with your real identity. Even if this tradeoff seems acceptable now or you plan to keep your archive data private, you may want to share a snapshot with others in the future, and snapshots are very hard to sanitize/anonymize after-the-fact!
For this reason, it's best to set up dedicated fake profile accounts for each site you want to archive, and consider them burned if you ever share any of your archived snapshots of those sites with untrusted people.
Docker VNC Setup
If using ArchiveBox in Docker, the easiest way to set up session credentials is by remote controlling the ArchiveBox Chrome browser over VNC, and using it to log in to the sites you want to save.
- Enable the novnc server using these settings in your docker-compose.yml:
services:
  archivebox:
    # ...
    volumes:
      # ...
      - ./data/personas/Default:/data/personas/Default
    environment:
      - CHROME_USER_DATA_DIR=/data/personas/Default/chrome_profile
      - DISPLAY=novnc:0.0
  novnc:
    image: theasp/novnc:latest
    environment:
      - DISPLAY_WIDTH=1920
      - DISPLAY_HEIGHT=1080
      - RUN_XTERM=no
    ports:
      - "8080:8080"
- Start the novnc window server container
docker compose up -d novnc
# wait a few seconds for novnc to start...
- Start ArchiveBox's Chrome inside Docker
docker compose run archivebox /usr/bin/chromium-browser --user-data-dir=/data/personas/Default/chrome_profile --profile-directory=Default --disable-gpu --disable-features=dbus --disable-dev-shm-usage --start-maximized --no-sandbox --disable-setuid-sandbox --no-zygote --disable-sync --no-first-run
(make sure you set DISPLAY & CHROME_USER_DATA_DIR and added the line to volumes: above first!)
- Open http://localhost:8080/vnc.html in your browser. You should see a remote Linux desktop with Chrome open, allowing you to remote-control ArchiveBox's browser. Use it to log into any sites where you want to save credentials.
- ✅ Close the browser, stop & remove novnc, and then run archivebox normally. It will use the profile stored in CHROME_USER_DATA_DIR=/data/personas/Default/chrome_profile going forward, so you should now be able to archive sites as if you were logged in!
# stop the archivebox and novnc containers
docker compose down
docker compose down --remove-orphans
# edit docker-compose.yml to remove/comment out the novnc: section
# test it all out by archiving something hosted on one of the domains you logged in to
docker compose run archivebox add 'https://private.example.com/some/site/requiring/login.html'
# check the SingleFile, Screenshot, DOM, or PDF snapshot output (only these use the Chrome profile)
# make sure the content appears as your logged-in user would see it
Under the hood this uses Xvfb + Fluxbox + novnc to provide a virtual display, window manager, and VNC server + novnc websocket viewer.
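The "wait a few seconds for novnc to start" step can be made explicit by polling the viewer URL before launching Chrome. This is a minimal sketch that assumes `curl` is installed on the host; `wait_for_url` is a hypothetical helper name, and the attempt count and delay defaults are arbitrary.

```shell
# Poll a URL until it responds successfully or the attempts run out.
# $1: URL, $2: max attempts (default 30), $3: seconds between attempts (default 1)
wait_for_url() {
    url=$1
    attempts=${2:-30}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # --fail makes curl return non-zero on HTTP errors as well as connection errors
        if curl --silent --fail --output /dev/null --max-time 2 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example, using the port published in the docker-compose.yml above:
# docker compose up -d novnc
# wait_for_url 'http://localhost:8080/vnc.html' || echo 'novnc did not come up' >&2
```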
Non-Docker Setup (Local Host)
If running ArchiveBox on your local machine without Docker, this process is fairly easy.
First, tell archivebox where you want to store your Chrome profile.
# replace /Users/alice/.archivebox_chrome with a path to store your profile in
archivebox config --set CHROME_USER_DATA_DIR=/Users/alice/.archivebox_chrome
Then run Chrome (with that profile dir) to open a visible browser window where you can log into things, e.g.:
# find your CHROME_BINARY path by running
archivebox version | grep -i chrome
# --user-data-dir should point at the same directory you set as CHROME_USER_DATA_DIR
# macOS example (using Google Chrome.app)
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --user-data-dir="$HOME/ArchiveBox/personas/Default/chrome_profile"
# Linux example (using Playwright Chromium)
/root/.cache/ms-playwright/chromium-1105/chrome-linux/chrome --user-data-dir="$HOME/archivebox/data/personas/Default/chrome_profile"
Once it's open, log in to all the sites you want to be logged in to for archiving, then close/quit Chrome.
✅ All ArchiveBox extractors that use Chrome (e.g. Screenshot, PDF, DOM, Singlefile) should now use that profile.
Don't forget to set up COOKIES_FILE for the rest!
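One caveat when passing a profile path on the command line: bash, for example, does not expand `~` when it appears in the middle of an argument such as `--user-data-dir=~/...`, so Chrome may receive a literal `~` and create the profile somewhere unexpected. A minimal sketch of a workaround follows; `expand_tilde` is a hypothetical helper, not an ArchiveBox command.

```shell
# Replace a leading "~" or "~/" with $HOME; leave every other path unchanged.
expand_tilde() {
    case $1 in
        "~")   printf '%s\n' "$HOME" ;;
        "~/"*) rest=${1#\~/}; printf '%s\n' "$HOME/$rest" ;;
        *)     printf '%s\n' "$1" ;;
    esac
}

# Example (the binary name and profile location are placeholders):
# chromium-browser --user-data-dir="$(expand_tilde '~/.archivebox_chrome')"
```

Writing the path with `$HOME` inside double quotes achieves the same thing without a helper.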
Non-Docker Setup (Remote Host)
You must set up the profile using the exact same version of Chrome that ArchiveBox is running (which can be found with archivebox version).
You can download the latest Chromium with pip install playwright && playwright install --with-deps chromium, or get older versions of Chrome from https://chromium.cypress.io.
General steps:
- Make sure you are running the same OS and have the same version of Chrome installed as the host running ArchiveBox
- Follow the Non-Docker Setup (Local Host) steps above to create a Chrome profile locally
- Rsync your Chrome profile from your local machine to the remote ArchiveBox host
rsync --archive /path/to/profile/ remotehost:/path/to/profile/on/remote/host/
- Configure ArchiveBox on the remote host to use the rsync'ed Chrome profile
archivebox config --set CHROME_USER_DATA_DIR=/path/to/profile/on/remote/host
You may need to run chown -R archivebox /path/to/profile/on/remote/host on the remote host to make the profile editable by the archivebox user on that machine.
✅ All ArchiveBox extractors that use Chrome (e.g. Screenshot, PDF, DOM, Singlefile) should now use that profile.
Don't forget to set up COOKIES_FILE for the rest!
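Since the profile must be created with the same Chrome version that the remote host runs, it can help to compare the two version strings before syncing. The sketch below is illustrative only: `chrome_full_version` and `same_chrome_version` are hypothetical helpers, and the usage line assumes SSH access to `remotehost` and that both machines expose a `chromium-browser` binary.

```shell
# Extract the dotted version number from a `--version` string,
# e.g. "Google Chrome 122.0.6261.49 beta" -> 122.0.6261.49
chrome_full_version() {
    printf '%s\n' "$1" | sed -n 's/^[^0-9]*\([0-9][0-9.]*[0-9]\).*/\1/p'
}

# Succeed only if both strings contain a version number and the numbers match.
same_chrome_version() {
    local_v=$(chrome_full_version "$1")
    remote_v=$(chrome_full_version "$2")
    [ -n "$local_v" ] && [ "$local_v" = "$remote_v" ]
}

# Example:
# same_chrome_version "$(chromium-browser --version)" \
#     "$(ssh remotehost chromium-browser --version)" \
#     || echo 'Chrome versions differ; the profile may not load on the remote host' >&2
```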
More Info & Troubleshooting
- https://github.com/ArchiveBox/ArchiveBox/issues/952
- https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#archiving-private-content
- https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#%EF%B8%8F-things-to-watch-out-for-%EF%B8%8F
- https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#publishing
- https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#chrome_user_data_dir
- https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#chrome_binary
- https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#cookies_file