ArchiveBox/archivebox/Architecture.md


ArchiveBox UI

Page: Getting Started

What do you want to capture?

  • Save some URLs now -> [Add page]

    • Paste some URLs to archive now
    • Upload a file containing URLs (bookmarks.html export, RSS/XML feed, markdown file, Word doc, PDF, etc.)
    • Pull in URLs to archive from a remote location (e.g. RSS feed URL, remote TXT file, JSON file, etc.)
  • Import URLs from a browser -> [Import page]

    • Desktop: Get the ArchiveBox Chrome/Firefox extension
    • Mobile: Get the ArchiveBox iOS App / Android App
    • Upload a bookmarks.html export file
    • Upload a browser_history.sqlite3 export file
  • Import URLs from a 3rd party bookmarking service -> [Sync page]

    • Pocket
    • Pinboard
    • Instapaper
    • Wallabag
    • Zapier, N8N, IFTTT, etc.
    • Upload a bookmarks.html export, bookmarks.json, RSS, etc. file
  • Archive URLs on a schedule -> [Schedule page]

  • Archive an entire website -> [Crawl page]

    • What starting URL/domain?
    • How deep?
    • Follow links to external domains?
    • Follow links to parent URLs?
    • Maximum number of pages to save?
    • Maximum number of requests/minute?
  • Crawl for URLs with a search engine and save results automatically

    • Save results matching a search query (e.g. "site:example.com")

  • Save a social media feed (e.g. https://x.com/user/1234567890)


Crawls App

  • Archive an entire website -> [Crawl page]
    • What are the seed URLs?
    • How many hops to follow?
    • Follow links to external domains?
    • Follow links to parent URLs?
    • Maximum number of pages to save?
    • Maximum number of requests/minute?
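The crawl-scope questions above can be read as a link-filter predicate. A minimal sketch, assuming hypothetical names (`CrawlConfig`, `should_follow`) rather than the real Crawl model:

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class CrawlConfig:
    # Hypothetical stand-in for the Crawl model's scope fields.
    seed_url: str
    max_hops: int = 2
    follow_external: bool = False
    follow_parents: bool = False
    max_pages: int = 1000
    max_requests_per_minute: int = 60

def should_follow(cfg: CrawlConfig, candidate_url: str, hops_so_far: int) -> bool:
    """Decide whether a discovered link stays inside the crawl scope."""
    if hops_so_far >= cfg.max_hops:
        return False
    seed, cand = urlparse(cfg.seed_url), urlparse(candidate_url)
    if cand.netloc != seed.netloc:
        return cfg.follow_external       # link to an external domain
    if not cand.path.startswith(seed.path):
        return cfg.follow_parents        # same domain, but above the seed's subtree
    return True
```

With the defaults, a crawl seeded at `https://example.com/blog/` follows links within `/blog/` up to 2 hops and skips everything else.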

Scheduler App

  • Archive URLs on a schedule -> [Schedule page]
    • What URL(s)?
    • How often?
    • Do you want to discard old snapshots after a certain amount of time?
    • Any filter rules?
    • Want to be notified when changes are detected? -> redirect to [Alerts app / create new alert(crawl=self)]
  • Choose Schedule interval for checking for new URLs: Schedule.objects.get(pk=xyz)

    • 1 minute
    • 5 minutes
    • 1 hour
    • 1 day
    • Choose Destination Crawl to archive URLs using: Crawl.objects.get(pk=xyz)
      • Tags
      • Persona
      • Created By ID
      • Config
      • Filters
        • URL patterns to include
        • URL patterns to exclude
        • ONLY_NEW= Ignore URLs if already saved once / save URL each time it appears / only save if last save > x time ago
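The three ONLY_NEW behaviors above amount to a small dedupe policy. A minimal sketch, assuming hypothetical names (`OnlyNew`, `should_snapshot`) that are not the real ArchiveBox model fields:

```python
from datetime import datetime, timedelta
from enum import Enum

class OnlyNew(Enum):
    # Hypothetical names for the three ONLY_NEW behaviors listed above.
    SKIP_IF_SAVED = "skip_if_saved"      # ignore URLs already saved once
    SAVE_EVERY_TIME = "save_every_time"  # save each time the URL appears
    SAVE_IF_STALE = "save_if_stale"      # only save if last save > x time ago

def should_snapshot(policy: OnlyNew, last_saved_at,
                    stale_after=timedelta(days=7), now=None) -> bool:
    """Decide whether a URL seen by the scheduler gets a new Snapshot."""
    now = now or datetime.now()
    if last_saved_at is None:
        return True                      # never saved before: always archive
    if policy is OnlyNew.SAVE_EVERY_TIME:
        return True
    if policy is OnlyNew.SAVE_IF_STALE:
        return now - last_saved_at > stale_after
    return False                         # SKIP_IF_SAVED: we already have it
```

The `stale_after` window corresponds to the "x time ago" knob in the option above.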

Sources App (For managing sources that ArchiveBox pulls URLs in from)

  • Add a new source to pull URLs in from (WIZARD)

    • Choose source type:
      • Web UI
      • CLI
      • Local server filesystem path (directory to monitor for new files containing URLs)
      • Remote URL (RSS/JSON/XML feed)
      • Chrome browser profile sync (log in with a Google account to pull bookmarks/history)
      • Pocket, Pinboard, Instapaper, Wallabag, etc.
      • Zapier, N8N, IFTTT, etc.
      • Google Drive (directory to monitor for new files containing URLs)
      • Remote server FTP/SFTP/SCP path (directory to monitor for new files containing URLs)
      • AWS/S3/B2/GCP bucket (directory to monitor for new files containing URLs)
      • xBrowserSync (log in to pull bookmarks)
    • Choose extractor
      • auto
      • RSS
      • Pocket
      • etc.
    • Specify extra Config, e.g.
      • credentials
      • extractor tuning options (e.g. verify_ssl, cookies, etc.)
  • Provide credentials for the source

    • API Key
    • Username / Password
    • OAuth
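The "auto" extractor above has to find URLs in whatever text-ish file a source delivers. A naive sketch of that idea (the regex and function name are illustrative, not ArchiveBox's actual parser):

```python
import re

# Matches http(s) URLs in HTML, RSS, markdown, or plain text alike.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def extract_urls(text: str) -> list:
    """Naive 'auto' extractor: pull every unique http(s) URL out of any
    text-based source file, preserving first-seen order."""
    seen, urls = set(), []
    for url in URL_RE.findall(text):
        url = url.rstrip('.,)')          # trim trailing punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

A real RSS or Pocket extractor would parse the format properly; this fallback is what "auto" degrades to for unknown file types.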

Alerts App

  • Create a new alert, choose condition

    • Get notified when a site goes down (<x% success ratio for Snapshots)
    • Get notified when a site changes visually more than x% (screenshot diff)
    • Get notified when a site's text content changes more than x% (text diff)
    • Get notified when a keyword appears
    • Get notified when a keyword disappears
    • Get notified when an AI prompt returns a matching result
  • Choose alert threshold:

    • any condition is met
    • all conditions are met
    • condition is met for x% of URLs
    • condition is met for x% of time
  • Choose how to notify: (List[AlertDestination])

    • maximum alert frequency
    • destination type: email / Slack / Webhook / Google Sheet / logfile
    • destination info:
      • email address(es)
      • Slack channel
      • Webhook URL
  • Choose scope:

    • Choose ArchiveResult scope (extractors): (a query that returns ArchiveResult.objects QuerySet)

      • All extractors
      • Only screenshots
      • Only readability / mercury text
      • Only video
      • Only html
      • Only headers
    • Choose Snapshot scope (URL): (a query that returns Snapshot.objects QuerySet)

      • All domains
      • Specific domain
      • All domains in a tag
      • All domains in a tag category
      • All URLs matching a certain regex pattern
    • Choose crawl scope: (a query that returns Crawl.objects QuerySet)

      • All crawls
      • Specific crawls
      • crawls by a certain user
      • crawls using a certain persona
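The text-diff condition and the threshold choices above can be sketched with stdlib pieces; function names and the 50% default here are illustrative assumptions, not the real Alerts API:

```python
import difflib

def text_change_pct(old_text: str, new_text: str) -> float:
    """Percent change between two snapshots' extracted text (the
    'text diff' condition), via difflib's similarity ratio."""
    ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    return (1 - ratio) * 100

def alert_triggered(mode: str, condition_results: list, pct: float = 50.0) -> bool:
    """Aggregate per-URL condition results per the threshold choices above."""
    if not condition_results:
        return False
    if mode == "any":
        return any(condition_results)
    if mode == "all":
        return all(condition_results)
    if mode == "pct_of_urls":
        return 100 * sum(condition_results) / len(condition_results) >= pct
    raise ValueError(f"unknown threshold mode: {mode}")
```

The screenshot-diff condition would follow the same shape, with an image-similarity metric in place of `SequenceMatcher`.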

```python
class AlertDestination(models.Model):
    destination_type   # email, slack, webhook, google_sheet, local logfile, b2/s3/gcp bucket, etc.
    maximum_frequency
    filter_rules
    credentials
    alert_template     # Jinja2 JSON/text template that gets populated with alert contents
```
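The alert_template field calls for a Jinja2 template; a stdlib `string.Template` stand-in is enough to show the populate-and-deliver step (the template string and function name are illustrative):

```python
from string import Template

# Stand-in for the Jinja2 alert_template stored on an AlertDestination.
ALERT_TEMPLATE = Template('[$condition] $url: $detail')

def render_alert(condition: str, url: str, detail: str) -> str:
    """Fill the destination's template with the alert contents
    before sending it to email/Slack/webhook/etc."""
    return ALERT_TEMPLATE.substitute(condition=condition, url=url, detail=detail)
```

A per-destination Jinja2 template would additionally allow loops and conditionals, e.g. listing every matching Snapshot in one digest message.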