warc topic
CommonCrawlDocumentDownload
A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks like Apache POI and Apache Tika.
ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
replayweb.page
Serverless replay of web archives directly in the browser
news-crawl
News crawling with StormCrawler - stores content as WARC
bitextor
Bitextor generates translation memories from multilingual websites
WarcDB
WarcDB: Web crawl data as SQLite databases.
grab-site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
wail
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
ipwb
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS