warc topic

List warc repositories

CommonCrawlDocumentDownload

58 stars, 20 forks

A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks like Apache POI and Apache Tika.

ArchiveBox

20.1k stars, 1.1k forks, 172 watchers

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

replayweb.page

635 stars, 52 forks

Serverless replay of web archives directly in the browser

news-crawl

253 stars, 31 forks

News crawling with StormCrawler - stores content as WARC

bitextor

283 stars, 43 forks

Bitextor generates translation memories from multilingual websites

WarcDB

384 stars, 11 forks

WarcDB: Web crawl data as SQLite databases.

grab-site

1.3k stars, 125 forks

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

conifer

1.5k stars, 118 forks

Collect and revisit web pages.

wail

345 stars, 32 forks

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

ipwb

596 stars, 39 forks

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS