warc topics

CommonCrawlDocumentDownload

58

Stars

20

Forks

A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks like Apache POI and Apache Tika.

centic9

cdx-files

commoncrawl

java

mime-types

ArchiveBox

21.7k

Stars

1.2k

Forks

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

ArchiveBox

archivebox

backups

bookmark-archiver

browser-bookmarks

replayweb.page

688

Stars

55

Forks

Serverless replay of web archives directly in the browser

webrecorder

replay-web-page

service-worker

warc

wayback-machine

news-crawl

253

Stars

31

Forks

News crawling with StormCrawler - stores content as WARC

commoncrawl

apache-storm

common-crawl

commoncrawl

crawler

bitextor

287

Stars

43

Forks

Bitextor generates translation memories from multilingual websites

bitextor

apertium

bicleaner

corpus-generator

corpus-processing

WarcDB

384

Stars

11

Forks

WarcDB: Web crawl data as SQLite databases.

Florents-Tselai

cli

crawling

database

sqlite

grab-site

1.3k

Stars

125

Forks

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

ArchiveTeam

archiving

crawl

crawler

spider

conifer

1.5k

Stars

118

Forks

Collect and revisit web pages.

Rhizome-Conifer

wail

345

Stars

32

Forks

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

machawk1

gui

heritrix

openwayback

pyinstaller

ipwb

596

Stars

39

Forks

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

oduwsdl

docker

ipfs

memento

memento-rfc