commoncrawl topic

List commoncrawl repositories

CommonCrawlDocumentDownload

58
Stars
20
Forks

A small tool that uses the Common Crawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks such as Apache POI and Apache Tika.

commonCrawlParser

30
Stars
11
Forks

Simple multithreaded tool to extract domain-related data from commoncrawl.org

xurlfind3r

543
Stars
61
Forks

A command-line interface (CLI) based passive URL discovery utility. It is designed to efficiently identify known URLs of given domains by querying multiple curated online passive sources.

news-please

2.0k
Stars
405
Forks

news-please - an integrated web crawler and information extractor for news that just works

news-crawl

253
Stars
31
Forks

News crawling with StormCrawler - stores content as WARC

cc-pyspark

383
Stars
84
Forks

Process Common Crawl data with Python and Spark

paskto

151
Stars
43
Forks

Paskto - Passive Web Scanner

cdx_toolkit

153
Stars
29
Forks

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

comcrawl

215
Stars
36
Forks

A Python utility for downloading Common Crawl data

cc-index-table

95
Stars
9
Forks

Index Common Crawl archives in tabular format