common-crawl topic

List common-crawl repositories

xurlfind3r

Stars: 532, Forks: 63

A command-line (CLI) utility for passive URL discovery. It identifies known URLs of given domains by querying multiple curated passive online sources.

news-crawl

Stars: 253, Forks: 31

News crawling with StormCrawler; stores content as WARC

cc-pyspark

Stars: 383, Forks: 84

Process Common Crawl data with Python and Spark

comcrawl

Stars: 215, Forks: 36

A Python utility for downloading Common Crawl data

ungoliant

Stars: 152, Forks: 14

🕷️ The pipeline for the OSCAR corpus

cc-crawl-statistics

Stars: 118, Forks: 8

Statistics of Common Crawl monthly archives mined from URL index files

goclassy

Stars: 85, Forks: 6

An asynchronous, concurrent pipeline for classifying Common Crawl data, based on fastText's pipeline.

gerpt2

Stars: 18, Forks: 0

Small and large German versions of GPT-2.

cc-notebooks

Stars: 40, Forks: 8

Various Jupyter notebooks about Common Crawl data

cc-webgraph

Stars: 66, Forks: 4

Tools to construct and process webgraphs from Common Crawl data