common-crawl topic
common-crawl repositories
xurlfind3r
532
Stars
63
Forks
A command-line (CLI) utility for passive URL discovery. It identifies known URLs of given domains by querying a range of curated passive online sources.
news-crawl
253
Stars
31
Forks
News crawling with StormCrawler - stores content as WARC
cc-pyspark
383
Stars
84
Forks
Process Common Crawl data with Python and Spark
comcrawl
215
Stars
36
Forks
A Python utility for downloading Common Crawl data
ungoliant
152
Stars
14
Forks
The pipeline for the OSCAR corpus
cc-crawl-statistics
118
Stars
8
Forks
Statistics of Common Crawl monthly archives mined from URL index files
goclassy
85
Stars
6
Forks
An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.
gerpt2
18
Stars
0
Forks
German small and large versions of GPT2.
cc-notebooks
40
Stars
8
Forks
Various Jupyter notebooks about Common Crawl data
cc-webgraph
66
Stars
4
Forks
Tools to construct and process webgraphs from Common Crawl data