common-crawl topic
xurlfind3r
A command-line interface (CLI) utility for passive URL discovery. It is designed to efficiently identify known URLs of given domains by querying a multitude of curated online passive sources.
news-crawl
News crawling with StormCrawler - stores content as WARC
cc-pyspark
Process Common Crawl data with Python and Spark
comcrawl
A Python utility for downloading Common Crawl data
ungoliant
🕷️ The pipeline for the OSCAR corpus
cc-crawl-statistics
Statistics of Common Crawl monthly archives mined from URL index files
goclassy
An asynchronous, concurrent pipeline for classifying Common Crawl data, based on fastText's pipeline.
gerpt2
German small and large versions of GPT2.
cc-notebooks
Various Jupyter notebooks about Common Crawl data
cc-webgraph
Tools to construct and process webgraphs from Common Crawl data