cc-notebooks

Various Jupyter notebooks about Common Crawl data

Jupyter Notebooks to Analyze Common Crawl Data

  • analyzing data using the columnar index
    • blocking of internet connections from and to the Islamic Republic of Iran during the November 2019 crawl: net-blocking-iran-cc-main-2019-47.ipynb
    • total number of captures 2013 – 2019, domain coverage and approximation of unique URLs for the .edu top-level domain: cc-main-2013-2019-metrics.ipynb
    • correlations between character sets and languages: correlation-language-charset.ipynb
  • analyze the Common Crawl webgraph data sets and interactively explore the graphs: cc-webgraph-statistics
  • how to explore WARC files running a notebook on AWS EMR
  • truncated record payloads in WARC files:
    • verify that all truncated payloads are annotated by the WARC-Truncated header
    • which MIME types are mostly affected by truncation? Aggregations using the columnar index.
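
The truncation aggregation mentioned in the last item can be illustrated with a minimal sketch in plain Python. It is not taken from the notebooks: the rows are made-up sample data, the helper `truncation_by_mime` is hypothetical, and the column names `content_mime_detected` and `content_truncated` are assumptions about how the columnar index exposes the detected MIME type and the truncation reason.

```python
from collections import Counter

# Made-up sample rows for illustration only; real rows would come from a
# query against the columnar index. Column names are assumptions.
SAMPLE_ROWS = [
    {"content_mime_detected": "application/pdf", "content_truncated": "length"},
    {"content_mime_detected": "text/html", "content_truncated": None},
    {"content_mime_detected": "application/pdf", "content_truncated": "length"},
    {"content_mime_detected": "video/mp4", "content_truncated": "length"},
    {"content_mime_detected": "text/html", "content_truncated": "time"},
    {"content_mime_detected": "text/html", "content_truncated": None},
]


def truncation_by_mime(rows):
    """Return {mime: (truncated_count, total_count)} for the given rows."""
    totals = Counter()
    truncated = Counter()
    for row in rows:
        mime = row["content_mime_detected"]
        totals[mime] += 1
        # A non-empty truncation reason marks the payload as truncated.
        if row["content_truncated"]:
            truncated[mime] += 1
    return {mime: (truncated[mime], totals[mime]) for mime in totals}


if __name__ == "__main__":
    stats = truncation_by_mime(SAMPLE_ROWS)
    # List MIME types by share of truncated captures, highest first.
    for mime, (n_trunc, n_total) in sorted(
        stats.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True
    ):
        print(f"{mime}: {n_trunc}/{n_total} truncated")
```

On real data the same grouping would more likely be expressed as a SQL aggregation over the index table; the loop above only shows the shape of the computation.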