
Common Crawl quick hack examples

some quick hacks using the Common Crawl dataset

links in metadata is an example of using Hadoop streaming with a Python script to extract links from the metadata set
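As a rough illustration of what such a streaming mapper might look like, here is a minimal sketch. It is not the script from the example itself. It assumes each input line is a source url, a tab, and a JSON object whose `content.links` list holds entries with an `href` key; that record layout, and the names `extract_links` and `main`, are assumptions made for the sketch.

```python
import json
import sys


def extract_links(line):
    """Yield (source_url, target_url) pairs from one 'url<TAB>json' line.

    The record layout (a "content" object holding a "links" list of
    {"href": ...} entries) is an assumption made for this sketch.
    """
    try:
        source_url, raw_json = line.rstrip("\n").split("\t", 1)
        metadata = json.loads(raw_json)
    except ValueError:
        return  # no tab separator or invalid JSON: skip the record

    if not isinstance(metadata, dict):
        return
    content = metadata.get("content")
    if not isinstance(content, dict):
        return

    for link in content.get("links") or []:
        href = link.get("href") if isinstance(link, dict) else None
        if href:
            yield source_url, href


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Streaming entry point: read records, write 'source<TAB>target' lines."""
    for line in stdin:
        for source_url, href in extract_links(line):
            stdout.write(f"{source_url}\t{href}\n")
```

Calling `main()` at the bottom of the script would turn it into a mapper that Hadoop streaming can feed over stdin; the same function can be tried locally by piping a sample file into it.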

finding names gives a quick overview of the textdata set and presents a simple NLTK app for extracting noun phrases (again using Python streaming)
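For a feel of the noun phrase step, here is a small sketch using NLTK's `RegexpParser`. It is not the app from the example: the chunk grammar and the `noun_phrases` helper are assumptions, and the input is already part-of-speech tagged so the snippet needs no tokenizer or tagger models (a real job would tag raw text first, e.g. with `nltk.pos_tag`).

```python
import nltk

# Assumed chunk grammar: optional determiner, any adjectives, one or more nouns.
NP_GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"


def noun_phrases(tagged_tokens, grammar=NP_GRAMMAR):
    """Return the noun phrases found in a list of (word, pos_tag) pairs."""
    parser = nltk.RegexpParser(grammar)
    tree = parser.parse(tagged_tokens)
    phrases = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        # Each leaf is a (word, tag) pair; join the words back into a phrase.
        phrases.append(" ".join(word for word, _tag in subtree.leaves()))
    return phrases
```

Inside a streaming mapper, each phrase could be written out with a count of 1 so that a reducer can sum occurrences.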

url status codes shows how to run a Java MapReduce job over the metadata set to extract urls and the status codes the crawler received when crawling them