cc-mrjob
Demonstration of using Python to process the Common Crawl dataset with the mrjob framework
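To illustrate the kind of job such a demonstration typically expresses, here is a minimal standard-library sketch of the map/reduce counting pattern. It does not use mrjob or any WARC library. The names `mapper`, `reducer`, `run_local`, and `SAMPLE_RECORDS` are hypothetical, and the sample records are made up for illustration rather than taken from Common Crawl.

```python
from collections import defaultdict

# Hypothetical sample input: (URL, Content-Type) pairs standing in for the
# metadata of WARC response records. This is NOT real Common Crawl data.
SAMPLE_RECORDS = [
    ("http://example.com/", "text/html"),
    ("http://example.com/logo.png", "image/png"),
    ("http://example.org/", "text/html"),
]


def mapper(record):
    """Emit (content_type, 1) for a single record."""
    _url, content_type = record
    yield content_type, 1


def reducer(key, values):
    """Sum all counts emitted for one key."""
    yield key, sum(values)


def run_local(records):
    """Run the map, shuffle, and reduce phases in-process.

    A framework such as mrjob would distribute these phases across
    workers; this driver only mimics them locally (assumption: the
    whole input fits in memory).
    """
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)  # shuffle: group values by key

    result = {}
    for key in sorted(grouped):
        for out_key, total in reducer(key, grouped[key]):
            result[out_key] = total
    return result


if __name__ == "__main__":
    print(run_local(SAMPLE_RECORDS))
```

Keeping the mapper and reducer as plain generators makes the counting logic testable without a cluster; a real job would additionally need to read WARC records from the crawl archives.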
- based on an updated version of the warc library supporting Python 3, see https://github.com/commoncrawl/warc/
- optionally replaced by a compatibility wrapper around [warcio](https://pypi.org/project/warcio/) or [fastwarc](https://resiliparse.chatnoir.eu/en/latest/api/fastwarc.html)

Note:
- needs more testing...
I had some problems running on AWS EMR with the default mrjob.conf. In case anyone else is running into similar issues, I found that I needed to make two minor...
This library does not run on Python 3.x:
- along with minor syntax fixes
- the [warc module](https://github.com/erroneousboat/warc3) dependency needs a change, cf. internetarchive/warc#25, internetarchive/warc#26, [erroneousboat/warc3](https://github.com/erroneousboat/warc3)
- work-around as this is unlikely to be implemented in the [warc module](https://github.com/internetarchive/warc)
- see also commoncrawl/cc-pyspark#3
Sunset the project `cc-mrjob` because
- it builds upon Yelp's [mrjob](https://github.com/Yelp/mrjob/) which isn't maintained anymore
  - the latest release v0.7.4 happened 5 years ago
  - since then
- there are...