CommonCrawler
Preparing CommonCrawl .wet files via IPFS
Summary
Common Crawl is easily accessible via AWS S3. However, I'm interested in creating an IPFS-based distribution of Common Crawl, so that we can self-host the data and run our own P2P network for seeding and distributing it.
Requirements
- A website with an index that lists all the .wet files. I can style it if you need help.
- An easy-to-use JSON REST API that you can cURL data from.
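As a rough illustration of how the two requirements could fit together, here is a minimal Python sketch: it turns the lines of a Common Crawl WET path listing into index records and answers JSON requests over that index. Everything named here is an assumption for illustration only: the `/api/wet` routes, the record fields, the `build_index`, `handle_request`, and `serve` helpers, and the optional `cids` mapping (path to IPFS CID, to be filled in once files are actually added to IPFS). The example path layout follows the usual `crawl-data/<crawl>/segments/<segment>/wet/<file>` pattern and should be checked against the real listing.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Public HTTPS prefix for Common Crawl data (the same objects are on S3).
CC_PREFIX = "https://data.commoncrawl.org/"


def build_index(wet_paths, cids=None):
    """Turn lines of a WET path listing into a list of index records.

    `cids` is a hypothetical mapping of path -> IPFS CID; entries are
    None until the corresponding file has been added to IPFS.
    """
    cids = cids or {}
    records = []
    for line in wet_paths:
        path = line.strip()
        if not path:
            continue  # skip blank lines in the listing
        parts = path.split("/")
        records.append({
            "path": path,
            "filename": parts[-1],
            # Assumed layout: crawl-data/<crawl>/segments/<segment>/wet/<file>
            "crawl": parts[1] if len(parts) > 1 else None,
            "source_url": CC_PREFIX + path,
            "cid": cids.get(path),
        })
    return records


def handle_request(index, route):
    """Map a request path to (HTTP status, JSON body). Routes are hypothetical."""
    route = route.rstrip("/")
    if route == "/api/wet":
        return 200, json.dumps({"count": len(index), "files": index})
    if route.startswith("/api/wet/"):
        name = route[len("/api/wet/"):]
        for record in index:
            if record["filename"] == name:
                return 200, json.dumps(record)
    return 404, json.dumps({"error": "not found"})


def make_handler(index):
    """Build an http.server handler class bound to the given index."""
    class WetHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = handle_request(index, self.path)
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)
    return WetHandler


def serve(index, port=8000):
    """Serve the index on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(index)).serve_forever()
```

With an index built from a listing file and `serve(index)` running, `curl http://127.0.0.1:8000/api/wet` would return the full list, and `curl http://127.0.0.1:8000/api/wet/<filename>` a single record. The same records could also be rendered into the static website index.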
Payment
Payment is TBD and not under consideration in the near term. The seed network will be hosted under %eaxops infrastructure.