cc-index-table
Add IP column to Athena table for reverse IP search with `WARC-IP-Address` data
Historical hostname -> IP and IP -> hostname (reverse IP) datasets are currently quite hard to come by (see https://opendata.stackexchange.com/questions/1951/dataset-of-domain-names). The only really convenient options are websites such as https://viewdns.info/reverseip/, which are expensive and have undocumented methodology.
Would it be possible to add an IP column to the Athena table that tracks `WARC-IP-Address`? If we had that, it would be trivial for someone to export that data from Common Crawl at relatively low cost and make it available for all to use, for example as a CSV file hosted on GitHub.
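To illustrate what such an export could look like, here is a rough sketch that builds the Athena query. It assumes the new column is named `warc_ip_address`, which is a made-up name since no such column exists today, and that the table is registered as `ccindex.ccindex` as in the usual setup instructions. The helper `build_reverse_ip_query` and the crawl ID are only for illustration.

```python
import re


def build_reverse_ip_query(crawl: str, ip_column: str = "warc_ip_address") -> str:
    """Build an Athena SQL query listing distinct (hostname, IP) pairs for one crawl.

    NOTE: `ip_column` is hypothetical. The index table has no IP column today;
    this only shows what the export could look like if one were added.
    """
    # Only accept well-formed crawl IDs, since the value is interpolated into SQL.
    if not re.fullmatch(r"CC-MAIN-\d{4}-\d{2}", crawl):
        raise ValueError(f"unexpected crawl ID: {crawl!r}")
    # Same restriction for the (hypothetical) column name.
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", ip_column):
        raise ValueError(f"unexpected column name: {ip_column!r}")
    return (
        f"SELECT DISTINCT url_host_name, {ip_column}\n"
        'FROM "ccindex"."ccindex"\n'
        # `crawl` and `subset` are partition columns, so filtering on them
        # limits how much data Athena has to scan.
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        f"  AND {ip_column} IS NOT NULL"
    )


if __name__ == "__main__":
    # Example crawl ID; any existing crawl would do.
    print(build_reverse_ip_query("CC-MAIN-2023-50"))
```

The result of a query like this, run once per crawl, is exactly the hostname/IP CSV described above.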
Such data can be of great value for OSINT purposes, e.g. I needed it in this project: https://cirosantilli.com/cia-2010-covert-communication-websites
There is apparently a tool made for this: https://github.com/CAIDA/commoncrawl-host-ip-mapper but I don't think it can run quickly or cheaply. The tabular approach would really be ideal here.