svalbard icon indicating copy to clipboard operation
svalbard copied to clipboard

Internet Archive Metadata Exporter

Open max-mapper opened this issue 8 years ago • 1 comments

Need a toolchain for monitoring a IA collection and extracting the .CDX file manifests for each item on an ongoing basis, and convert the URL lists to NDJSON.

Here's my first stab, https://gist.github.com/maxogden/89818ba6f14ab95b9d6051fa14deeb74 but had bugs with the IA API that I didn't know how to fix (was getting duplicates back from the advancedsearch endpoint).

Found out the ia tool in python works much better:

curl -LO https://archive.org/download/ia-pex/ia
chmod +x ia
ia --insecure search 'collection:EndOfTerm2016WebCrawls' --itemlist > eotitems.txt`

max-mapper avatar Feb 15 '17 23:02 max-mapper

See also: https://github.com/jjjake/iamine https://archive.org/details/ia_census_201604

bnewbold avatar Sep 21 '17 04:09 bnewbold