svalbard
Internet Archive Metadata Exporter
Need a toolchain for monitoring an IA collection, extracting the .CDX file manifests for each item on an ongoing basis, and converting the URL lists to NDJSON.
Here's my first stab: https://gist.github.com/maxogden/89818ba6f14ab95b9d6051fa14deeb74. It had bugs with the IA API that I didn't know how to fix (I was getting duplicates back from the advancedsearch endpoint).
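For the last step (CDX to NDJSON), here is a rough sketch rather than a finished tool. It assumes each CDX file starts with a header line such as " CDX N b a m s k r M S V g" naming the fields, and that records are space-delimited. The function name cdx_to_ndjson is made up for illustration, and the output keys are just the field letters from the header.

```python
import json


def cdx_to_ndjson(lines):
    """Yield one JSON string per CDX record, keyed by the header's field letters.

    Assumes the first non-blank line is a header like " CDX N b a m s k r M S V g"
    and that every following line is a space-delimited record.
    """
    fields = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if fields is None:
            parts = line.split()
            if not parts or parts[0] != "CDX":
                raise ValueError("expected a CDX header line first")
            fields = parts[1:]  # e.g. ["N", "b", "a", ...]
            continue
        values = line.split(" ")
        # zip() stops at the shorter sequence, so short records simply omit trailing fields
        yield json.dumps(dict(zip(fields, values)))


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names): python cdx2ndjson.py < item.cdx > item.ndjson
    for record in cdx_to_ndjson(sys.stdin):
        print(record)
```

Keeping the raw field letters as keys avoids guessing at names; a later step could rename them once the fields in use are known.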
Found out that the ia tool (written in Python) works much better:
curl -LO https://archive.org/download/ia-pex/ia
chmod +x ia
ia --insecure search 'collection:EndOfTerm2016WebCrawls' --itemlist > eotitems.txt
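From the item list, one possible next step is to find each item's CDX files. This is only a sketch under a few assumptions: that eotitems.txt has one identifier per line, that https://archive.org/metadata/<identifier> returns JSON with a "files" list whose entries have a "name" key, and that CDX manifests end in .cdx or .cdx.gz. The helper names (read_itemlist, cdx_urls, fetch_metadata) are made up for illustration. Deduplicating the list on read also guards against repeated identifiers like the ones the advancedsearch endpoint returned.

```python
import json
import urllib.request
from urllib.parse import quote


def read_itemlist(path):
    """Return identifiers from an item list file, dropping blanks and repeats."""
    seen = set()
    items = []
    with open(path) as f:
        for line in f:
            ident = line.strip()
            if ident and ident not in seen:
                seen.add(ident)
                items.append(ident)
    return items


def cdx_urls(identifier, metadata):
    """Build download URLs for the CDX files listed in an item's metadata dict.

    Assumes metadata looks like {"files": [{"name": "..."}, ...]}.
    """
    urls = []
    for entry in metadata.get("files", []):
        name = entry.get("name", "")
        if name.lower().endswith((".cdx", ".cdx.gz")):
            urls.append(
                f"https://archive.org/download/{quote(identifier)}/{quote(name)}"
            )
    return urls


def fetch_metadata(identifier):
    """Fetch an item's metadata JSON over the network (assumed endpoint shape)."""
    url = f"https://archive.org/metadata/{quote(identifier)}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


if __name__ == "__main__":
    for ident in read_itemlist("eotitems.txt"):
        for url in cdx_urls(ident, fetch_metadata(ident)):
            print(url)
```

For ongoing monitoring, the same script could be rerun on a schedule and its output compared against the previous run to pick up new items.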
See also: https://github.com/jjjake/iamine https://archive.org/details/ia_census_201604