The viewer is sometimes missing files that are available on IA

Open JustAnotherArchivist opened this issue 8 years ago • 3 comments

Recently, I've noticed that the viewer sometimes misses files that are available on IA. Some examples (compare with the IA item listing):

  • http://archive.fart.website/archivebot/viewer/item/archiveteam_archivebot_go_20170820010001
  • http://archive.fart.website/archivebot/viewer/item/falconk_archivebot_d2mods_info_20170803
  • http://archive.fart.website/archivebot/viewer/item/falconk_archivebot_datacrystal_romhacking_net_20170802

Looking at the viewer source code (for the first time, so take this with a grain of salt), I think I saw a flaw: it only refreshes the file list for items which have a public_date (= date of last modification?) within the past three days. If the IA API requests fail for longer than that, for example due to network or power outages at IA (as has happened repeatedly in the past weeks), the viewer would miss the file list updates made during the outage. I think it should base that filtering on something like the last successful retrieval date instead. I have no idea whether that's the reason for the specific cases mentioned above, though.
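To illustrate the suggestion, here is a minimal sketch of choosing the refresh cutoff from the last successful retrieval date instead of a fixed three-day window. It is not the viewer's actual code: the function name `items_to_refresh`, the `(identifier, public_date)` tuple shape, and the `last_success` parameter are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def items_to_refresh(items, last_success, now, window=timedelta(days=3)):
    """Return identifiers of items whose file list should be re-fetched.

    items: iterable of (identifier, public_date) tuples (hypothetical shape).
    last_success: datetime of the last successful retrieval, or None if
        no successful retrieval has been recorded yet.
    """
    # Default cutoff: the fixed three-day window described in the issue.
    cutoff = now - window
    # If the last successful retrieval is older than that window (e.g. after
    # an outage), extend the cutoff back to it so no updates are skipped.
    if last_success is not None:
        cutoff = min(cutoff, last_success)
    return [ident for ident, public_date in items if public_date >= cutoff]
```

With this approach, a two-week outage simply widens the window to two weeks on the next successful run, while normal operation keeps the three-day overlap.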

JustAnotherArchivist, Aug 28 '17 22:08

I guess looking at it now, it should be ordering by oai_updatedate. Keeping track of the last successful retrieval date is a good idea.

Edit: I tried running it on a new database and it turns out IA has a limitation on pagination (oops). It supports returning everything at once however. There's likely more incorrect use of their API.

chfoo, Aug 30 '17 02:08

Just realised that I never mentioned this here:

As a workaround, I've created a git repo which lists all files in the ArchiveBot IA collection. You can search for e.g. domain names or the first five characters of a job ID to find all relevant files.

https://github.com/JustAnotherArchivist/archivebot-archives

JustAnotherArchivist, Jan 04 '18 01:01

You can't sort by oai_updatedate, by the way, but you can filter by it: https://github.com/jjjake/internetarchive/issues/226
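As a rough sketch of what filtering (rather than sorting) by oai_updatedate could look like, the helper below only builds a search query string and performs no network request. The function name `updated_since_query`, the collection name `archivebot`, and the Lucene-style range syntax are assumptions on my part and should be checked against IA's search documentation.

```python
from datetime import datetime, timezone


def updated_since_query(collection, since, until):
    """Build a query string restricting results to an oai_updatedate range.

    Assumes Lucene-style range syntax; both datetimes are expected in UTC.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return "collection:{} AND oai_updatedate:[{} TO {}]".format(
        collection, since.strftime(fmt), until.strftime(fmt)
    )


# Hypothetical usage: everything updated between two retrieval runs.
query = updated_since_query(
    "archivebot",
    datetime(2017, 8, 1, tzinfo=timezone.utc),
    datetime(2017, 8, 28, tzinfo=timezone.utc),
)
```

Using the last successful retrieval date as `since` would combine this filter with the idea from the first comment.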

Also, although most people who were using it have noticed by now, for the record: my archivebot-archives repo has been on hold since November because it was slowly getting unmanageably large (one file per item in the collection). I had plans to revive it with a different backend, but I'll probably just try to work on the official viewer instead.

JustAnotherArchivist, Apr 26 '19 22:04