ArchiveBot
The viewer is sometimes missing files that are available on IA
Recently, I've noticed that the viewer sometimes misses files that are available on IA. Some examples (compare with the IA item listing):
- http://archive.fart.website/archivebot/viewer/item/archiveteam_archivebot_go_20170820010001
- http://archive.fart.website/archivebot/viewer/item/falconk_archivebot_d2mods_info_20170803
- http://archive.fart.website/archivebot/viewer/item/falconk_archivebot_datacrystal_romhacking_net_20170802
Looking at the viewer source code (for the first time, so take this with a grain of salt), I think I found a flaw: it only refreshes the file list for items which have a public_date (= date of last modification?) within the past three days. If the IA API requests fail for an extended period of time, for example due to network or power outages at IA (as was the case repeatedly in the past weeks), it would miss the file list updates for that period. I think it should base that filtering on something like the last successful retrieval date instead. I have no idea whether that's the reason for the specific cases mentioned above though.
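To make the "last successful retrieval date" idea concrete, here is a minimal sketch. It is not taken from the viewer's code: `RefreshState`, `run_refresh`, and the `fetch_updated_since` callback are hypothetical names, and the one-hour overlap is an arbitrary safety margin I picked for illustration.

```python
from datetime import datetime, timedelta
from typing import Callable, List, Optional


class RefreshState:
    """Hypothetical holder for the time of the last refresh that succeeded."""

    def __init__(self) -> None:
        self.last_success: Optional[datetime] = None


def run_refresh(
    state: RefreshState,
    fetch_updated_since: Callable[[Optional[datetime]], List[str]],
    now: datetime,
    overlap: timedelta = timedelta(hours=1),
) -> List[str]:
    """Fetch identifiers changed since the last successful run.

    fetch_updated_since is a stand-in for whatever talks to the IA API.
    """
    # None means "no successful run yet", so the callback should return everything.
    since = None if state.last_success is None else state.last_success - overlap
    try:
        identifiers = fetch_updated_since(since)
    except OSError:
        # The request failed (e.g. an outage). Keep the old watermark so
        # the next run still covers the period that was missed.
        return []
    # Only advance the watermark after a successful retrieval.
    state.last_success = now
    return identifiers
```

The point of the design is that the window grows automatically during an outage instead of being fixed at three days, so updates made while requests were failing get picked up on the next successful run.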
Looking at it now, I guess it should be ordering by oai_updatedate. Keeping track of the last successful retrieval date is a good idea.
Edit: I tried running it on a new database and it turns out IA has a limitation on pagination (oops). It does support returning everything at once, however. There are likely more incorrect uses of their API.
I just realised that I never mentioned this here:
As a workaround, I've created a git repo which lists all files in the ArchiveBot IA collection. You can search for e.g. domain names or the first five characters of a job ID to find all relevant files.
https://github.com/JustAnotherArchivist/archivebot-archives
You can't sort by oai_updatedate, by the way, but you can filter by it: https://github.com/jjjake/internetarchive/issues/226
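As a rough illustration of filtering (rather than sorting) by oai_updatedate, the query string could be built like this. This is only a sketch: the helper name `updated_between_query` is made up, the collection identifier `archivebot` is my assumption, and I'm assuming the search accepts a Lucene-style range with plain ISO dates, which should be checked against the IA search documentation.

```python
from datetime import date


def updated_between_query(collection: str, start: date, end: date) -> str:
    """Build a search query restricted to an oai_updatedate range.

    Assumes Lucene-style range syntax: field:[START TO END].
    """
    return (
        f"collection:{collection} "
        f"AND oai_updatedate:[{start.isoformat()} TO {end.isoformat()}]"
    )


# Example: items in the (assumed) archivebot collection updated in early August 2017.
query = updated_between_query("archivebot", date(2017, 8, 1), date(2017, 8, 4))
```

The resulting string would then be passed as the query to whatever search client is in use.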
Also, although most people who were using it have noticed by now, for the record: my archivebot-archives repo has been on hold since November because it was slowly getting unmanageably large (one file per item in the collection). I had plans to revive it with a different backend, but I'll probably just try to work on the official viewer instead.