ArchiveBot
ArchiveBot, an IRC bot for archiving websites
Crawls of big websites can be slowed down by the sheer amount of tracking and analytics code on every page. It would be nice to have an optional...
This is a _partial_ (!) alphabetical list of the kinds of ads and tracking code one might find on even a simple, not-very-highly-trafficked LiveJournal community. Would be great to have...
If you request a job with a URL list but the URL provided for that list is itself invalid, the job will get stuck in an infinite loop retrying to...
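A bounded retry is one way to avoid the loop described above. The following is a minimal sketch, not ArchiveBot's actual code: `fetch_url_list`, `UrlListError`, the `fetch` callable, and `max_attempts` are all hypothetical names chosen for illustration.

```python
class UrlListError(Exception):
    """Raised when the URL list cannot be fetched after all attempts."""


def fetch_url_list(fetch, url, max_attempts=3):
    """Try to fetch a URL list a fixed number of times, then give up.

    `fetch` is a hypothetical callable that returns the list body
    or raises an exception on failure.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return fetch(url)
        except Exception as exc:  # any failure counts as one attempt
            last_error = exc
    # Surface the failure instead of retrying forever.
    raise UrlListError(
        f"giving up on {url!r} after {max_attempts} attempts"
    ) from last_error
```

With a cap like this, the job can be failed or aborted with a clear error once the list URL has proven unusable.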
Two weeks ago, I was playing around with alternatives for transfer.sh, which has been very unstable lately, and launched job 5ivs1btcd49uxizyj3nll0azt. Turns out that the pipeline did not like this at all:...
For example, if a job downloads the URL https://media2.wnyc.org/i/%s/%s/%s/%s/1/toddheadshot.jpg, you won't be able to click the link to open the URL in the browser or access any of the right-click...
Search for the ones I added to 864byfg6zp4pxi6pxr4a9p78k
After around 500 WARCs from a single job have been uploaded to a single bucket, we often see 403 errors when uploading subsequent WARCs. Have a selection of alternate bucket names to solve this,...
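The alternate-bucket idea could be sketched as a function that derives a bucket name from a WARC's position in the job. This is only an illustration: `bucket_for`, the `_NN` suffix scheme, and the limit of 400 WARCs per bucket (picked to stay under the roughly 500 reported above) are assumptions, not existing ArchiveBot behaviour.

```python
def bucket_for(base, warc_index, per_bucket=400):
    """Return the bucket name for the WARC at 0-based position `warc_index`.

    The first `per_bucket` WARCs go to `base`, the next batch to
    `base_01`, then `base_02`, and so on (hypothetical naming scheme).
    """
    if per_bucket < 1:
        raise ValueError("per_bucket must be at least 1")
    batch = warc_index // per_bucket
    return base if batch == 0 else f"{base}_{batch:02d}"
```

Because the name depends only on the index, a restarted uploader would pick the same bucket for the same WARC.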
Sometimes the uploader will keep trying and failing to upload a single WARC. When this happens, it retries the same WARC constantly and does not advance to others, even if...
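One way to keep a single bad WARC from blocking the rest is to move it to the back of the queue after a failure and set it aside after a few attempts. The sketch below assumes a hypothetical `upload` callable that raises on failure; `upload_all` and `max_failures` are illustrative names, not part of the real uploader.

```python
from collections import deque


def upload_all(warcs, upload, max_failures=3):
    """Upload every WARC, requeueing failures instead of blocking on them.

    Returns (uploaded, given_up): the names that succeeded, and the
    names that failed `max_failures` times and were set aside.
    """
    queue = deque((name, 0) for name in warcs)
    uploaded, given_up = [], []
    while queue:
        name, failures = queue.popleft()
        try:
            upload(name)
            uploaded.append(name)
        except Exception:
            failures += 1
            if failures >= max_failures:
                given_up.append(name)  # leave for manual inspection
            else:
                queue.append((name, failures))  # retry after the others
    return uploaded, given_up
```

A real uploader would probably also want a delay between retries, which is omitted here for brevity.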
Recently, I've noticed that the viewer sometimes misses files that are available on IA. Some examples (compare with the IA item listing):
* http://archive.fart.website/archivebot/viewer/item/archiveteam_archivebot_go_20170820010001
* http://archive.fart.website/archivebot/viewer/item/falconk_archivebot_d2mods_info_20170803
* http://archive.fart.website/archivebot/viewer/item/falconk_archivebot_datacrystal_romhacking_net_20170802

Looking at...