ArchiveBot
ArchiveBot, an IRC bot for archiving websites
`^https://discord\.com/assets/` matches hundreds of URLs that waste time and resources, because they probably get captured on cursory crawls anyway.
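An ignore pattern like the one above is a plain regular expression anchored at the start of the URL; a minimal sketch of how such a pattern would be applied (the `should_ignore` helper is hypothetical, not ArchiveBot's actual ignore machinery):

```python
import re

# The Discord asset pattern quoted in the issue above.
IGNORE_PATTERN = re.compile(r"^https://discord\.com/assets/")

def should_ignore(url: str) -> bool:
    """Return True if the URL matches the asset-ignore pattern."""
    return IGNORE_PATTERN.match(url) is not None

should_ignore("https://discord.com/assets/logo.svg")  # matches, would be skipped
should_ignore("https://discord.com/channels/1234")    # does not match
```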
As I understand it, jobs are currently started without concurrency or delay settings, and those are later set by the settings monitor. This means that a job always starts at...
travis-ci.org was discontinued earlier this month, so the tests no longer run. Rather than switching to another proprietary platform (like travis-ci.com or GitHub Actions) that will be changing again in...
When ArchiveBot hits a .swf file, it should decompile it and search for URLs in the ActionScript. This may be tricky to implement, but it would fix most problems that...
`DownloadUrlFile` does not verify that the server responded with an HTTP 200. This morning, there was an issue, which led to lots of errors and occasional 502s. The latter were...
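The missing check amounts to refusing to treat a non-200 response body as a URL list. A minimal sketch, assuming a hypothetical `parse_url_file` helper rather than ArchiveBot's actual task code:

```python
def parse_url_file(status: int, body: str) -> list[str]:
    """Parse a downloaded URL file, but only if the fetch actually succeeded.

    Without the status check, an error page (e.g. a 502 from a proxy) would
    be parsed as a list of URLs, producing the kind of garbage described above.
    """
    if status != 200:
        raise ValueError(f"URL file fetch failed with HTTP {status}")
    return [line.strip() for line in body.splitlines() if line.strip()]
```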
I've noticed that sometimes, URLs are not retried properly. The most recent example is job 172fw8g4egszevx4i56uu06cm. One of about 1700 such URLs on that job: ``` $ zstdgrep -F 'https://usc.gov.mm/?q=node/66'...
>NOTE: I am going to do it myself but since I forgot to bring my laptop today this is a reminder to do it later today when I can work...
If it detects that a site is a MediaWiki wiki, it should go to Special:AllPages. Example: https://apple.fandom.com/wiki/Special:AllPages
While global deduplication for everything in ArchiveBot is not feasible, we should consider adding something for certain URLs that waste a lot of disk space, probably shouldn't be ignored entirely,...
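One shape such a feature could take is content-hash deduplication applied only to URLs matching an allow-list of wasteful patterns. A minimal sketch, with hypothetical names and patterns, not a proposal for the actual storage backend:

```python
import hashlib
import re

# Hypothetical: only these URL patterns are eligible for deduplication.
DEDUP_PATTERNS = [re.compile(r"^https://discord\.com/assets/")]

class SelectiveDeduper:
    """Flag repeat payloads, but only for URLs matching the allow-list."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_duplicate(self, url: str, body: bytes) -> bool:
        if not any(p.match(url) for p in DEDUP_PATTERNS):
            return False  # everything else is archived unconditionally
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False
```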
Cf. #490 and #491. Environment variables are not modified for the preflight test, but inside the pipeline, only the variables listed in `wpull_env` in `pipeline.py` are passed to wpull....
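The pipeline-side behaviour described here is an allow-list filter over the process environment. A sketch of that idea, assuming a hypothetical `WPULL_ENV` list standing in for the real `wpull_env` value in `pipeline.py`:

```python
# Hypothetical allow-list mirroring the role of `wpull_env` in pipeline.py:
# only these variables would reach the wpull subprocess.
WPULL_ENV = ["HOME", "PATH", "LANG", "TMPDIR"]

def build_wpull_env(base_env: dict) -> dict:
    """Filter the environment down to the allow-listed variables."""
    return {k: v for k, v in base_env.items() if k in WPULL_ENV}
```

The preflight test, by contrast, would run with `base_env` untouched, which is the discrepancy the issue points out.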