ArchiveBot
ArchiveBot, an IRC bot for archiving websites
`^https://discord\.com/assets/` matches hundreds of URLs that waste time and resources, because they probably get captured on cursory crawls anyway.
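An ignore pattern like the one above is a plain regular expression anchored at the start of the URL; a minimal sketch of how such a pattern would be applied (the `should_ignore` helper is hypothetical, not ArchiveBot's actual ignore machinery):

```python
import re

# The Discord asset pattern quoted in the issue above.
IGNORE_PATTERN = re.compile(r"^https://discord\.com/assets/")

def should_ignore(url: str) -> bool:
    """Return True if the URL matches the asset-ignore pattern."""
    return IGNORE_PATTERN.match(url) is not None

should_ignore("https://discord.com/assets/logo.svg")  # matches, would be skipped
should_ignore("https://discord.com/channels/1234")    # does not match
```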
As I understand it, jobs are currently started without concurrency or delay settings, and those are later set by the settings monitor. This means that a job always starts at...
travis-ci.org was discontinued earlier this month, so the tests no longer run. Rather than switching to another proprietary platform (like travis-ci.com or GitHub Actions) that will be changing again in...
When ArchiveBot hits a .swf file, it should decompile it and search for URLs in the ActionScript. This may be tricky to implement, but it would fix most problems that...
`DownloadUrlFile` does not verify that the server responded with an HTTP 200. This morning, there was an issue, which led to lots of errors and occasional 502s. The latter were...
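The missing check amounts to refusing to treat a non-200 response body as a URL list. A minimal sketch, assuming a hypothetical `parse_url_file` helper rather than ArchiveBot's actual task code:

```python
def parse_url_file(status: int, body: str) -> list[str]:
    """Parse a downloaded URL file, but only if the fetch actually succeeded.

    Without the status check, an error page (e.g. a 502 from a proxy) would
    be parsed as a list of URLs, producing the kind of garbage described above.
    """
    if status != 200:
        raise ValueError(f"URL file fetch failed with HTTP {status}")
    return [line.strip() for line in body.splitlines() if line.strip()]
```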
I've noticed that sometimes, URLs are not retried properly. The most recent example is job 172fw8g4egszevx4i56uu06cm. One of about 1700 such URLs on that job: ``` $ zstdgrep -F 'https://usc.gov.mm/?q=node/66'...
>NOTE: I am going to do it myself but since I forgot to bring my laptop today this is a reminder to do it later today when I can work...
If it detects that a site is a MediaWiki wiki, it should go to Special:AllPages. Example: https://apple.fandom.com/wiki/Special:AllPages
While global deduplication for everything in ArchiveBot is not feasible, we should consider adding something for certain URLs that waste a lot of disk space, probably shouldn't be ignored entirely,...
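One shape such a feature could take is content-hash deduplication applied only to URLs matching an allow-list of wasteful patterns. A minimal sketch, with hypothetical names and patterns, not a proposal for the actual storage backend:

```python
import hashlib
import re

# Hypothetical: only these URL patterns are eligible for deduplication.
DEDUP_PATTERNS = [re.compile(r"^https://discord\.com/assets/")]

class SelectiveDeduper:
    """Flag repeat payloads, but only for URLs matching the allow-list."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_duplicate(self, url: str, body: bytes) -> bool:
        if not any(p.match(url) for p in DEDUP_PATTERNS):
            return False  # everything else is archived unconditionally
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False
```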
Cf. #490 and #491. Environment variables are not modified for the preflight test, but inside the pipeline, only the variables listed in `wpull_env` in `pipeline.py` are passed to wpull....
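The pipeline-side behaviour described here is an allow-list filter over the process environment. A sketch of that idea, assuming a hypothetical `WPULL_ENV` list standing in for the real `wpull_env` value in `pipeline.py`:

```python
# Hypothetical allow-list mirroring the role of `wpull_env` in pipeline.py:
# only these variables would reach the wpull subprocess.
WPULL_ENV = ["HOME", "PATH", "LANG", "TMPDIR"]

def build_wpull_env(base_env: dict) -> dict:
    """Filter the environment down to the allow-listed variables."""
    return {k: v for k, v in base_env.items() if k in WPULL_ENV}
```

The preflight test, by contrast, would run with `base_env` untouched, which is the discrepancy the issue points out.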