archiveis Behaviour of the `/submit/` endpoint

The behaviour of the https://archive.md/submit/ endpoint has changed recently. It now returns a WIP page URL in the Refresh header (https://archive.md/wip/Z6uhm). That page shows the capture progress, and the client is expected to retry until the page is captured and the proper memento URL (https://archive.md/Z6uhm) is returned via Location. As a result, archiveis.capture() always returns the URL of the WIP page.

This can be fixed either by retrying until the proper URL is available (and somehow handling errors if it is not), or by just stripping /wip/ from the URL and hoping for the best.

>>> archive_url = archiveis.capture("https://example.com")
DEBUG:archiveis.api:Requesting https://archive.md/
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): archive.md:443
DEBUG:urllib3.connectionpool:https://archive.md:443 "GET / HTTP/1.1" 200 4997
DEBUG:archiveis.api:Unique identifier: QxbCURgTX9qqOlJsvO7Qnp6OpwoRYUx3YErVZz1eLx4aUht3+iuOB+6Ili4WD2Y2
DEBUG:archiveis.api:Requesting https://archive.md/submit/
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): archive.md:443
DEBUG:urllib3.connectionpool:https://archive.md:443 "POST /submit/ HTTP/1.1" 200 244
DEBUG:archiveis.api:Memento from Refresh header: https://archive.md/wip/Z6uhm
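
For illustration, the simpler of the two fixes (stripping /wip/) might look roughly like the sketch below. The helper name `strip_wip` is hypothetical and not part of archiveis, and the sketch assumes the WIP URL always has the form `https://<host>/wip/<id>`, as in the log above.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_wip(url: str) -> str:
    """Return the presumed memento URL for a WIP URL.

    Hypothetical helper: assumes the path looks like /wip/<id>.
    URLs that do not start with /wip/ are returned unchanged.
    """
    parts = urlsplit(url)
    prefix = "/wip/"
    if parts.path.startswith(prefix):
        # Drop the /wip prefix but keep the leading slash before the id.
        parts = parts._replace(path="/" + parts.path[len(prefix):])
    return urlunsplit(parts)
```

This does not check whether the capture actually succeeded; the returned link may still end up as a 404.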

May 01 '20 18:05 antonalekseev

Do you think stripping the /wip/ will work reliably?

May 05 '20 15:05 palewire

I reckon it will be no less reliable than the current archiveis code was with the old-style (pre-WIP-page) handling on the server side. The Refresh header was available as soon as the Loading... page was, and archiveis.capture() returned it immediately and unconditionally. So unsuccessful archivals (Error: time out., Error: Network error., or an infinite Loading...) were not handled anyway, and the resulting link ultimately yielded a 404. Stripping /wip/ should work the same way.

On the one hand, bluntly ignoring errors is not an ideal approach; on the other hand, waiting up to 3-5 minutes on each call is not an option for many use cases. Maybe it makes sense to introduce an archiveis.capture(..., strict=False) parameter that defaults to the shortcut (existing) behaviour, plus an optional strict=True mode that parses the WIP page for all kinds of errors and raises exceptions?
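
A rough sketch of how such a strict switch could be structured, under several assumptions: `resolve_capture`, `CaptureError` and the `fetch` callable are hypothetical names, not existing archiveis API; `fetch(url)` is assumed to return a `(location, body)` pair, where `location` is the Location header value or `None`; and the error strings are simply the ones quoted above, matched as plain substrings of the WIP page body, which may not reflect the real markup.

```python
import time
from typing import Callable, Optional, Tuple

# Error messages quoted in this thread; how they appear in the real
# WIP page HTML is an assumption.
ERROR_MARKERS = ("Error: time out.", "Error: Network error.")


class CaptureError(Exception):
    """Hypothetical exception raised when a capture fails in strict mode."""


def resolve_capture(
    wip_url: str,
    fetch: Callable[[str], Tuple[Optional[str], str]],
    strict: bool = False,
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Turn a WIP URL into a memento URL.

    strict=False: strip /wip/ and return immediately, with no error handling.
    strict=True: poll the WIP page until a Location is returned, an error
    marker shows up in the body, or the timeout expires.
    """
    if not strict:
        return wip_url.replace("/wip/", "/", 1)

    deadline = time.monotonic() + timeout
    while True:
        location, body = fetch(wip_url)
        if location:
            return location
        for marker in ERROR_MARKERS:
            if marker in body:
                raise CaptureError(marker)
        if time.monotonic() >= deadline:
            raise CaptureError("capture still in progress after timeout")
        sleep(interval)
```

Passing `fetch` in as a parameter keeps the polling logic independent of the HTTP client, so it can be exercised with canned responses; a real version would wrap whatever request call the library already uses.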

May 05 '20 20:05 antonalekseev

Do you have any idea how we could implement this in Python?

Dec 27 '20 23:12 palewire
