
Add or consolidate "Resume after abort" guide

Open metalwarrior665 opened this issue 2 years ago • 2 comments

Which package is the feature request for? If unsure which one to select, leave blank

None

Feature

I'm missing a description of the abort/pause behavior with CTRL + C, and how it interacts with state, storage, and purging.

Maybe we can just consolidate the purge explanations, which are currently scattered across:
https://crawlee.dev/api/core/function/purgeDefaultStorages
https://crawlee.dev/docs/guides/result-storage#cleaning-up-the-storages

Motivation

People are still asking about the storages, purging, resuming etc. so we need to generally make this more prominent in the docs.

Ideal solution or implementation, and any additional constraints

The guides and examples are getting a bit chunky.

Alternative solutions or implementations

No response

Other context

No response

metalwarrior665 avatar Feb 09 '23 22:02 metalwarrior665

It would also be helpful to resume after a run that completed with, e.g., a request erroring out. I'd like to know how to retry that failed request; it may well succeed on a later attempt, and we could avoid re-processing everything before it.

For example, a request may have failed because of rate limiting. Later on the throttling may be lifted, so resuming where we left off would help.

I know this involves `Configuration.getGlobalConfig().set('purgeOnStart', false);` (or the equivalent env var). But something else is needed too (maybe mutating `request_queues/default/<entry>.json`?).

One way I started investigating, which still doesn't work, is to manually find the failed request in `crawlee_storage/request_queues/default/<xxx>.json`, rename that file to `archived_failure_<xxx>.json`, and then manually add a Request for that failed `<xxx>` back to the queue on resuming (with `purgeOnStart=false`).

This approximates what we want, but I'd expect there to be a way to do it without manually editing the `default/` directory, instead instructing the crawler to retry the failures.
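For reference, here is a sketch of the less invasive route, written against the Crawlee v3 API. `Configuration`, the `purgeOnStart` option, `RequestQueue.open()`, and `addRequest()` are real Crawlee APIs; the fresh `uniqueKey`, the placeholder URL, and the env var note are my assumptions for getting past deduplication of the already-handled request, so treat this as a configuration sketch rather than a confirmed solution:

```typescript
// Sketch: resume with storages intact and re-enqueue the failed URL.
// Assumes Crawlee v3; the fresh uniqueKey is a guess at how to bypass
// deduplication of the already-handled request.
import { CheerioCrawler, Configuration, RequestQueue } from 'crawlee';

// Keep the previous run's storages instead of purging them on start.
// (Env var equivalent: CRAWLEE_PURGE_ON_START=0.)
Configuration.getGlobalConfig().set('purgeOnStart', false);

const queue = await RequestQueue.open();

const failedUrl = 'https://example.com/failed-page'; // hypothetical URL
await queue.addRequest({
    url: failedUrl,
    // The original request is marked as handled under its default uniqueKey,
    // so a new uniqueKey is needed to avoid silent deduplication.
    uniqueKey: `retry:${Date.now()}:${failedUrl}`,
});

const crawler = new CheerioCrawler({
    requestQueue: queue,
    requestHandler: async ({ request }) => {
        console.log(`Processing ${request.url}`);
        // ... your handling logic ...
    },
});

await crawler.run();
```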

jawspeak avatar Jul 14 '23 23:07 jawspeak

For people who read this in the future, my workaround to resume a failed run without manually adding a request was:

  • copy the file `request_queues/default/<xxx>.json` to a backup, e.g. `archived_<xxx>.json`; the renamed copy will be ignored.
  • edit the originally named file: set the top-level `retryCount` to 0 or 1 (below the maximum we hit before), change `orderNo` to the current `Date.now()`, and edit the JSON string value under the `json` key, removing the `,\"handledAt\":\"2023-07-25T17:16:18.234Z\"` portion of the string (make sure to include the leading comma and that the escaped JSON is still valid when saved).
  • then resume with `purgeOnStart` set to false; I observed the previously failed queue resuming on that request.
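The steps above can be scripted. This is a sketch under the assumptions stated in the workaround (top-level `retryCount` and `orderNo` fields, plus a `json` key holding the serialized Request with a `handledAt` property); `unfailRequest` is a hypothetical helper I made up, not a Crawlee API, and the on-disk queue format is observed behavior that may change between versions:

```typescript
// Automates the manual "un-fail a request" edit on one queue entry file,
// e.g. unfailRequest('crawlee_storage/request_queues/default/xxx.json').
// Field names follow what was observed on disk, not a documented API.
import { readFileSync, writeFileSync } from 'node:fs';

function unfailRequest(entryPath: string): void {
    const entry = JSON.parse(readFileSync(entryPath, 'utf8'));

    // Reset the retry counter below the configured maximum so the request
    // becomes eligible to run again.
    entry.retryCount = 0;

    // Give the request a fresh order number so it is picked up on resume.
    entry.orderNo = Date.now();

    // The `json` key holds the serialized Request as an escaped JSON string;
    // drop `handledAt` so the queue no longer treats the request as handled.
    const request = JSON.parse(entry.json);
    delete request.handledAt;
    request.retryCount = 0;
    entry.json = JSON.stringify(request);

    writeFileSync(entryPath, JSON.stringify(entry, null, 2));
}
```

Back up the entry first (as in the steps above), then run this and start the crawler with `purgeOnStart` disabled.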

jawspeak avatar Jul 27 '23 04:07 jawspeak