Add or consolidate "Resume after abort" guide
Which package is the feature request for? If unsure which one to select, leave blank
None
Feature
I'm missing a description of the abort/pause behavior with Ctrl+C, and how it interacts with state, storage, and purging.
Maybe we can also consolidate the purge explanations, which are a bit scattered: https://crawlee.dev/api/core/function/purgeDefaultStorages https://crawlee.dev/docs/guides/result-storage#cleaning-up-the-storages
Motivation
People are still asking about the storages, purging, resuming etc. so we need to generally make this more prominent in the docs.
Ideal solution or implementation, and any additional constraints
The guides and examples are getting a bit chunky.
Alternative solutions or implementations
No response
Other context
No response
It would be helpful to resume after a run that completed with, e.g., a request erroring out. I'd like to know how to retry that failed request; it may well succeed on a second attempt, and we'd avoid re-processing everything before it.
For example, we may have failed because of rate limiting. Later on the throttling may be lifted, so resuming where we left off would help.
I know this involves `Configuration.getGlobalConfig().set('purgeOnStart', false);` (or the equivalent env var). But something else is needed too (maybe mutating the `request_queues/default/<entry>.json`?).
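For completeness, here is a minimal sketch of the two ways I understand purging can be disabled. The env var name `CRAWLEE_PURGE_ON_START` is what I believe Crawlee reads, so treat that as an assumption:

```javascript
// Sketch: keep request queues / datasets / key-value stores from the
// previous run instead of purging them on startup.
// Setting the env var before Crawlee initializes is (I believe) equivalent
// to calling Configuration.getGlobalConfig().set('purgeOnStart', false).
process.env.CRAWLEE_PURGE_ON_START = 'false';
```

With that set, a second run should see the previous run's storages on disk rather than an empty queue.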
One way I started investigating, which still doesn't work, was to manually find the failed request in `crawlee_storage/request_queues/default/<xxx>.json`, rename it to `archived_failure_<xxx>.json`, and then manually add a `Request` for that failed request `<xxx>` back to the queue on resuming (with `purgeOnStart=false`).
This approximates what we want, but I'd expect there is a way to do it without manually editing the `default/` directory, instead instructing the crawler to retry the failures again.
For people who read this in the future, my workaround to resume a failed run without manually adding a request was:
- Copy the file `request_queues/default/<xxx>.json` for backup, e.g. to `archived_<xxx>.json`; files named that way are ignored.
- Edit the originally named file: set the top-level `retryCount` to 0 or 1 (less than the max we hit before), change `orderNo` to the current `Date.now()`, and edit the JSON string value in the `json` key, removing the `,\"handledAt\":\"2023-07-25T17:16:18.234Z\"` portion of the string (make sure to remove the comma and keep the value valid escaped JSON when saved).
- Then resume with `purgeOnStart` false, and I observed the previously failed queue resuming on that request.