
Make a distinction between soft and hard limits

Open benoit74 opened this issue 8 months ago • 2 comments

We have three limits which can stop the crawler in the middle of a run:

  • --sizeLimit: the maximum WARC size
  • --timeLimit: the maximum duration of the crawl
  • --diskUtilization: the maximum disk usage (as a percentage); the crawler stops if the threshold is reached OR is expected to be reached

While the first two limits are used by zimit.kiwix.org to enforce fair usage of the system, the third is usually set to 90% and just ensures we do not fill the disk. In fact 45% would make more sense, since we need twice the crawler's disk usage to have enough space to create the ZIM, and even this number does not take into account that other tasks might be running at the same time and sharing the disk.
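The 45% figure follows from simple arithmetic, sketched below. The function name and the `space_multiplier` parameter are hypothetical and only illustrate the reasoning; they are not part of zimit or the crawler, and the result still ignores other tasks sharing the disk.

```python
def max_crawler_disk_utilization(total_ceiling_pct: float,
                                 space_multiplier: float = 2.0) -> float:
    """Return the crawler disk threshold (in percent) that keeps total usage
    under total_ceiling_pct once the ZIM has been created.

    Assumption: creating the ZIM needs roughly as much space as the crawl
    itself, hence the default multiplier of 2.
    """
    if space_multiplier <= 0:
        raise ValueError("space_multiplier must be positive")
    return total_ceiling_pct / space_multiplier


# With a 90% overall ceiling, the crawler should stop at 45%.
print(max_crawler_disk_utilization(90))
```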

When a limit is reached, the crawler returns code 11 and zimit continues to create the ZIM, probably because limits are generally hit in the zimit.kiwix.org scenario, where we want to provide a ZIM (even an incomplete one) to the user.

This is a problem for tasks running on farm.openzim.org, where we expect the ZIM not to be cut short by a limit.

I suggest reversing the logic for "safety":

  • by default, zimit stops whenever a limit is reached
  • a new flag --continue-on-crawler-limits is added to keep current behavior
    • to be set only on zimit.kiwix.org (probably by zimit-frontend)
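A minimal sketch of what this could look like, not zimit's actual code. Only return code 11 and the flag name `--continue-on-crawler-limits` come from this issue; the helper names, the constant name, and the choice to treat every other non-zero code as a failure are assumptions made for illustration.

```python
import argparse

# Return code the crawler uses when a limit is hit (as described in this issue).
CRAWLER_LIMIT_HIT = 11


def parse_args(argv):
    """Parse only the proposed flag (hypothetical, reduced parser)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--continue-on-crawler-limits",
        action="store_true",
        help="still create the ZIM when the crawler stopped on a limit",
    )
    return parser.parse_args(argv)


def should_build_zim(crawler_returncode: int, continue_on_limits: bool) -> bool:
    """Decide whether to create the ZIM once the crawler has exited."""
    if crawler_returncode == 0:
        return True  # crawl completed normally
    if crawler_returncode == CRAWLER_LIMIT_HIT:
        # New default: stop. Previous behavior only when explicitly requested.
        return continue_on_limits
    # Assumption: any other non-zero code is a crawler failure.
    return False


args = parse_args(["--continue-on-crawler-limits"])
print(should_build_zim(CRAWLER_LIMIT_HIT, args.continue_on_crawler_limits))
```

With this shape, farm.openzim.org tasks would get the safe default without any change, and only zimit.kiwix.org would need to pass the flag.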

Ideally we should put it in the 2.0 milestone, since it is a breaking change.

benoit74 · May 27 '24 08:05