Use static resource profile for requesting ZIM creation

Open audiodude opened this issue 9 months ago • 5 comments

Fixes #794.

Originally, we were trying to estimate how many resources we would need on Zimfarm based on the size of the selection. It seems the estimates were based on the idea that the max resource allocation of, say, 15 GB of memory would be enough to scrape almost all of English Wikipedia, so 3 GB would be enough to scrape 1M articles, or about 1/5 of that. This is obviously not right, and #790 makes it clear that there should be a certain maximum limit on the number of articles that can be requested through WP1.

Instead of doing any kind of heuristic calculation, we now just statically request the maximum amount of resources that we happen to know (from outside channels) the Zimfarm is able to support.
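To make the difference concrete, here is a minimal sketch of the two approaches, not the actual WP1 code. The names `ResourceProfile`, `proportional_memory` and `request_profile` are hypothetical, the 5M-article figure is only what the "1M articles, or about 1/5" estimate implies, and the `cpu` and `disk` values are placeholders rather than real Zimfarm limits.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ResourceProfile:
    cpu: int
    memory: int  # bytes
    disk: int  # bytes


# Assumptions taken from the description above: 15 GB was treated as enough
# for all of English Wikipedia, and 1M articles as about 1/5 of that.
FULL_WIKI_MEMORY = 15 * GIB
FULL_WIKI_ARTICLES = 5_000_000


def proportional_memory(num_articles: int) -> int:
    """Old idea: scale memory linearly with the size of the selection."""
    return FULL_WIKI_MEMORY * num_articles // FULL_WIKI_ARTICLES


# New idea: one fixed profile. The cpu and disk numbers are placeholders,
# not values confirmed by Zimfarm.
STATIC_PROFILE = ResourceProfile(cpu=3, memory=15 * GIB, disk=100 * GIB)


def request_profile(num_articles: int) -> ResourceProfile:
    """Return the same profile regardless of selection size."""
    return STATIC_PROFILE
```

Under these assumptions, `proportional_memory(1_000_000)` gives exactly 3 GiB, which is the estimate described above, while `request_profile` ignores its argument entirely.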

audiodude, Mar 23 '25 17:03

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 91.47%. Comparing base (137335d) to head (acb6281). Report is 52 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #814      +/-   ##
==========================================
- Coverage   91.48%   91.47%   -0.02%     
==========================================
  Files          66       66              
  Lines        3548     3541       -7     
==========================================
- Hits         3246     3239       -7     
  Misses        302      302              

codecov[bot], Mar 23 '25 17:03

@kelson42 I think we should discuss what is a fair usage of wp1. Being a free service, I wonder if we should apply any sort of limit (e.g. in terms of number of articles). This PR is opening the door to much bigger ZIMs, and I'm not sure this is really what we want (at least not without a safety net).

benoit74, Mar 23 '25 20:03

> @kelson42 I think we should discuss what is a fair usage of wp1. Being a free service, I wonder if we should apply any sort of limit (e.g. in terms of number of articles). This PR is opening the door to much bigger ZIMs, and I'm not sure this is really what we want (at least not without a safety net).

I agree, see https://github.com/openzim/wp1/issues/790#issuecomment-2746295558

EDIT: With regard to "much bigger ZIMs", it's not clear that any even "realistically" sized ZIMs have ever been requested with WP1. The resource heuristic being replaced here seemed completely broken. We should limit the size of ZIMs through other means (not the resource profile).

audiodude, Mar 23 '25 21:03

We discussed the matter live yesterday with kelson, and the outcome is:

  • I must provide the updated heuristic for which memory / disk / CPU to assign; it is still unclear whether we need something that depends on selection size or just a flat profile, but at least we do not want to use "maximum available on worker"
    • I will come back to you ASAP on this
  • I've opened https://github.com/openzim/wp1/issues/816 about the topic of limiting WP1 to reasonable requests

benoit74, Mar 25 '25 09:03

@audiodude Last time it worked, WPEN was scrapable in 16GB AND the memory consumption was related (if not proportional) to the number of articles. The current situation is the result of changes in MWoffliner.

kelson42, Apr 04 '25 04:04

Superseded by #858

audiodude, May 22 '25 15:05