mwoffliner
Better detect / recover on rate limiting
Command used to reproduce:
mwoffliner --webp --mwUrl="https://www.appropedia.org" --format="novid" --verbose="log" --publisher="openZIM" --adminEmail="[email protected]" --customZimTitle="Test" --customZimLanguage="eng" --customZimDescription="Test"
Version: 1.14.1-dev0
After a few articles, the scraper gets rate-limited (note that this happens with the VisualEditor renderer; I do not get such errors with the RestApi one):
[info] [2025-02-06T08:08:58.013Z] Getting JSON from [https://www.appropedia.org/w/api.php?action=parse&format=json&prop=modules%7Cjsconfigvars%7Cheadhtml&formatversion=2&page=World+Shelters]
[error] [2025-02-06T08:08:58.018Z] Error downloading article How_to_install_FLIR_Lepton_Thermal_Camera_and_applications_on_Raspberry_Pi/ja
[error] [2025-02-06T08:08:58.018Z] {
code: 'rest-rate-limit-exceeded',
info: 'A rate limit was exceeded. Please try again later.',
'error-keys': [ 'actionthrottledtext' ],
docref: 'See https://www.appropedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes.'
}
[error] [2025-02-06T08:08:58.019Z] Failed to run mwoffliner after [29s]: {}
[log] [2025-02-06T08:08:58.019Z] Exiting with code [2]
The error happens during module retrieval (which is done before retrieving the article itself).
It escapes the logic already in place around HTTP 429 errors, because this is an HTTP 200 response with an error field in the response body (whose content is shown in the log above).
I think we should have a mechanism to detect these rate-limit errors as well. I struggle to find proper documentation on the MediaWiki side about which error codes should be handled.
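To make the suggestion concrete, here is a minimal sketch of what I have in mind (my assumption of how it could look, not mwoffliner's actual code): treat a rate-limit error found in an HTTP 200 body the same way as an HTTP 429 and retry with backoff. The body shape and the 'rest-rate-limit-exceeded' / 'actionthrottledtext' codes come from the log above; 'ratelimited' is the classic Action API code; the helper names (getJsonWithBackoff, isBodyRateLimited) are made up:

```typescript
// Hypothetical sketch, not mwoffliner's actual code.

// Body shape as seen in the log above.
interface MwErrorBody {
  error?: {
    code?: string
    info?: string
    'error-keys'?: string[]
  }
}

// Codes observed here plus the classic Action API 'ratelimited'.
// The exhaustive list is exactly what is unclear from the MediaWiki docs.
function isBodyRateLimited(body: MwErrorBody): boolean {
  const err = body?.error
  if (!err) return false
  return (
    err.code === 'rest-rate-limit-exceeded' ||
    err.code === 'ratelimited' ||
    (err['error-keys'] ?? []).includes('actionthrottledtext')
  )
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))
const backoffMs = (attempt: number) => Math.min(60_000, 1_000 * 2 ** attempt)

// Made-up helper standing in for the "Getting JSON from [...]" step.
async function getJsonWithBackoff(url: string, maxRetries = 5): Promise<unknown> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const resp = await fetch(url)
    if (resp.status === 429) {
      // Already covered by the existing retry logic around HTTP 429.
      await sleep(backoffMs(attempt))
      continue
    }
    const body = (await resp.json()) as MwErrorBody
    if (isBodyRateLimited(body)) {
      // New case: HTTP 200 with a rate-limit error in the body.
      await sleep(backoffMs(attempt))
      continue
    }
    return body
  }
  throw new Error(`Still rate limited after ${maxRetries} retries: ${url}`)
}
```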
Hi! I tried running your command to reproduce, but got an error about some pages being private. So I added --articleList=https://www.appropedia.org/scripts/generateKiwixList.php to run only on non-private pages, and then I got the error you describe. I dug into it and, after much testing, figured out that the problem was the limit on the "stashbasehtml" action set by default in $wgRateLimits, which is related to the VisualEditor renderer. So I added unset( $wgRateLimits['stashbasehtml']['ip'] ); to our LocalSettings.php to remove that limit, and now the command seems to run successfully on our server!
After some thought and way more experience with mwoffliner, I don't think we should do anything here. In general, these rate limits can easily be overcome by logging into the wiki, so just slowing the scraper down because you're not logged in would not do it much service.
We could advise users to provide login credentials to get less strict API limits. But I don't see how this could be done sensibly in code; it is much better handled in the documentation / wiki.
Any suggestion on what to do in the code here, or shall I just add a wiki entry and close this issue?