benoit74
It looks like iframes are currently not rewritten at all in mwoffliner. They should be. See https://github.com/openzim/zim-requests/issues/1471#issuecomment-3043489876 This iframe comes from the wikitext itself:
Currently, the default scraping speed seems to put quite a lot of "pressure" on MediaWiki instances. From someone at minecraft.wiki: > The way they are scraping it is really bad/resource intensive Like they...
In https://github.com/openzim/zim-requests/issues/1260#issuecomment-3324636008, we've faced an infinite loop while following continue parameters. I don't know if this is worth it, but some kind of logic detecting that we are in such...
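One way such a detection could work is to remember every `continue` object the API has already returned and abort as soon as one repeats. The sketch below is only an illustration under assumptions: `fetch`, `iterate_batches`, `ContinueLoopError` and `max_batches` are hypothetical names, not mwoffliner code, and the only API behaviour relied upon is that MediaWiki returns a `continue` object to be merged into the next request.

```python
import json
from typing import Any, Callable, Dict, Iterator


class ContinueLoopError(RuntimeError):
    """Raised when the API keeps sending us back to a state we already visited."""


def iterate_batches(
    fetch: Callable[[Dict[str, Any]], Dict[str, Any]],
    max_batches: int = 10_000,
) -> Iterator[Dict[str, Any]]:
    """Yield API responses, following `continue` until it disappears.

    `fetch` is a hypothetical callable: it receives the continue parameters
    to add to the request and returns the decoded JSON response.
    """
    seen = set()  # serialized continue objects already received
    params: Dict[str, Any] = {}
    for _ in range(max_batches):
        response = fetch(params)
        yield response
        cont = response.get("continue")
        if not cont:
            return  # no continuation: the listing is complete
        # Serialize with sorted keys so that key order does not matter.
        key = json.dumps(cont, sort_keys=True)
        if key in seen:
            raise ContinueLoopError(f"continue parameters repeated: {key}")
        seen.add(key)
        params = dict(cont)
    # Safety net for loops that never repeat exactly (assumed limit).
    raise ContinueLoopError(f"gave up after {max_batches} batches")
```

Keeping the set of seen values costs one short string per batch, which should stay small compared to the data being scraped. Whether the scraper should then fail or just stop following the continuation is a separate decision.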
It looks like many wikis use some dynamic mechanism to add classes that support light/dark themes. See e.g. https://github.com/openzim/mwoffliner/issues/2416#issuecomment-3047990568 It could help to have a scraper option to...
https://www.mediawiki.org/wiki/MediaWiki_Language_Extension_Bundle https://www.mediawiki.org/wiki/Special:MyLanguage/Extension:Translate The `Translate` extension adds `Special:MyLanguage` links, which need to be properly remapped inside the ZIM. We probably also want to automatically create one ZIM per language to not...
https://wiki.restarters.net uses a specially crafted skin: `chameleon`, see https://wiki.restarters.net/api.php?action=query&format=json&meta=siteinfo&formatversion=2&siprop=general|skins Nice enhancement for openzim/zim-requests#1357
When multiple levels of redirects are in place, not all redirects are added correctly to the ZIM. I've made a test case on wiki.kiwix.org: `MWoffliner_Tests/Test4` redirects to `MWoffliner_Tests/Test3`...
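To make the expected behaviour concrete, here is a minimal sketch of how a chain of redirects could be flattened so that every redirect points directly at its final article. It is an illustration only, not how mwoffliner handles redirects: `flatten_redirects` and the titles in the usage example are hypothetical, and the input is assumed to be a plain source-to-target mapping.

```python
from typing import Dict, Optional


def flatten_redirects(redirects: Dict[str, str]) -> Dict[str, str]:
    """Map every redirect source to the final, non-redirect target.

    `redirects` maps a redirect title to the title it points at, which may
    itself be a redirect. Sources caught in a redirect cycle are dropped,
    since they have no final target.
    """
    flattened: Dict[str, str] = {}
    for source, first_target in redirects.items():
        visited = {source}
        target: Optional[str] = first_target
        # Follow the chain until we reach a title that is not a redirect.
        while target in redirects:
            if target in visited:
                target = None  # cycle detected, no final target exists
                break
            visited.add(target)
            target = redirects[target]
        if target is not None:
            flattened[source] = target
    return flattened


# Hypothetical titles: C redirects to B, which redirects to article A.
chain = {"C": "B", "B": "A"}
print(flatten_redirects(chain))
```

With this input both `C` and `B` end up pointing at `A`, so both redirects can be written to the ZIM regardless of the order in which they are processed.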
Currently, requests to the `action=parse` endpoint (e.g. https://mdwiki.org/w/api.php?action=parse&format=json&prop=modules%7Cjsconfigvars%7Cheadhtml%7Ctext%7Cdisplaytitle%7Csubtitle&usearticle=1&disableeditsection=1&disablelimitreport=1&page=2%2C3%2C5%2C6-Tetramethoxyphenethylamine&useskin=vector&redirects=1&formatversion=2) are not cached at all. This query is used to retrieve, for a given article, its text, headhtml, display title, subtitle and list...
Currently, the scraper pushes all articles to the ZIM, no matter what their contentmodel is. I feel like this is wrong: by default, the scraper should scrape only the `wikitext` contentmodel,...
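As a rough illustration of the proposed default, the filter could look like the sketch below. It assumes each page is a dict carrying the `contentmodel` field that `action=query&prop=info` reports. `select_pages` and the `allowed_models` parameter are hypothetical names, not existing scraper options.

```python
from typing import Any, Dict, Iterable, List


def select_pages(
    pages: Iterable[Dict[str, Any]],
    allowed_models: Iterable[str] = ("wikitext",),
) -> List[Dict[str, Any]]:
    """Keep only pages whose contentmodel is in `allowed_models`.

    Pages without a `contentmodel` field are skipped, on the assumption
    that an unknown model should not be scraped by default.
    """
    allowed = set(allowed_models)
    return [page for page in pages if page.get("contentmodel") in allowed]


# Hypothetical page info entries, shaped like `prop=info` results.
pages = [
    {"title": "Main Page", "contentmodel": "wikitext"},
    {"title": "MediaWiki:Common.css", "contentmodel": "css"},
    {"title": "Module:Example", "contentmodel": "Scribunto"},
]
print([page["title"] for page in select_pages(pages)])
```

Making the allowed list a parameter leaves room for wikis where other content models are wanted in the ZIM.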
I think that we should enhance two things regarding logging in mwoffliner, but they are breaking changes. First, we should prefer to use "standard" log levels. Currently we use `'info',...