mwoffliner icon indicating copy to clipboard operation
mwoffliner copied to clipboard

Add JavaScript option

Open Markus-Rost opened this issue 4 months ago • 19 comments

Fix #2310

  • Add --javaScript option with values "none", "trusted" or "all" (default being "trusted").
  • Add --addModules option for additional ResourceLoader modules

Markus-Rost avatar Aug 18 '25 19:08 Markus-Rost

This draft is currently made as a proof of concept with a disregard for any other renderer it might be breaking.

Markus-Rost avatar Aug 18 '25 19:08 Markus-Rost

I created some example ZIM files including no JavaScript, all JavaScript or only JavaScript trusted not to request external content. Please take a look at them and let me know how you feel about them.

  • All JavaScript disabled
  • Only trusted JavaScript enabled
  • Trusted JavaScript enabled, as well as some additional user created scripts not requesting external content
    • wikipedia_en_10_JS_trusted_addModules_2025-08.zip
    • mwoffliner --mwUrl=https://en.wikipedia.org/ --articleList=http://download.openzim.org/wp1/enwiki/tops/10.tsv --javaScript=trusted --addModules=site,ext.gadget.ReferenceTooltips,oojs,oojs-ui,ext.gadget.switcher,ext.gadget.Calculator,@wikimedia/codex
  • All JavaScript enabled, including scripts requesting external content

Markus-Rost avatar Aug 18 '25 20:08 Markus-Rost

I tried PoC ZIMs.

For some reasons, the wikipedia_en_10_JS_trusted_addModules_2025-08.zim is causing a significant problem to my kiwix-serve / kiwix-apple readers. Both simply fails to load the ZIM main article (or any other article in kiwix-serve). Surprisingly I rebuild a ZIM with same command and few other articles and it works fine. Don't know what is broken, but something is.

All javascripts version is obviously causing problems in the web console ... but nothing really visible in the articles.

The "trusted javascript" works very well.

I'm not sufficiently export to assess the list of trusted modules, or the one of "addModules" version, but this is more a detail we can refine over the years anyway.

The only things I can say is that I confirm that the "trusted javascript with additional modules" is required/enough to solve #1491 and #1911 articles. Only "trusted javascript" is not enough.

I've tested with JS disabled in Chrome and I confirm the ZIM degrades nicely (i.e. it looks like we get the same content than currently without this PR merged).

All points above makes me consider that we might want to change default scraper behavior (compared to current PR behavior), so that scraper creates "near-perfect" ZIM for "newbies", without needing additional flags. Especially important for people willing to ZIM wikipedia which is somehow our primary audience:

  • we should probably have a --nojavascript option instead of a --javascript option, for those willing to not have any JS (even if ZIM with JS seems to degrade nicely in browsers with JS disabled, pretty sure some folk would prefer to not have any JS code in their ZIMs)
  • we should maybe add more scripts to the trusted list

@Markus-Rost Thanks a lot anyway, very impressive change. Glad you made it work! I did not had any look at the code since you specified it is only a PoC. Let me know when it is worth it.

benoit74 avatar Aug 21 '25 10:08 benoit74

For some reasons, the wikipedia_en_10_JS_trusted_addModules_2025-08.zim is causing a significant problem to my kiwix-serve / kiwix-apple readers. Both simply fails to load the ZIM main article (or any other article in kiwix-serve). Surprisingly I rebuild a ZIM with same command and few other articles and it works fine. Don't know what is broken, but something is.

Oh, I'm getting that as well. Not sure what happened there. I tested it locally using 100.tsv which worked fine and created the 10.tsv version to fit the GitHub upload limit, but didn't actually test that one again.

The only things I can say is that I confirm that the "trusted javascript with additional modules" is required/enough to solve #1491 and #1911 articles. Only "trusted javascript" is not enough.

Yes, those tables need the site module.

All points above makes me consider that we might want to change default scraper behavior (compared to current PR behavior), so that scraper creates "near-perfect" ZIM for "newbies", without needing additional flags. Especially important for people willing to ZIM wikipedia which is somehow our primary audience:

Do you mean having all JS included by default? That doesn't seem like a good idea as some JS not expecting a request to fail can easily break stuff on the page.

  • we should probably have a --nojavascript option instead of a --javascript option, for those willing to not have any JS (even if ZIM with JS seems to degrade nicely in browsers with JS disabled, pretty sure some folk would prefer to not have any JS code in their ZIMs)

I've already added --javaScript=none, does no js really need it's own option?

  • we should maybe add more scripts to the trusted list

I've looked through basically all the modules I could find loaded on Wikipedia articles as well as a few other wikis and those are all the modules which won't try to load any non-scraped content. Sadly all three MathJax extensions either load external JS or from an extension path we can't easily rewrite or guess ahead of time. @benoit74 let me know about any specific scripts to add you have in mind and I'll take a look at them.

Markus-Rost avatar Aug 21 '25 12:08 Markus-Rost

OK, then what I meant is having --javascript=trusted as default. If we have --javascript=none, this is OK.

And what I meant by having more modules was having everything needed for wikipedia (having the ancestry table in mind). But if this comes from the site modules, and since (correct me if I'm wrong) we could have anything in the site module, then we can't do that. Or maybe we could add the site modules automatically only for wikimedia projects... not sure it is such a good idea.

benoit74 avatar Aug 21 '25 12:08 benoit74

OK, then what I meant is having --javascript=trusted as default. If we have --javascript=none, this is OK.

--javaScript=trusted is already the default in this PR

And what I meant by having more modules was having everything needed for wikipedia (having the ancestry table in mind). But if this comes from the site modules, and since (correct me if I'm wrong) we could have anything in the site module, then we can't do that. Or maybe we could add the site modules automatically only for wikimedia projects... not sure it is such a good idea.

Yeah, the site module can contain anything and change at any time. We also can't just include it for all Wikimedia projects, as on some wikis it might load external content (like on es.wikipedia.org).

Markus-Rost avatar Aug 21 '25 13:08 Markus-Rost

@Markus-Rost Thanks a lot anyway, very impressive change. Glad you made it work! I did not had any look at the code since you specified it is only a PoC. Let me know when it is worth it.

@benoit74 Oh please take a look at the code. I assume that I've broken all renderers besides ActionParse, though I'm also not familiar at all with those renderers and have no idea how badly broken they are. I'm certainly gonna need some help with getting those issues fixed.

Markus-Rost avatar Aug 22 '25 01:08 Markus-Rost

Apologies for the delay. I've now tested these archives in Kiwix JS Browser Extension, with the same results as @benoit74 reported.

  • The JS_all shows errors in console relating to failures to fetch some scripts (but I'm not actually seeing blocked calls to external resources).
  • JS_trusted shows a bunch of failures on the landing page, but not on other pages I tested.
  • JS_none also shows failures on the landing page (clearly references to modules not included in the archive have not been removed), and also on article pages a failure to find C/_mw_/startup.js.
  • JS_trusted_addModules fails to find the landing page, but article pages are fine.

I don't see any differences in formatting, suggesting to me that at least for the tested pages JS_none is as good as other formats, and would be faster to load on older devices due to fewer resources needing to be extracted. However, it would be necessary to remove references to scripts not in the archive to avoid the multiple console errors from attempts to load these. However, JS_trusted seems like a good compromise to me, so long as calls to untrusted resources are removed from the landing page.

Regarding the core issue of calling external scripts or resources #2310, it seems none of the test pages I noticed had that problem in any of these test archives, but possibly I didn't test deeply enough.

Jaifroid avatar Aug 22 '25 08:08 Jaifroid

Regarding the core issue of calling external scripts or resources https://github.com/openzim/mwoffliner/issues/2310, it seems none of the test pages I noticed had that problem in any of these test archives, but possibly I didn't test deeply enough.

Currently the scraper has indeed more or less solved #2310 by removing all JS, so that we de-facto do not load anymore external scripts. But this is not really a long term solution. This is why we kept #2310 open saying this was more an interim fix and the medium term solution is this PR, adding back "just enough" scripts.

benoit74 avatar Aug 22 '25 10:08 benoit74

  • The JS_all shows errors in console relating to failures to fetch some scripts (but I'm not actually seeing blocked calls to external resources).

Regarding the core issue of calling external scripts or resources #2310, it seems none of the test pages I noticed had that problem in any of these test archives, but possibly I didn't test deeply enough.

The calling of external scripts seen in #2310 is mostly just the attempted loading of JS modules. By rewriting startup.js to load those modules from C/_mw_/MODULE.js, they are now the failures to fetch some scripts you noticed in JS_all (though a lot of the called scripts are included in the ZIM).

Of course JS_all can still have scripts attempting to load external resources, just that en.wikipedia is a bad example for that because they have very little JS doing that in the first place. Getting JS_all on es.wikipedia for example will load external scripts through the site module as mentioned in an earlier comment.

  • JS_trusted shows a bunch of failures on the landing page, but not on other pages I tested.
  • JS_none also shows failures on the landing page (clearly references to modules not included in the archive have not been removed), and also on article pages a failure to find C/_mw_/startup.js.
  • JS_trusted_addModules fails to find the landing page, but article pages are fine.

I have the suspicion that my filtering of JS modules stopped the already existing landing page JS from being included 😅 Ah, I'll fix JS_none including startup.

I don't see any differences in formatting, suggesting to me that at least for the tested pages JS_none is as good as other formats, and would be faster to load on older devices due to fewer resources needing to be extracted. However, it would be necessary to remove references to scripts not in the archive to avoid the multiple console errors from attempts to load these. However, JS_trusted seems like a good compromise to me, so long as calls to untrusted resources are removed from the landing page.

You should notice small things like the navboxes at the bottom of articles being collapsed by default now or the video/audio player popup on "Michael Jackson", but yes there aren't a lot of differences in formatting as en.wikipedia does not rely as heavily on JS as other wikis might.

Markus-Rost avatar Aug 22 '25 11:08 Markus-Rost

@kelson42 we need your decision on this one as well, it sat idle for way too long.

I again see no reason to be worried about the inline scripts, because as stated by @Markus-Rost the ZIM will anyway fallback to no-JS even if we remove these inline scripts (because startup.js also requires eval which is blocked in same context than inline scripts).

I tried the wikipedia_en_10_JS_trusted_2025-08.zip ZIM in kiwix-desktop (it was your main concern) and it works like a charm, including all JS-based functions (like show/hide at the bottom of "Michael Jackson" page).

Taking measures to remove these inline scripts without providing any added value to the end users looks like a waste of our precious resources and a risk to introduce bugs.

The no-JS fallback is working exactly as-of today (without this PR merged), i.e. the ZIM will still be high-quality enough.

I understand you would prefer to provide same ZIM behavior on all readers, but at some point I feel like it is important to recognize that the effort is not worth it.

I've also opened https://github.com/openzim/overview/issues/62 to discuss the general policy for openZIM

benoit74 avatar Oct 02 '25 07:10 benoit74

I also tried the wikipedia_en_10_JS_trusted_2025-08.zip ZIM in kiwix-js and kiwix-pwa in Restricted mode and I confirm it fallbacks very nice to no-JS mode (i.e. the ZIM looks just like a currently prod ZIM not supporting JS at all).

benoit74 avatar Oct 02 '25 07:10 benoit74

I also tried the wikipedia_en_10_JS_trusted_2025-08.zip ZIM in kiwix-js and kiwix-pwa in Restricted mode and I confirm it fallbacks very nice to no-JS mode (i.e. the ZIM looks just like a currently prod ZIM not supporting JS at all).

That one has a bunch of violations in ServiceWorkerLocal mode in Chrome, with the CSP violations looking pretty rough:

image

But more worryingly, in ServiceWorker mode, we get a bunch of attempts to access an external PHP server (wikipedia.org/w/load.php). At the very least those calls should be removed ISTM! -

image

Jaifroid avatar Oct 02 '25 09:10 Jaifroid

I also tried the wikipedia_en_10_JS_trusted_2025-08.zip ZIM in kiwix-js and kiwix-pwa in Restricted mode and I confirm it fallbacks very nice to no-JS mode (i.e. the ZIM looks just like a currently prod ZIM not supporting JS at all).

That one has a bunch of violations in ServiceWorkerLocal mode in Chrome, with the CSP violations looking pretty rough:

image

I'm only seeing the two CSP violations from the two inline scripts existing. These are expected (and can be reduced to just one).

But more worryingly, in ServiceWorker mode, we get a bunch of attempts to access an external PHP server (wikipedia.org/w/load.php). At the very least those calls should be removed ISTM! -

image

This is worrying, as only the startup module does those requests and has been modified in this PR to make JS work properly in the first place. The only way I can think of for you to get these errors in the PR ZIM again is if your service worker is somehow using a startup.js file cached from a different Wikipedia ZIM, which seems like a major security issue. @Jaifroid

Markus-Rost avatar Oct 02 '25 10:10 Markus-Rost

But more worryingly, in ServiceWorker mode, we get a bunch of attempts to access an external PHP server (wikipedia.org/w/load.php). At the very least those calls should be removed ISTM! -

image

Oh, I found where this is coming from. It's the sourceMappingURL for modules loaded from local storage. Should be fairly easy to just remove that annotation, though the source map is only loaded by the browser dev tools anyway.

Markus-Rost avatar Oct 02 '25 12:10 Markus-Rost

This is worrying, as only the startup module does those requests and has been modified in this PR to make JS work properly in the first place. The only way I can think of for you to get these errors in the PR ZIM again is if your service worker is somehow using a startup.js file cached from a different Wikipedia ZIM, which seems like a major security issue. @Jaifroid

@Markus-Rost Just to confirm that I only do these tests in the Browser Extension, which takes a purist approach and only uses the resources supplied in the ZIM except (currently) for an added dark CSS if the user selects that (to be changed soon to native dark). (Even in the PWA, I never supply JS, only CSS for some other transforms.)

Jaifroid avatar Oct 02 '25 13:10 Jaifroid

I'm having less time to work on this now, but I would like to avoid this PR becoming stale. @benoit74 can you review the PR further (besides the inline script existing), so that I have some pointers to work on whenever I'm available?

Markus-Rost avatar Oct 10 '25 17:10 Markus-Rost

What about DO_PROPAGATION, ALL_READY_FUNCTION and LOAD_PHP? Looks like these were "useful" things at some point, but indeed never called since years since the module name was not correct so startup.js was never fixed.

I'm unsure what DO_PROPAGATION was supposed to solve. It removed a short timeout before each modules got loaded. I suspect the reason was to avoid using mw.requestIdleCallback because the mw class wasn't properly initialized before. Regarding the reason of the timeout, the MW script has a comment about it:

 		// Yield for two reasons:
		// * Allow successive calls to mw.loader.impl() from the same
		//   load.php response, or from the same asyncEval() to be in the
		//   propagation batch.
		// * Allow the browser to breathe between the reception of
		//   module source code and the execution of it.
		//
		// Use a high priority because the user may be waiting for interactions
		// to start being possible. But, first provide a moment (up to 'timeout')
		// for native input event handling (e.g. scrolling/typing/clicking).
		mw.requestIdleCallback( doPropagation, { timeout: 1 } );

ALL_READY_FUNCTION just disabled the check if all dependency modules are loaded. Again likely existed because the resource loader didn't work fully, but is no longer needed as the resource loader now works as intended inside the ZIM.

No idea what LOAD_PHP was intended to do, it does not match anything on any recent startup module (besides possibly nuking half the minified module code due to an overly agressive regex)

Markus-Rost avatar Oct 13 '25 16:10 Markus-Rost

OK then this PR should completely remove DO_PROPAGATION, ALL_READY_FUNCTION and LOAD_PHP for the time being, no need to keep in codebase stuff we do not really understand and does not seem to provide much added value.

benoit74 avatar Oct 13 '25 16:10 benoit74