Invalid internal link error raised for template within <script> tag
zimcheck raises Invalid internal link errors as follows for a image link which is actually a template inside a
- <%= largeSRC %>
(A/course/core-english-01/topic-5-my-feelings-and-myself/unit-1-dealing-with-different-emotions/step-2/</largeSRC /) were not found in article A/course/core-english-01/topic-5-my-feelings-and-myself/unit-1-dealing-with-different-emotions/step-2/index.html
The script tag within which this template is as follows -
<script id="image-modal-tpl" type="text/template">
<div class="wrapper-modal wrapper-modal-image">
<section class="image-link">
<%= smallHTML%>
<a href="#" class="modal-ui-icon action-fullscreen" role="button">
<span class="label">
<span class="icon fa fa-arrows-alt fa-large" aria-hidden="true"></span> <%- gettext("Fullscreen") %>
</span>
</a>
</section>
<section class="image-modal">
<section class="image-content">
<div class="image-wrapper">
<img alt="<%= largeALT %>, <%- gettext('Large') %>" src="<%= largeSRC %>" />
</div>
<a href="#" class="modal-ui-icon action-close" role="button">
<span class="label">
<span class="icon fa fa-remove fa-large" aria-hidden="true"></span> <%- gettext("Close") %>
</span>
</a>
<ul class="image-controls">
<li class="image-control">
<a href="#" class="modal-ui-icon action-zoom-in" role="button">
<span class="label">
<span class="icon fa fa fa-search-plus fa-large" aria-hidden="true"></span> <%- gettext("Zoom In") %>
</span>
</a>
</li>
<li class="image-control">
<a href="#" class="modal-ui-icon action-zoom-out is-disabled" aria-disabled="true" role="button">
<span class="label">
<span class="icon fa fa fa-search-minus fa-large" aria-hidden="true"></span> <%- gettext("Zoom Out") %>
</span>
</a>
</li>
</ul>
</section>
</section>
</div>
</script>
This results in many false errors as this is present in nearly all HTML files in PHZH ZIMs
@satyamtg thank you for reporting this issue. Can you please provide the problematic zimfile?
@MiguelRocha you can find a ZIM here - https://farm.openzim.org/pipeline/5f3cf02c5684802b46244c2a
For future reference: phzh_core-english-one_en_2020-08.zim
@rgaudin @MiguelRocha @satyamtg It seems difficult from the zimcheck perspective to easily make the difference beetween a template and a "noirmal" HTML page. IMO We are missing here some kind of meta information or something else is wrong. I wonder for example:
- Is this a js template?
- why we have template at all here?
- why this is in considered as a normal HTML article? in
A/? With the stdtext/htmlmime-type?
@rgaudin @MiguelRocha @satyamtg It seems difficult from the
zimcheckperspective to easily make the difference beetween a template and a "noirmal" HTML page. IMO We are missing here some kind of meta information or something else is wrong. I wonder for example:
- Is this a js template?
- why we have template at all here?
- why this is in considered as a normal HTML article? in
A/? With the stdtext/htmlmime-type?
@kelson42 this is scraped directltly from the instance during fetching of all CSS/JS (i.e. all script, style and link (with rel="stylesheet") tags from the HTML) to apply the instance style on the offline version. It's considered as a normal HTML and is in the A/ namespace because its written within the HTML in a <script> tag, and is not in a separate file.
How is this feature handled in zimcheck? Is this through an actual DOM parser or via regexp? Might be real tricky if the latter.
@rgaudin This is a regex parser
Agree with @rgaudin, here the core of the problem is that we have a regex based parser and we should have a DOM based.
@maneeshpm Might be a good candidate for you, replacing the functions which retrieve the link with a DOM (pugixml) parser.
Thanks @kelson42! Looking into this issue.
@maneeshpm Thank you very much. Ifyou have other tasks ongoing, please try to finish them first.
@kelson42 Sure, they are almost done(pending final review and merge). I will start this as soon as they are closed.
@kelson42 To make sure I understand the issue correctly, the expected behavior is to extract all src and href attributes. The current function generic_getLinks(), while doing this also gathers some src inside the script tags which causes failure in zimcheck.
So we want to rewrite this function using pugixml such that all src and href attributes of nodes that are not child of a script node(but can be an attribute of script node itself) are collected.
Is this correct?
@kelson42 Parsing with pugixml also causes some errors. Since the file we are trying to parse is not exactly XML, some elements inside the
- non closed tags:
<%= smallHTML%>,<%- gettext("Fullscreen") %> - invalid attributes:
"dismiss": '<a aria-label="dismiss cookie message" id="dismiss" role=button tabindex="2" class="cc-btn cc-dismiss:focus">{{dismiss}}</a>',button should be inside quotes.
If we can get rid of these, pugixml works well in tree_walker traverse mode. Exploring workarounds.
@maneeshpm This is a really pertinent remark. pugixml seems indeed not the properly tool. Not sure for the moment how to proceed.
Got this problem again with a different, simpler use case. The ZIM is at http://tmp.kiwix.org/beer_meta.zim
When using zimcheck -U on it, it will raise a Not Found Article for static/css/mobile.css. You shouldn't worry about the rest of the errors nor the reason it's not finding the article. The fact that this is mentioned only within a <script /> is the problem here.
Actually, AFAIK, zimcheck raises also for <blockquote />, <pre /> and <code /> all of those should be excluded.
@maneeshpm This is a really pertinent remark. pugixml seems indeed not the properly tool. Not sure for the moment how to proceed.
Gumbo (https://github.com/google/gumbo-parser) is probably a better parser to parse html. Fortunately, it is already a dependencies of zim-tools (through zimwriterfs), so we just have to use it :)
This looks interesting, an html5 specific parser like Gumbo is more suited for our need than depending on an XML based parser like Pugixml.
This ticket is clearly blocked by #331