zim-tools Invalid internal link error raised for template within <script> tag

zimcheck raises Invalid internal link errors as follows for a image link which is actually a template inside a

- <%= largeSRC %>
(A/course/core-english-01/topic-5-my-feelings-and-myself/unit-1-dealing-with-different-emotions/step-2/</largeSRC /) were not found in article A/course/core-english-01/topic-5-my-feelings-and-myself/unit-1-dealing-with-different-emotions/step-2/index.html

The script tag within which this template is as follows -

<script id="image-modal-tpl" type="text/template">
    <div class="wrapper-modal wrapper-modal-image">
  <section class="image-link">
    <%= smallHTML%>
    <a href="#" class="modal-ui-icon action-fullscreen" role="button">
      <span class="label">
        <span class="icon fa fa-arrows-alt fa-large" aria-hidden="true"></span> <%- gettext("Fullscreen") %>
      </span>
    </a>
  </section>

  <section class="image-modal">
    <section class="image-content">
      <div class="image-wrapper">
        <img alt="<%= largeALT %>, <%- gettext('Large') %>" src="<%= largeSRC %>" />
      </div>

      <a href="#" class="modal-ui-icon action-close" role="button">
        <span class="label">
          <span class="icon fa fa-remove fa-large" aria-hidden="true"></span> <%- gettext("Close") %>
        </span>
      </a>

      <ul class="image-controls">
        <li class="image-control">
          <a href="#" class="modal-ui-icon action-zoom-in" role="button">
            <span class="label">
              <span class="icon fa fa fa-search-plus fa-large" aria-hidden="true"></span> <%- gettext("Zoom In") %>
            </span>
          </a>
        </li>

        <li class="image-control">
          <a href="#" class="modal-ui-icon action-zoom-out is-disabled" aria-disabled="true" role="button">
            <span class="label">
              <span class="icon fa fa fa-search-minus fa-large" aria-hidden="true"></span> <%- gettext("Zoom Out") %>
            </span>
          </a>
        </li>
      </ul>
    </section>
  </section>
</div>

</script>

This results in many false errors as this is present in nearly all HTML files in PHZH ZIMs

Aug 19 '20 11:08 satyamtg

@satyamtg thank you for reporting this issue. Can you please provide the problematic zimfile?

Aug 19 '20 15:08 MiguelRocha

@MiguelRocha you can find a ZIM here - https://farm.openzim.org/pipeline/5f3cf02c5684802b46244c2a

Aug 19 '20 16:08 satyamtg

For future reference: phzh_core-english-one_en_2020-08.zim

Aug 23 '20 08:08 MiguelRocha

@rgaudin @MiguelRocha @satyamtg It seems difficult from the zimcheck perspective to easily make the difference beetween a template and a "noirmal" HTML page. IMO We are missing here some kind of meta information or something else is wrong. I wonder for example:

Is this a js template?
why we have template at all here?
why this is in considered as a normal HTML article? in A/? With the std text/html mime-type?

Aug 27 '20 06:08 kelson42

@rgaudin @MiguelRocha @satyamtg It seems difficult from the zimcheck perspective to easily make the difference beetween a template and a "noirmal" HTML page. IMO We are missing here some kind of meta information or something else is wrong. I wonder for example:

Is this a js template?

why we have template at all here?

why this is in considered as a normal HTML article? in A/? With the std text/html mime-type?

@kelson42 this is scraped directltly from the instance during fetching of all CSS/JS (i.e. all script, style and link (with rel="stylesheet") tags from the HTML) to apply the instance style on the offline version. It's considered as a normal HTML and is in the A/ namespace because its written within the HTML in a <script> tag, and is not in a separate file.

Aug 27 '20 16:08 satyamtg

How is this feature handled in zimcheck? Is this through an actual DOM parser or via regexp? Might be real tricky if the latter.

Aug 28 '20 08:08 rgaudin

@rgaudin This is a regex parser

Aug 28 '20 08:08 kelson42

Agree with @rgaudin, here the core of the problem is that we have a regex based parser and we should have a DOM based.

Jan 29 '21 07:01 kelson42

@maneeshpm Might be a good candidate for you, replacing the functions which retrieve the link with a DOM (pugixml) parser.

Jan 29 '21 07:01 kelson42

Thanks @kelson42! Looking into this issue.

Jan 29 '21 07:01 maneeshpm

@maneeshpm Thank you very much. Ifyou have other tasks ongoing, please try to finish them first.

Jan 29 '21 07:01 kelson42

@kelson42 Sure, they are almost done(pending final review and merge). I will start this as soon as they are closed.

Jan 29 '21 08:01 maneeshpm

@kelson42 To make sure I understand the issue correctly, the expected behavior is to extract all src and href attributes. The current function generic_getLinks(), while doing this also gathers some src inside the script tags which causes failure in zimcheck. So we want to rewrite this function using pugixml such that all src and href attributes of nodes that are not child of a script node(but can be an attribute of script node itself) are collected. Is this correct?

Jan 31 '21 15:01 maneeshpm

@kelson42 Parsing with pugixml also causes some errors. Since the file we are trying to parse is not exactly XML, some elements inside the

non closed tags: <%= smallHTML%>, <%- gettext("Fullscreen") %>
invalid attributes: "dismiss": '<a aria-label="dismiss cookie message" id="dismiss" role=button tabindex="2" class="cc-btn cc-dismiss:focus">{{dismiss}}</a>', button should be inside quotes.

If we can get rid of these, pugixml works well in tree_walker traverse mode. Exploring workarounds.

Feb 02 '21 13:02 maneeshpm

@maneeshpm This is a really pertinent remark. pugixml seems indeed not the properly tool. Not sure for the moment how to proceed.

Feb 04 '21 10:02 kelson42

Got this problem again with a different, simpler use case. The ZIM is at http://tmp.kiwix.org/beer_meta.zim

When using zimcheck -U on it, it will raise a Not Found Article for static/css/mobile.css. You shouldn't worry about the rest of the errors nor the reason it's not finding the article. The fact that this is mentioned only within a <script /> is the problem here.

Actually, AFAIK, zimcheck raises also for <blockquote />, <pre /> and <code /> all of those should be excluded.

Jul 01 '21 15:07 rgaudin

@maneeshpm This is a really pertinent remark. pugixml seems indeed not the properly tool. Not sure for the moment how to proceed.

Gumbo (https://github.com/google/gumbo-parser) is probably a better parser to parse html. Fortunately, it is already a dependencies of zim-tools (through zimwriterfs), so we just have to use it :)

Jul 07 '21 14:07 mgautierfr

This looks interesting, an html5 specific parser like Gumbo is more suited for our need than depending on an XML based parser like Pugixml.

Jul 07 '21 14:07 maneeshpm

This ticket is clearly blocked by #331

Mar 23 '23 19:03 kelson42