warc2zim Zimit2: Is there an HTML parser issue with some special characters?

Given following HTML code:

<img width="200" src="image.png?param1=value1&param2=value2" />

Current HTML parser (in warc2zim2 branch) decode the src tag as image.png?param1=value1¶m2=value2, i.e. it consider that &para is an entity reference even if it has not trailing ; and decodes it as ¶.

I've found https://github.com/python/cpython/issues/69426 which is more or less related, but did not saw much progress since years. One person on this issue states that the URL with &para is an invalid one, but I'm not convinced at all, at least W3C validator (e.g. https://validator.w3.org/nu/#textarea) is happy with following HTML document:

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="utf-8">
    <title>Characters encoding in a ZIM</title>
</head>

<body>
<img width="200" alt="" src="image.png?param1=value1&param2=value2">
</body>
</html>

This issue still has to be further analyzed, but I wonder if we should switch to another HTML parser. This could even help us benefit from more advanced parsing functionality which could help the scraper cope with poorly written HTML document (some parsers are better than others at recovering from bad code when building the tree).

Mar 18 '24 10:03 benoit74

Edit: updated first comment URL, I simplified only one of the two URLs ...

Mar 18 '24 10:03 benoit74

Briefly (I'm travelling) it looks like an over-enthusiastic parser, as it shouldn't be interpreting HTML entities in URLs, let alone in querystrings. (There is the possibility that our querystring obfuscation is somehow also involved.) Also, entities without ; are surely not entities?!

Mar 20 '24 09:03 Jaifroid

There is the possibility that our querystring obfuscation is somehow also involved.

Bug happens with just the HTML code I pushed in first commit with a bare html parser, no link with any manipulation done in warc2zim.

Also, entities without ; are surely not entities?!

This is a debatable point of view according to Python folks, see upstream issue

Mar 21 '24 07:03 benoit74

We need anyway a better documentation of HTML rewriter at src/warc2zim/content_rewriting/html.py ; all functions are missing docstrings

Apr 29 '24 20:04 benoit74

The specifications

In HTML, we are supposed to inherit from XML constraints saying that &, < and > in attributes values and data must be escaped. E.g. & must be written &. At least this is what https://stackoverflow.com/a/7382028 suggests, but it lacks references to HTML standards.

On another hand, https://html.spec.whatwg.org/multipage/syntax.html#attributes-2 and https://html.spec.whatwg.org/multipage/syntax.html#character-references seems to clearly indicate that the ampersand is allowed in attribute value if it is not ambiguous ; otherwise it must be encoded e.g. with &.

So it means that:

<img src="image.png?param1=value1&param2=value2" /> is supposed to be ok and url should be interpreted as image.png?param1=value1&param2=value2
<img src="image.png?param1=value1&param;2=value2" /> is not ok because param is not a named character reference so it is ambiguous (but it is probably not a valid URL). To not be ambiguous it should have been written as <img src="image.png?param1=value1&param;2=value2" />
<img src="image.png?param1=value1¶2=value2" /> is ok but url is image.png?param1=value1¶m2=value2
<img src="image.png?param1=value1&param2=valu;e2" /> is also ok because the ; is after a non-alphanumeric, so not subject to ambiguities, so url should be interpreted as image.png?param1=value1&param2=value2
<img src="image.png?param1=value1&toutou;2=valu;e2" /> is also ok because the ; is after a non-alphanumeric

In other words: it is a mess! Not speaking about the legacy which most probably exist. I suggest that the good decision would be to always write such HTML as <img src="image.png?param1=value1&param2=value2" /> (always replace & character found in attributes values with a &) as suggested in StackOverflow.

Back to python

https://github.com/python/cpython/issues/69426 is exactly about our current problem, but as someone says on the issue, it is not clear that the solution is adequate since the link provided as example (close to mine) is not clearly 100% valid (it matches exactly the arguments above).

In any case, current implementation in Python 3.12 (and before), the html parser is always calling html.unescape on attributes values.

What's next

Since the Python html parser is always calling html.unescape, it is quite clear that in reverse we must always call html.escape on the attribute values, this is the bare minimum to produce valid HTML code. But it clearly means we've introducing more modifications inside the HTML code.

We however have no choice, because always calling html.unescape on attribute values is the right choice. This is needed because we might have for instance an image named ima"ge.png , and in this case the ZIM Path will look like ima"ge.png but the HTML code should be something like <img src="ima&QUOT;ge.png" /> for proper processing.

Changing to another parser like lxml parsers is not an option because these parsers are even more intrusive in how they interpret the HTML, fixing even more errors found in HTML than Python parser e.g. automatically adding missing tags, ...

Proposed actions

Document clearly this limitation in the README with a link to this issue / comment
Escape the HTML attribute values before pushing them back to HTML

May 16 '24 15:05 benoit74

@benoit74 There is a difference (IMHO) between an ampersand that is in a parameter value and an ampersand that is a separator. Separators are well defined in one of the specs (I'll try to find it later, have a student coming now) and it could be pretty problematic to encode separators except for encoding the entire querysring (which is a decision we've already taken).

In the example you give image.png?param1=value1&param2=value2 the separators are ?, =, and & (used to separate different parameter values). We enter great confusion territory if we confuse separators and what might be inside the values of the parameters. E.g. we might well have Don_Quijote.html?type=book&extrainfo=Prologue%20by%20Vergara%20&%20Vergara (in that case &amp is part of the extrainfo parameter's value...

However, in practice, I think ampersand would be rendered in such cases as %26 to avoid all confusion...

Isn't it simply the case that we need to pass the querystring in whatever format it is through encodeURIComponent()? Then I can simply decode it and get the correct string.

May 16 '24 15:05 Jaifroid

@Jaifroid your reasoning is at the URL level, i.e. you supposed that we know that this specific HTML attribute is a URL. This is not our case. In the HTML parsing, we are purely at the basic HTML level (close to XML somehow) and we just now we have a tag, and this tag has an attribute with a given name and value. The fact that this value is an URL is not known at all at this stage. The parser only needs to decide which value will be presented to our code. As far as I've tested, Python parser and lxml parsers (XML, HTML and HTML5) all decode the attribute value from any &xx.

Making HTML parsing more intelligent by knowing which attributes are supposed to contain a URL could be one option ... not a cheap one obviously.

Escape the HTML attribute values before pushing them back to HTML

In fact, escaping back the HTML attribute value is already done in the format_attr function, I misread this part of the codebase, so only documentation is missing from my PoV

May 16 '24 15:05 benoit74

OK, thanks. My only concern is that we can ultimately get something valid and vaguely well formatted out of whatever goes into the ZIM, since ultimately it's browsers that have to read the HTML. I keep forgetting that we don't really care what the URL looks like so long as the backend can map the strings to a specific entry in the ZIM. So, the only consideration is that the browser won't choke on whatever comes out. I have to say that having & inside a query string is something I've never seen and somehow offends my sense of what is correct formatting, but presumably what we'll actually get is %26amp%3B... (so that the full querystring is passed on encoded by Kiwix Serve...) Sorry I haven't been very helpful... Just trying to get my head round it. 🤖

May 16 '24 17:05 Jaifroid

I have to say that having & inside a query string is something I've never seen and somehow offends my sense of what is correct formatting

You are not supposed to really see this, the HTML parser you will use will already take care of that decoding right at the moment the content is read. This encoding is purely a file formatting issue, it has nothing to do with the "real" content.

For instance at the HTTP level, the request for fetching an image or following a link will all be made without this & encoded, this will have been removed way before.

But the same would happen for any other HTML tag attribute which may contain for instance single and double quote, most HTML attributes are not URLs. For instance an attribute named data-test with a value my "value" is 'secret' will be written in the HTML file as data-test="my "value" is 'secret'".

This is exactly why writing HTML "by hand" is a bad idea, you should at least use an HTML encoder which will take care of these subtleties. But no worries, I still write HTML "by hand" ^^ (and then I say that Python HTML parser is buggy ^^)

Just trying to get my head round it. 🤖

Me too (it took me 4 hours, so no worries), and I think that your questions help to make this issue clearer

May 16 '24 19:05 benoit74

Thanks for your kind words, @benoit74!

the HTML parser you will use will already take care of that decoding right at the moment the content is read

This is the bit where we have to be careful because KJS apps don't use the same stack as all the other Kiwix apps. However, I think you are talking in generic terms, not implying I need to run a parser in my JS backend that converts any of this. Our parser is the browser and all I need to take care of (hopefully) is that all URLs in Zimit2 should be sent to the backend as is (i.e. URI-encoded), whereas in Zimit1, I have to pass all URLs through decodeURIComponent() before sending them on.

May 17 '24 07:05 Jaifroid

This is the bit where we have to be careful because KJS apps don't use the same stack as all the other Kiwix apps.

Yep.

Our parser is the browser and all I need to take care of (hopefully) is that all URLs in Zimit2 should be sent to the backend as is (i.e. URI-encoded), whereas in Zimit1, I have to pass all URLs through decodeURIComponent() before sending them on.

I don't know tbh. I assume the parser will already have decoded the & and stuff like this. I will push such a test case to the test website so that it will be easier to test. Regarding the need to call decodeURIComponent(), I would say it depends again on what the browser already decodes for you. There is plenty a test cases on the test website, so it will be easy to assess.

But what I'm sure we want to achieve is that you should be able to do the same as you do for all other scrapers. No special handling of Zimit2 ZIMs.

May 17 '24 09:05 benoit74

In fact we already have the case on the test website with this image URL:

Source code of HTML page:

Firefox inspector of HTML page:

Firefox raw response in network of HTML page:

We see that Firefox parser has already replaced the & by a &

May 17 '24 12:05 benoit74

@benoit74 Thank you very much for the test cases. Once we have a test ZIM, I'll also test manually on the platforms we support (roughly, Chrome, Chrome 58+, Edge, Edge Legacy 17+, IE11, Firefox 60+, Safari 11.3+).

May 17 '24 13:05 Jaifroid

Fixed by https://github.com/openzim/warc2zim/pull/259

May 21 '24 09:05 benoit74