BIRT 4.14: HTML output: text content parsed incorrectly / changed parsing behavior: element "TEXT", type "HTML", tag "script", script string containing an internal HTML tag

I found that the parsing of the "TEXT" element with text type "HTML" has changed, which can produce wrong HTML content in the HTML output.

I created a TEXT element of type HTML containing a script tag with a small script that performs a special task later, at HTML-output level on the viewer side:

<script>
var elemText = "<div>Line 01</div>";
console.log("elemText: " + elemText);
</script>

Up to version 4.13: string parsing for HTML output, OK: "</div>" is output as "</div>"
Since version 4.14: string parsing for HTML output, FAILED: "</div>" is output as "</di>"
If a space is placed between the slash and "div", the string is parsed correctly: "</ div>" is output as "</ div>"
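
One way to sidestep the problem without inserting a space is to keep the literal character sequence `</` out of the script source. The sketch below assumes that the Tidy parser only reacts to `</` followed directly by a letter, which matches the observation that "</ div>" passes through unchanged; it has not been verified against BIRT 4.14, and the variable `elemTextSplit` is only introduced for illustration.

```javascript
// Variant 1: escape the slash. In a JavaScript string literal "\/" is just "/",
// so the run-time value is the same as in the original script.
var elemText = "<div>Line 01<\/div>";

// Variant 2 (hypothetical alternative): split the string so that "<" and "/"
// never appear next to each other in the script source.
var elemTextSplit = "<div>Line 01<" + "/div>";

console.log("elemText: " + elemText);
```

Both variants build the same string at run time; only the source text that the HTML parser sees is different.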

I have tested versions 4.6, 4.8, 4.10 and 4.12: result OK.
I have tested version 4.14 and the 4.15 milestone: result fails.

But I cannot see any changes in the classes involved: HTMLTextParser, HTMLParser, TextParser

String parsing: ok (screenshot: text-html-ok)

String parsing: failed (screenshot: text-html-failed)

Demo report: text_html_script_parsing_birt_4.6.zip

Mar 16 '24 09:03 speckyspooky

Further investigation shows that, for each TEXT element of type HTML, the HTMLTextParser calls the Tidy parser to convert the content from a Tidy tree into a DOM tree.

But here is the problem: the input in my case is correct, but the output of the parse process contains the changed string, so "</div>" is shown as "</di>".

Any ideas on how to avoid this parsing behavior are welcome.

(screenshot: tidy-parsing)

Mar 16 '24 10:03 speckyspooky

Disclaimer: This is just a wild guess, because I haven't really worked on the HTML front end for more than 20 years.

Before the call to tidy.parseDOM, there's a call to tidy.setXHTML(true). XHTML is slightly different from HTML. Maybe the parser is more pedantic in that case, or treats CDATA/PCDATA and the special characters < and > in script elements differently.

I don't know if HTML or XML parsing code has changed inside BIRT itself (apart from CSS parsing, IIRC), but maybe the newer BIRT releases use newer releases of HTMLTidy and behavior has changed there?

BTW, we don't see the complete generated output file, e.g. everything before the body element. Browsers are often forgiving of incorrect input, but is there a good HTML linter you could use to check whether the output fulfils the specification, for example if you use a script like this:

var elemText = '<div ';

Maybe the new behavior is correct and the old one wasn't?
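
For a rough check without a full linter: the HTML 4.01 specification (appendix B.3.2) says that script data ends at the first `</` that is followed by a name start character, so a string such as "</div>" inside a script element is formally an end tag. The sketch below only looks for that pattern in a script source. `findEndTagOpeners` is a hypothetical helper, and it is an assumption that Tidy applies a comparable rule.

```javascript
// Hypothetical helper: return the positions in a script source where an
// HTML 4 style parser could see the start of an end tag ("</" plus a letter).
function findEndTagOpeners(scriptSource) {
  var positions = [];
  var pattern = /<\/[A-Za-z]/g;
  var match;
  while ((match = pattern.exec(scriptSource)) !== null) {
    positions.push(match.index);
  }
  return positions;
}

// The script from the report contains one such position, the "</div>".
var reported = findEndTagOpeners('var elemText = "<div>Line 01</div>";');
console.log("suspicious positions: " + reported.length);
```

A source that contains "</ div>" or an escaped `<\/div>` yields an empty list with this check, which fits the behavior described in the report.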

Mar 19 '24 07:03 hvbtup

I did some further research on this topic and found the following page: https://infohound.net/tidy/tidy.pl

There I was able to test my HTML text fragment, and the result is that the current behavior is the correct one; the old behavior was wrong and has been fixed (it seems to me).

I solved my problem in another way.

I will close the issue because we don't have to change anything.

Mar 20 '24 20:03 speckyspooky