simple_html_css_flutter Parse error on seemingly valid HTML

I am trying to run the following HTML through the toRichText function: Age (use days if &lt;1 month, months if &lt;1 year):

This results in:

flutter: simple_html_css Exception: Closure: () => String from Function 'toString':.: <!DOCTYPE expected at 1:21
flutter: simple_html_css Stack Trace: #0      XmlEventIterator.moveNext (package:xml/src/xml_events/iterator.dart:35:9)
flutter: #1      Parser.parse (package:simple_html_css/src/internals.dart:150:34)
flutter: #2      HTML.toTextSpan (package:simple_html_css/src/html_stylist.dart:83:21)

I get the same result if I use < instead of &lt;.

Mar 31 '21 21:03 tiloc

Sorry for the delay. I'm a little busy these days. I'll look into it as soon as I have the time.

Apr 23 '21 08:04 ali-thowfeek

@tiloc Sorry, as of now there is no easy way around this. If you try to fix it one way, it breaks another way. This issue needs more time and might involve quite a bit of refactoring, which I unfortunately can't afford at the moment. It comes from two dependencies of this package: html_unescape and xml. You can see a similar case in this issue here in the xml repo.

For your use case, if you know that the opening angle brackets (<) of HTML tags won't be escaped as &lt;, then you can work around this error yourself by forking this package and adding some code, which I can help with if that's the case.

Apr 24 '21 18:04 ali-thowfeek

Ok, thanks for the background info. Unfortunately, I also have very little time I can invest. So I guess this will just be a known limitation for the time being.

Apr 24 '21 21:04 tiloc

+1

Feb 16 '23 14:02 MaximilianFlechtner

+1

Mar 01 '23 08:03 ObranS

@ali-thowfeek Hello. We recently ran into this issue as well when using this package.

From what we could see the problem is apparently in this line when using HtmlUnescape. If we remove the unescaping, everything works as expected (at least from our side).

- final Parser parser = Parser(context, HtmlUnescape().convert(content),
+ final Parser parser = Parser(context, content,

The example string in the issue comment, Age (use days if &lt;1 month, months if &lt;1 year):, would then be passed as-is to the parser instead of being converted to Age (use days if <1 month, months if <1 year):.

That said, we see this is the only place where html_unescape is being used in the package, so in all likelihood the unescaping is being done on purpose, and we're not sure if this would end up causing some unintended behaviors.

Would it be fine to remove the unescaping here in a PR?

Oct 20 '23 03:10 fjwong

Hi @fjwong, it's been quite a while since I worked on this. I see your problem. In that case, instead of removing the unescaping, would you be able to create a PR that unescapes by default (so that it stays backward compatible) and lets callers disable unescaping by passing an optional boolean flag? Although this will not solve the issue, it would at least give a way to bypass it. CC: @nohli, any comments?
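A minimal sketch of what such an opt-out flag could look like. The names `prepareContent` and `unescape` are hypothetical and are not the package's API, and `basicUnescape` is only a stand-in for `HtmlUnescape().convert` so that the example runs with the Dart core library alone.

```dart
/// Stand-in for HtmlUnescape().convert (assumption): handles only the three
/// entities needed for this illustration.
String basicUnescape(String input) => input
    .replaceAll('&lt;', '<')
    .replaceAll('&gt;', '>')
    .replaceAll('&amp;', '&');

/// Hypothetical helper: unescapes by default, so existing callers keep the
/// current behaviour; passing `unescape: false` hands the content to the
/// parser untouched.
String prepareContent(String content, {bool unescape = true}) =>
    unescape ? basicUnescape(content) : content;

void main() {
  const html = '<p>Age (use days if &lt;1 month, months if &lt;1 year):</p>';
  // Default: a raw "<1" reaches the XML parser.
  print(prepareContent(html));
  // Opt-out: "&lt;1" stays escaped and is left for the XML parser to decode.
  print(prepareContent(html, unescape: false));
}
```

Making the flag a named parameter with a default of `true` keeps existing call sites source-compatible, which is the backward-compatibility goal stated above.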

Oct 20 '23 09:10 ali-thowfeek

If it's a bug, we could think about a breaking change and removing it.

The bugfix should be the default imho.

Any idea why the package uses html_unescape? Does removing it break functionality?

Oct 20 '23 09:10 nohli

@nohli My initial requirements had a lot of text with HTML-escaped characters, so I included it in the package. But removing it won't actually break anything. The only thing is that callers would have to pass unescaped strings for rendering, as RichText doesn't unescape by itself.

Nevertheless, removing unescaping would solve this error only if the < > in the string are escaped as &lt; &gt;. Even then, the characters < and > and any other escaped characters would not be rendered correctly; they would be rendered as &lt; and &gt; etc. And in the case where they are not escaped, we would still have this error even after removing unescaping. Thus I believe unescaping is not the cause of this issue.

Angle brackets are always going to be a problem, unless someone can find another way.

One possible way, which I haven't tried, is enclosing any occurrence of angle brackets in CDATA, like below: <![CDATA[Age (use days if < 1 month, months if < 1 year):]]> This is valid XML. If this works, we can keep unescaping, and it would solve this issue fully.

Oct 20 '23 10:10 ali-thowfeek

@ali-thowfeek @nohli I've submitted a PR with the changes suggested in https://github.com/ali-thowfeek/simple_html_css_flutter/issues/17#issuecomment-1772417562.

Unescaping is done by default, but please let me know if you would prefer to have this inverted, or have any further changes in the PR.

> Nevertheless removing unescaping would solve this error only if the string which has < > is escaped as &lt; &gt;, still in this case the characters: < and > and any other escaped characters are not going to be rendered correctly rather it'll be rendered as &lt; and &gt; etc.

I guess the concern is with other escaped entities such as &times; (if present in the HTML content), which would not be rendered as expected if unescaping is removed, since these are escaped entities in HTML but not in XML, and are thus ignored by the parser.

I'm not sure if this would be a better or worse alternative, but I guess we could keep unescaping and just handle < and > separately like:

content = content.replaceAll('&lt;', '&amp;lt;'); // After unescaping, this would be converted back to &lt; to be readable by the XML parser
content = content.replaceAll('&gt;', '&amp;gt;');

(For example, with the current implementation, Age (use days if &lt;1 month, months if &lt;1 year): can be rendered as intended if it is instead written as Age (use days if &amp;lt;1 month, months if &amp;lt;1 year):. The above would just do this replacement under the hood.)

It feels a bit hacky-ish though, and I'm not sure if this would result in other unintended behaviors.
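To make the round trip concrete, here is a small sketch of the double-escaping idea. `basicUnescape` is a stand-in for `HtmlUnescape().convert` that covers only the entities used here, and `protectAngleBrackets` is a hypothetical helper name; neither is part of the package.

```dart
/// Stand-in for HtmlUnescape().convert (assumption): decodes one level of
/// escaping for the entities used below, with &times; as a sample
/// HTML-only entity.
String basicUnescape(String input) => input
    .replaceAll('&lt;', '<')
    .replaceAll('&gt;', '>')
    .replaceAll('&times;', '×')
    .replaceAll('&amp;', '&');

/// Double-escapes &lt; and &gt; so that they survive one round of unescaping.
String protectAngleBrackets(String content) => content
    .replaceAll('&lt;', '&amp;lt;')
    .replaceAll('&gt;', '&amp;gt;');

void main() {
  const html = '<p>3 &times; 2 &gt; 5, age &lt;1 month</p>';

  // Without protection the stray "<" is not well-formed XML.
  print(basicUnescape(html));
  // <p>3 × 2 > 5, age <1 month</p>

  // With protection, &times; is still decoded but the brackets stay escaped.
  print(basicUnescape(protectAngleBrackets(html)));
  // <p>3 × 2 &gt; 5, age &lt;1 month</p>
}
```

The second output keeps `&lt;` and `&gt;` as XML entities, so a strict XML parser can decode them itself, while HTML-only entities are already resolved.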

> One possible way, which I haven't tried is enclosing any occurrence of angle brackets with CDATA like below: <![CDATA[Age (use days if < 1 month, months if < 1 year):]]> This is valid XML. And we can have unescaping around, if this would work. And it would solve this issue fully.

Just in case, I tested doing this on my side and, while the content appears to be parsed as valid XML and doesn't throw any errors, the text content within CDATA is not rendered at all.

In any case, please let me know if you have any feedback on the matter.

Oct 22 '23 02:10 fjwong

Fixed by https://github.com/ali-thowfeek/simple_html_css_flutter/pull/29, available in v4.0.1. Unescaping will be removed in the upcoming major release to avoid tight coupling.

Also, the issue is fixed in the xml dependency: https://github.com/renggli/dart-xml/issues/123. Hence, removing unescaping from my package will allow rendering text such as Age (use days if &lt;1 month, months if &lt;1 year): without issues.

A known issue remains: if the text contains an unescaped < within the tags, it will still fail because of the strictness of xml.

May 19 '24 06:05 ali-thowfeek
