Unable to parse with non-UTF-8 charset
- SDK: 2.9.1
- html package: 0.14.0+3
A localized web page containing the following tag within its head is not decoded correctly:
<meta http-equiv="content-type" content="text/html; charset=iso-8859-2" />
There are actually several problems:
- To trigger any content-conversion logic, `HtmlParser::parse()` must be called with the `input` parameter given as `List<int>` or `Uint8List`. When it is given as a `String`, it is always assumed to be UTF-8 encoded, which produces wrong text.
- The tag above is currently ignored by `HtmlParser` even if the input is passed as `List<int>`. Internally, `ContentAttrParser::parse()` reads the unquoted `charset` content as an empty string.
- Encoding detection assumes the declaration is located within the first 512 bytes, and this limit can't be changed via any parameter, so the `meta` tag is still skipped in some cases.
- Even if the buggy behavior is fixed, the code crashes later in the `html_input_stream.dart` method `_decodeBytes()`, as currently only UTF-8 and ASCII encodings are supported. I understand that those are the only two supported by Dart for now, but there is no way to inject a custom decoder to handle this encoding, and the code ends up with an `ArgumentError`.
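
To make the expected behavior concrete, here is a minimal, language-neutral sketch of the pre-decoding step described in the list above. It is written in Python purely for illustration and is not the `html` package's API: `sniff_charset`, `decode_html`, and `scan_limit` are hypothetical names. It reads an unquoted `charset` from the `content` attribute, lets the caller raise the 512-byte scan window, and falls back instead of throwing when the codec is unknown.

```python
import re

# Any <meta ...> tag in the raw bytes.
_META_RE = re.compile(rb"<meta\b[^>]*>", re.IGNORECASE)
# charset=... either as its own attribute or unquoted inside content="...".
_CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""", re.IGNORECASE)


def sniff_charset(data: bytes, scan_limit: int = 512, default: str = "utf-8") -> str:
    """Return the charset declared by a <meta> tag within the first scan_limit bytes."""
    for meta in _META_RE.finditer(data[:scan_limit]):
        match = _CHARSET_RE.search(meta.group(0))
        if match:
            return match.group(1).decode("ascii").lower()
    return default


def decode_html(data: bytes, scan_limit: int = 512) -> str:
    """Decode raw HTML bytes using the declared charset, with a UTF-8 fallback."""
    charset = sniff_charset(data, scan_limit)
    try:
        return data.decode(charset, errors="replace")
    except LookupError:
        # Unknown codec name: degrade gracefully rather than raise.
        return data.decode("utf-8", errors="replace")


META = b'<meta http-equiv="content-type" content="text/html; charset=iso-8859-2" />'
TEXT = "Žluťoučký kůň"
page = b"<html><head>" + META + b"</head><body>" + TEXT.encode("iso-8859-2") + b"</body></html>"
# Same page, but with the meta tag pushed beyond the first 512 bytes.
padded = b"<!--" + b"x" * 600 + b"-->" + page
```

With the default window, `decode_html(page)` recovers the text, while `decode_html(padded)` misses the declaration and mis-decodes the body as UTF-8. Passing a larger `scan_limit` fixes the second case, which is the kind of parameter the parser currently lacks.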
me too