Incorrect spans generated for HTML with higher-plane unicode characters

Open filiph opened this issue 7 years ago • 4 comments

When parsing HTML that includes characters outside the Basic Multilingual Plane, such as "🍋", the start and end FileLocations of the generated spans are incorrect.

Here's a short repro:

import 'package:html/dom.dart';
import "package:html/parser.dart";
import "package:source_span/source_span.dart";

void main() {
  final dom = parse(contents, generateSpans: true);
  final Element element = dom.querySelectorAll("link").single;
  final span = element.sourceSpan;
  final spanCopy = new SourceSpan(span.start, span.end, contents);
}

const contents = """
<head>
    <meta charset="UTF-8">
    <title></title>
    <link rel="alternate" type="application/rss+xml" title="ArtLung &raquo; Limones 🍋 Comments Feed" href="subdirectory/other.html" />
</head>
""";

This will throw the following error:

Unhandled exception:
Invalid argument(s): Text "<head>
    <meta charset="UTF-8">
    <title></title>
    <link rel="alternate" type="application/rss+xml" title="ArtLung &raquo; Limones 🍋 Comments Feed" href="subdirectory/other.html" />
</head>
" must be 130 characters long.
#0      new SourceSpanBase (package:source_span/src/span.dart:85:7)
#1      new SourceSpan (package:source_span/src/span.dart:34:11)
#2      main (file:///Users/filiph/dev/linkcheck/test/source_span_bug.dart:9:24)
#3      _startIsolate.<anonymous closure> (dart:isolate-patch/isolate_patch.dart:265)
#4      _RawReceivePortImpl._handleMessage (dart:isolate-patch/isolate_patch.dart:151)

This is not an issue with package:source_span: when I create the span manually, without parse(), copying it works fine.
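One possible explanation, consistent with the "130 characters" in the error message but not confirmed in this thread, is that the offsets are counted in Unicode code points while Dart strings (and therefore source_span) count UTF-16 code units. The sketch below uses only dart:core and the <link> line from the repro to show the two counts differ by one. The helper codeUnitOffset is hypothetical and is not part of package:html or package:source_span.

```dart
// The <link> line from the repro. The lemon is U+1F34B, which lies
// outside the Basic Multilingual Plane and needs a surrogate pair.
const link = '<link rel="alternate" type="application/rss+xml" '
    'title="ArtLung &raquo; Limones 🍋 Comments Feed" '
    'href="subdirectory/other.html" />';

/// Hypothetical helper: converts an offset counted in code points into
/// the matching offset in UTF-16 code units for [text].
int codeUnitOffset(String text, int codePointOffset) {
  var units = 0;
  var seen = 0;
  for (final rune in text.runes) {
    if (seen == codePointOffset) break;
    // Code points above U+FFFF occupy two UTF-16 code units.
    units += rune > 0xFFFF ? 2 : 1;
    seen++;
  }
  return units;
}

void main() {
  print(link.length); // UTF-16 code units: 131
  print(link.runes.length); // Unicode code points: 130
  print(codeUnitOffset(link, link.runes.length)); // 131
}
```

If that assumption holds, every offset after the first higher-plane character would be short by one code unit per such character, which matches the length mismatch reported above.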

filiph, Mar 21 '18 23:03

Hi, friendly nudge. This prevents package:html from being used with HTML that includes Unicode characters in attributes, which describes an increasing share of HTML documents (according to bugs reported to linkcheck).

filiph, May 27 '19 19:05

I've created a pull request with a fix: https://github.com/dart-lang/html/pull/109

cvolzke4, Aug 30 '19 02:08

Carriage returns also affect the file location start and end points.
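A minimal sketch of how carriage returns could shift offsets, assuming the parser collapses "\r\n" to "\n" before computing locations (the HTML specification's input preprocessing does this; whether it is the cause here is an assumption). The sample markup and the helper normalizeNewlines are hypothetical.

```dart
// Hypothetical sample document with Windows-style line endings.
const original = '<p>\r\n<a href="x.html">x</a>\r\n</p>';

/// Hypothetical helper: collapses each CRLF pair into a single LF.
String normalizeNewlines(String text) => text.replaceAll('\r\n', '\n');

void main() {
  final normalized = normalizeNewlines(original);
  // Offset of the <a> tag in the text as written.
  print(original.indexOf('<a')); // 5
  // Offset of the same tag after normalization: one less per CRLF before it.
  print(normalized.indexOf('<a')); // 4
}
```

An offset computed on the normalized text and then applied to the original string would point one character too early for every "\r\n" that precedes it.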

cvolzke4, Aug 30 '19 07:08

Hey there, thanks for the great work. Now that the fix is merged, would it be possible to release a new version?

We're stuck on this issue downstream (see https://github.com/filiph/linkcheck/issues/35).

b4stien, Sep 27 '19 13:09