XML
XML hangs in Web::Scraper
related to #63?
$ git clone https://github.com/tony-o/perl6-web-scraper
$ cd perl6-web-scraper
$ raku -I. -MXML -e "from-xml('t/data/s05.html'.IO.slurp)"
This seems to hang: it keeps a single CPU core busy but never returns (at least not within my limited patience).
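To reproduce this without waiting indefinitely, the parse can be run in a worker thread with a deadline. This is only a sketch: the 10-second limit is an arbitrary choice, and it assumes the XML module is installed and the script is run from the perl6-web-scraper checkout.

```raku
use XML;

# Same input file as in the one-liner above.
my $html = 't/data/s05.html'.IO.slurp;

# Run the parse on a worker thread so the main thread can give up on it.
my $parse = start { from-xml($html) };

# Wait for whichever comes first: the parse finishing or the deadline.
# The 10 seconds here is an arbitrary assumption, not a measured bound.
await Promise.anyof($parse, Promise.in(10));

given $parse.status {
    when Kept   { say 'parsed OK' }
    when Broken { note "parse failed: { $parse.cause }" }
    default     { note 'still parsing after 10s, giving up'; exit 1 }
}
```

The worker thread is not cancelled; the script simply exits while it is still running, which is enough to tell a hang apart from a quick failure.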
Mac M2:
$ sw_vers
ProductName: macOS
ProductVersion: 14.1.2
BuildVersion: 23B92
$ xmllint ../../perl6-web-scraper/t/data/s05.html
../../perl6-web-scraper/t/data/s05.html:50: parser error : xmlParseEntityRef: no name
else if (element.events && element.events[type] && handler.$$guid)
^
../../perl6-web-scraper/t/data/s05.html:50: parser error : xmlParseEntityRef: no name
else if (element.events && element.events[type] && handler.$$guid)
^
../../perl6-web-scraper/t/data/s05.html:50: parser error : xmlParseEntityRef: no name
else if (element.events && element.events[type] && handler.$$guid)
^
../../perl6-web-scraper/t/data/s05.html:50: parser error : xmlParseEntityRef: no name
else if (element.events && element.events[type] && handler.$$guid)
^
../../perl6-web-scraper/t/data/s05.html:135: parser error : StartTag: invalid element name
for (i = 0; i < col.length; i++)
^
../../perl6-web-scraper/t/data/s05.html:143: parser error : StartTag: invalid element name
for (var i = 0, j = divs.length; i < j; i++) {
^
../../perl6-web-scraper/t/data/s05.html:145: parser error : xmlParseEntityRef: no name
if (curr.id && curr.id.match(/smartlink_(\d+)/)) {
^
../../perl6-web-scraper/t/data/s05.html:145: parser error : xmlParseEntityRef: no name
if (curr.id && curr.id.match(/smartlink_(\d+)/)) {
^
../../perl6-web-scraper/t/data/s05.html:149: parser error : StartTag: invalid element name
for (var k = 0, l = toBeRemoved.length; k < l; k++) {
^
../../perl6-web-scraper/t/data/s05.html:170: parser error : xmlParseEntityRef: no name
if ((end.nodeType == 3) && (end.nodeValue.search(/:$/) > -1)) {
^
../../perl6-web-scraper/t/data/s05.html:170: parser error : xmlParseEntityRef: no name
if ((end.nodeType == 3) && (end.nodeValue.search(/:$/) > -1)) {
^
../../perl6-web-scraper/t/data/s05.html:184: parser error : xmlParseEntityRef: no name
if (location.hash && location.hash.match(/#.+/)) location.hash = RegExp.last
^
../../perl6-web-scraper/t/data/s05.html:184: parser error : xmlParseEntityRef: no name
if (location.hash && location.hash.match(/#.+/)) location.hash = RegExp.last
^
../../perl6-web-scraper/t/data/s05.html:208: parser error : Opening and ending tag mismatch: link line 6 and head
</head>
^
../../perl6-web-scraper/t/data/s05.html:226: parser error : Opening and ending tag mismatch: br line 225 and em
(<a href="https://github.com/perl6/specs/">syn</a> <strong>8d47115</strong>)</em
^
../../perl6-web-scraper/t/data/s05.html:227: parser error : Entity 'nbsp' not defined
[ <a href="http://design.perl6.org/">Index of Synopses</a> ]<br>
^
../../perl6-web-scraper/t/data/s05.html:243: parser error : Opening and ending tag mismatch: li line 242 and ul
</ul>
^
../../perl6-web-scraper/t/data/s05.html:251: parser error : Opening and ending tag mismatch: li line 250 and ul
</ul>
^
../../perl6-web-scraper/t/data/s05.html:255: parser error : Opening and ending tag mismatch: li line 254 and ul
</ul>
^
../../perl6-web-scraper/t/data/s05.html:283: parser error : Opening and ending tag mismatch: li line 282 and ul
</ul>
^
../../perl6-web-scraper/t/data/s05.html:285: parser error : Opening and ending tag mismatch: li line 284 and ul
</ul>
^
../../perl6-web-scraper/t/data/s05.html:296: parser error : Opening and ending tag mismatch: li line 295 and ul
</ul>
^
../../perl6-web-scraper/t/data/s05.html:297: parser error : Opening and ending tag mismatch: li line 294 and div
</div>
^
../../perl6-web-scraper/t/data/s05.html:505: parser error : Entity 'ndash' not defined
/master/S05-mass/rx.t#L94-L245"><code>S05-mass/rx.t</code> lines <code>94–
^
../../perl6-web-scraper/t/data/s05.html:561: parser error : Entity 'ndash' not defined
t#L6-L51"><code>S05-metasyntax/longest-alternative.t</code> lines <code>6–
^
[... many more cases of ndash not being defined elided ...]
There's a bit of catastrophic backtracking happening in the XML grammar, for various reasons. With that worked around, the grammar still chokes on a < inside the JavaScript code near the start of the file. I'm not sure the XML grammar is a good fit for parsing HTML at all, except for XHTML, which is now exceedingly rare in the wild.
There are also <br> tags, which the XML grammar doesn't know need no closing tag. Ideally that would cause a quick parse failure as soon as a mismatched close tag turns up shortly afterwards. Perhaps more likely, though, the grammar tries to parse to the end of the document and then backtracks heavily in an attempt to find some other interpretation.
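The two HTML-isms described above can be isolated in a few tiny inputs. The snippet below is a sketch rather than a test from the module: the sample strings are made up for illustration, and it assumes that from-xml either throws or returns something undefined when the input is not well-formed, which I have not verified for every case.

```raku
use XML;

# Hypothetical minimal inputs, not taken from s05.html itself.
my @samples =
    '<script>for (i = 0; i < n; i++) {}</script>',                       # raw "<" in script text
    '<script><![CDATA[for (i = 0; i < n; i++) {}]]></script>',           # same text inside CDATA
    '<p>one<br>two</p>',                                                 # HTML style: <br> never closed
    '<p>one<br/>two</p>';                                                # XHTML style: self-closing

for @samples -> $src {
    # try turns a thrown parse error into an undefined result.
    my $doc = try from-xml($src);
    say ($doc.defined ?? 'parsed:   ' !! 'rejected: ') ~ $src;
}
```

If the assumption holds, the first and third samples should be rejected quickly on inputs this small, which would suggest the hang only shows up when the same constructs appear in a large document.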