XML hangs in Web::Scraper

related to #63?

$ git clone https://github.com/tony-o/perl6-web-scraper
$ cd perl6-web-scraper
$ raku -I. -MXML -e "from-xml('t/data/s05.html'.IO.slurp)"

This seems to hang: it pins a single CPU core but never returns (at least not within my limited patience).

Mac M2:

$ sw_vers
ProductName:		macOS
ProductVersion:		14.1.2
BuildVersion:		23B92

Jan 06 '24 03:01 coke

$ xmllint ../../perl6-web-scraper/t/data/s05.html
../../perl6-web-scraper/t/data/s05.html:50: parser error : xmlParseEntityRef: no name
  else if (element.events && element.events[type] && handler.$$guid)
                           ^
../../perl6-web-scraper/t/data/s05.html:50: parser error : xmlParseEntityRef: no name
  else if (element.events && element.events[type] && handler.$$guid)
                            ^
../../perl6-web-scraper/t/data/s05.html:50: parser error : xmlParseEntityRef: no name
  else if (element.events && element.events[type] && handler.$$guid)
                                                   ^
../../perl6-web-scraper/t/data/s05.html:50: parser error : xmlParseEntityRef: no name
  else if (element.events && element.events[type] && handler.$$guid)
                                                    ^
../../perl6-web-scraper/t/data/s05.html:135: parser error : StartTag: invalid element name
  for (i = 0; i < col.length; i++)
                 ^
../../perl6-web-scraper/t/data/s05.html:143: parser error : StartTag: invalid element name
  for (var i = 0, j = divs.length; i < j; i++) {
                                      ^
../../perl6-web-scraper/t/data/s05.html:145: parser error : xmlParseEntityRef: no name
    if (curr.id && curr.id.match(/smartlink_(\d+)/)) {
                 ^
../../perl6-web-scraper/t/data/s05.html:145: parser error : xmlParseEntityRef: no name
    if (curr.id && curr.id.match(/smartlink_(\d+)/)) {
                  ^
../../perl6-web-scraper/t/data/s05.html:149: parser error : StartTag: invalid element name
      for (var k = 0, l = toBeRemoved.length; k < l; k++) {
                                                 ^
../../perl6-web-scraper/t/data/s05.html:170: parser error : xmlParseEntityRef: no name
      if ((end.nodeType == 3) && (end.nodeValue.search(/:$/) > -1)) {
                               ^
../../perl6-web-scraper/t/data/s05.html:170: parser error : xmlParseEntityRef: no name
      if ((end.nodeType == 3) && (end.nodeValue.search(/:$/) > -1)) {
                                ^
../../perl6-web-scraper/t/data/s05.html:184: parser error : xmlParseEntityRef: no name
    if (location.hash && location.hash.match(/#.+/)) location.hash = RegExp.last
                       ^
../../perl6-web-scraper/t/data/s05.html:184: parser error : xmlParseEntityRef: no name
    if (location.hash && location.hash.match(/#.+/)) location.hash = RegExp.last
                        ^
../../perl6-web-scraper/t/data/s05.html:208: parser error : Opening and ending tag mismatch: link line 6 and head
</head>
       ^
../../perl6-web-scraper/t/data/s05.html:226: parser error : Opening and ending tag mismatch: br line 225 and em
(<a href="https://github.com/perl6/specs/">syn</a> <strong>8d47115</strong>)</em
                                                                               ^
../../perl6-web-scraper/t/data/s05.html:227: parser error : Entity 'nbsp' not defined
        &nbsp; [ <a href="http://design.perl6.org/">Index of Synopses</a> ]<br>
              ^
../../perl6-web-scraper/t/data/s05.html:243: parser error : Opening and ending tag mismatch: li line 242 and ul
  </ul>
       ^
../../perl6-web-scraper/t/data/s05.html:251: parser error : Opening and ending tag mismatch: li line 250 and ul
  </ul>
       ^
../../perl6-web-scraper/t/data/s05.html:255: parser error : Opening and ending tag mismatch: li line 254 and ul
  </ul>
       ^
../../perl6-web-scraper/t/data/s05.html:283: parser error : Opening and ending tag mismatch: li line 282 and ul
    </ul>
         ^
../../perl6-web-scraper/t/data/s05.html:285: parser error : Opening and ending tag mismatch: li line 284 and ul
  </ul>
       ^
../../perl6-web-scraper/t/data/s05.html:296: parser error : Opening and ending tag mismatch: li line 295 and ul
</ul>
     ^
../../perl6-web-scraper/t/data/s05.html:297: parser error : Opening and ending tag mismatch: li line 294 and div
</div>
      ^
../../perl6-web-scraper/t/data/s05.html:505: parser error : Entity 'ndash' not defined
/master/S05-mass/rx.t#L94-L245"><code>S05-mass/rx.t</code> lines <code>94&ndash;
                                                                               ^
../../perl6-web-scraper/t/data/s05.html:561: parser error : Entity 'ndash' not defined
t#L6-L51"><code>S05-metasyntax/longest-alternative.t</code> lines <code>6&ndash;
                                                                               ^
[... many more cases of ndash not being defined elided ...]

There's a bit of catastrophic backtracking happening in the XML grammar, for various reasons. With that worked around, the grammar still chokes on a < inside the JavaScript code near the start of the file. I'm not sure the XML grammar is a good fit for parsing HTML at all, except for XHTML, which is now exceedingly rare in the wild.
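To illustrate the mismatch between XML rules and real-world HTML, here is a minimal sketch using only Python's standard library (not the Raku XML module discussed in this issue). The snippet is a made-up stand-in for the start of s05.html, and the helper names `parses_as_xml` and `html_tags` are hypothetical. It suggests that a strict XML parser rejects a bare && or < inside a script element, while a lenient HTML parser treats the script body as raw text:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical stand-in for the start of t/data/s05.html:
# inline JavaScript containing a bare "&&" and a bare "<".
SNIPPET = """<html><head><script>
if (a && b) { for (i = 0; i < n; i++) {} }
</script></head><body><p>ok</p></body></html>"""


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text as well-formed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record the start tags seen by Python's lenient HTML parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(text):
    """Return the list of start tags the HTML parser reports for the text."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print("well-formed XML?", parses_as_xml(SNIPPET))
    print("HTML start tags:", html_tags(SNIPPET))
```

This says nothing about why the Raku grammar backtracks; it only shows that input like this is outside what an XML parser is supposed to accept, which matches the xmlParseEntityRef and StartTag errors from xmllint above.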

Feb 23 '25 00:02 timo

There are also <br> tags, and the XML grammar doesn't know that they don't need a closing tag. Optimally, that would cause a quick parse failure when a mismatched close tag occurs soon after. Perhaps more likely, though, it leads to an attempt to parse until the end of the document, followed by a lot of backtracking in search of some different interpretation.
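The effect of not knowing about void elements can be sketched with a simple tag-balance check, again in Python's standard library rather than the Raku grammar. The class `BalanceChecker` and the function `first_mismatch` are hypothetical names for this illustration; the `VOID` set is the list of void elements from the HTML standard. With an empty void set (XML rules), an unclosed <br> is still open when </em> arrives, which is the same "br ... and em" mismatch xmllint reports above:

```python
from html.parser import HTMLParser

# HTML void elements: they never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class BalanceChecker(HTMLParser):
    """Find the first close tag that does not match the innermost open tag."""

    def __init__(self, void_elements):
        super().__init__()
        self.void_elements = void_elements
        self.stack = []       # currently open (non-void) elements
        self.mismatch = None  # (innermost_open_tag, close_tag) or None

    def handle_starttag(self, tag, attrs):
        # Void elements are complete as soon as they are opened.
        if tag not in self.void_elements:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Stop after the first mismatch; ignore stray closes of void elements.
        if self.mismatch is not None or tag in self.void_elements:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            innermost = self.stack[-1] if self.stack else None
            self.mismatch = (innermost, tag)


def first_mismatch(text, void_elements):
    """Return the first (open, close) mismatch, or None if tags balance."""
    checker = BalanceChecker(void_elements)
    checker.feed(text)
    checker.close()
    return checker.mismatch


if __name__ == "__main__":
    sample = "<em>a<br>b</em>"
    print("XML rules: ", first_mismatch(sample, set()))
    print("HTML rules:", first_mismatch(sample, VOID))
```

A stack-based check like this fails at the first mismatched close tag without any backtracking. It does not cover the <li> mismatches in the xmllint output, which come from optional end tags rather than void elements.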

Feb 23 '25 01:02 timo
