mojo Mojo::DOM treats "< foo" as start tag, phantom elements ensue

Mojolicious version: 9.31
Perl version: v5.36.0
Operating system: Ubuntu 22.04.1 LTS

Steps to reproduce the behavior

#!/usr/bin/env perl
use v5.12.0;
use warnings;
use Test::More;
use Mojo::DOM;

my $dom = Mojo::DOM->new('if a < script then="<!--"> </script> <p>FAIL</p>-->');

is_deeply
    $dom->find('script, p')->map(sub { $_->to_string })->to_array,
    [],
    'fragment contains no tags, just a comment';

done_testing;

Expected behavior

Test passes.

After seeing <, we are in the tag open state. The semantically relevant characters that can follow are !, /, ASCII alpha, ?, and EOF. Anything else (including spaces) triggers an invalid-first-character-of-tag-name error. If the parser doesn't abort, it should treat the < character literally, as if < had been seen.

Actual behavior

not ok 1 - fragment contains no tags, just a comment
#   Failed test 'fragment contains no tags, just a comment'
#   at mojo-dom-bug-9.pl line 10.
#     Structures begin differing at:
#          $got->[0] = '<script then="&lt;!--"> </script>'
#     $expected->[0] = Does not exist
1..1
# Looks like you failed 1 test of 1.

Jan 29 '23 21:01 mauke

The following change would implement the correct (at least for HTML) behavior:

diff --git lib/Mojo/DOM/HTML.pm lib/Mojo/DOM/HTML.pm
index e10b1532d..81f54c014 100644
--- lib/Mojo/DOM/HTML.pm
+++ lib/Mojo/DOM/HTML.pm
@@ -36,8 +36,10 @@ my $TOKEN_RE = qr/
     |
       \?(.*?)\?                                                                # Processing Instruction
     |
-      \s*((?:\/\s*)?[^<>\s\/0-9.\-][^<>\s\/]*\s*(?:(?:$ATTR_RE){0,32766})*+)   # Tag
+      (\/?[^<>\s\/0-9.\-][^<>\s\/]*\s*(?:(?:$ATTR_RE){0,32766})*+)             # Tag
     )>
+  |
+    <\/ (?![a-z]) ([^>]*) >                                                    # Invalid-first-character-of-tag-name error (bogus comment)
   |
     (<)                                                                        # Runaway "<"
   )??
@@ -101,12 +103,15 @@ sub parse {
   my $xml     = $self->xml;
   my $current = my $tree = ['root'];
   while ($html =~ /\G$TOKEN_RE/gcso) {
-    my ($text, $doctype, $comment, $cdata, $pi, $tag, $runaway) = ($1, $2, $3, $4, $5, $6, $11);
+    my ($text, $doctype, $comment, $cdata, $pi, $tag, $bogus_comment, $runaway) = ($1, $2, $3, $4, $5, $6, $11, $12);
 
     # Text (and runaway "<")
     $text .= '<'                                 if defined $runaway;
     _node($current, 'text', html_unescape $text) if defined $text;
 
+    # Malformed end tag
+    $comment = $bogus_comment if length $bogus_comment;
+
     # Tag
     if (defined $tag) {

That is:

< foo> treated as < foo> (invalid-first-character-of-tag-name)
</ foo> treated as  (invalid-first-character-of-tag-name)
</> ignored (missing-end-tag-name)

The problem is all the tests in t/mojo/dom.t that rely on the previous behavior. There's so many, I'm not sure what to do about them.

Feb 01 '23 04:02 mauke

If tests are wrong then they should be fixed.

Feb 01 '23 10:02 kraih

The problem is stuff like this:

subtest 'XML name characters' => sub {
  my $dom = Mojo::DOM->new->xml(1)->parse('<Foo><1a>foo</1a></Foo>');
  is $dom->at('Foo')->text, '<1a>foo</1a>',                        'right text';
  is "$dom",                '<Foo>&lt;1a&gt;foo&lt;/1a&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><.a>foo</.a></Foo>');
  is $dom->at('Foo')->text, '<.a>foo</.a>',                        'right text';
  is "$dom",                '<Foo>&lt;.a&gt;foo&lt;/.a&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><.>foo</.></Foo>');
  is $dom->at('Foo')->text, '<.>foo</.>',                        'right text';
  is "$dom",                '<Foo>&lt;.&gt;foo&lt;/.&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><-a>foo</-a></Foo>');
  is $dom->at('Foo')->text, '<-a>foo</-a>',                        'right text';
  is "$dom",                '<Foo>&lt;-a&gt;foo&lt;/-a&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><a1>foo</a1></Foo>');
  is $dom->at('Foo a1')->text, 'foo',                     'right text';
  is "$dom",                   '<Foo><a1>foo</a1></Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><a .b -c 1>foo</a></Foo>');
  is $dom->at('Foo')->text, '<a .b -c 1>foo',                  'right text';
  is "$dom",                '<Foo>&lt;a .b -c 1&gt;foo</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<😄 😄="😄">foo</😄>');
  is $dom->at('😄')->text, 'foo',              'right text';
  is "$dom",              '<😄 😄="😄">foo</😄>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<こんにちは こんにちは="こんにちは">foo</こんにちは>');
  is $dom->at('こんにちは')->text, 'foo',                              'right text';
  is "$dom",                  '<こんにちは こんにちは="こんにちは">foo</こんにちは>', 'right result';
};

It specifically tests for "incorrect" (for HTML) behavior of the parser. I don't know enough about XML to say whether this is correct for XML, but if so, you might need different tokenizers for HTML and XML. :-/

PS: In HTML mode, the correct parse for

<😄 😄="😄">foo</😄>

would be

lt;😄 😄="😄"&gt;foo<!--😄-->

Feb 02 '23 10:02 mauke

I don't see the relation between < foo and <😄 😄="😄">. And i'm pretty sure the latter is valid XML, since @mojojs/dom is more strict about what code points to allow for names (based on the XML spec). A separate tokenizer is out of the question, but i don't see the harm in being a bit more relaxed with names anyway. Using the same character ranges as @mojojs/dom would of course be an option.

Feb 02 '23 10:02 kraih

Recall:

After seeing <, we are in the tag open state. The semantically relevant characters that can follow are !, /, ASCII alpha, ?, and EOF. Anything else (including spaces) triggers an invalid-first-character-of-tag-name error.

Like space, 😄 is not !, /, ?, or ASCII alpha, so it is a parse error.

Consider this example:

<😄 title="<script>console.log('hi');</script>"></😄>

According to the HTML5 spec, this contains a script element because it parses like

&lt;😄 title=&quot;<script>console.log('hi');</script>&quot;&gt;<!--😄-->

But Mojo::DOM doesn't see the <script>.

This is essentially unfixable: Browsers will always see a different document structure than Mojo::DOM as long as tag names can start with non-ascii-alpha characters.

Feb 02 '23 10:02 mauke

That is certainly an interesting case. 🤔

Feb 02 '23 11:02 kraih

I think the HTML/XML overlap is where we draw the line with correctness, and this will remain a case we handle like it was XML. But we can still make other cases that do not conflict with XML more strict.

Feb 02 '23 11:02 kraih