mojo icon indicating copy to clipboard operation
mojo copied to clipboard

Mojo::DOM treats "< foo" as start tag, phantom elements ensue

Open mauke opened this issue 2 years ago โ€ข 7 comments

  • Mojolicious version: 9.31
  • Perl version: v5.36.0
  • Operating system: Ubuntu 22.04.1 LTS

Steps to reproduce the behavior

#!/usr/bin/env perl
use v5.12.0;
use warnings;
use Test::More;
use Mojo::DOM;

my $dom = Mojo::DOM->new('if a < script then="<!--"> </script> <p>FAIL</p>-->');

is_deeply
    $dom->find('script, p')->map(sub { $_->to_string })->to_array,
    [],
    'fragment contains no tags, just a comment';

done_testing;

Expected behavior

Test passes.

After seeing <, we are in the tag open state. The semantically relevant characters that can follow are !, /, ASCII alpha, ?, and EOF. Anything else (including spaces) triggers an invalid-first-character-of-tag-name error. If the parser doesn't abort, it should treat the < character literally, as if &lt; had been seen.

Actual behavior

not ok 1 - fragment contains no tags, just a comment
#   Failed test 'fragment contains no tags, just a comment'
#   at mojo-dom-bug-9.pl line 10.
#     Structures begin differing at:
#          $got->[0] = '<script then="&lt;!--"> </script>'
#     $expected->[0] = Does not exist
1..1
# Looks like you failed 1 test of 1.

mauke avatar Jan 29 '23 21:01 mauke

The following change would implement the correct (at least for HTML) behavior:

diff --git lib/Mojo/DOM/HTML.pm lib/Mojo/DOM/HTML.pm
index e10b1532d..81f54c014 100644
--- lib/Mojo/DOM/HTML.pm
+++ lib/Mojo/DOM/HTML.pm
@@ -36,8 +36,10 @@ my $TOKEN_RE = qr/
     |
       \?(.*?)\?                                                                # Processing Instruction
     |
-      \s*((?:\/\s*)?[^<>\s\/0-9.\-][^<>\s\/]*\s*(?:(?:$ATTR_RE){0,32766})*+)   # Tag
+      (\/?[^<>\s\/0-9.\-][^<>\s\/]*\s*(?:(?:$ATTR_RE){0,32766})*+)             # Tag
     )>
+  |
+    <\/ (?![a-z]) ([^>]*) >                                                    # Invalid-first-character-of-tag-name error (bogus comment)
   |
     (<)                                                                        # Runaway "<"
   )??
@@ -101,12 +103,15 @@ sub parse {
   my $xml     = $self->xml;
   my $current = my $tree = ['root'];
   while ($html =~ /\G$TOKEN_RE/gcso) {
-    my ($text, $doctype, $comment, $cdata, $pi, $tag, $runaway) = ($1, $2, $3, $4, $5, $6, $11);
+    my ($text, $doctype, $comment, $cdata, $pi, $tag, $bogus_comment, $runaway) = ($1, $2, $3, $4, $5, $6, $11, $12);
 
     # Text (and runaway "<")
     $text .= '<'                                 if defined $runaway;
     _node($current, 'text', html_unescape $text) if defined $text;
 
+    # Malformed end tag
+    $comment = $bogus_comment if length $bogus_comment;
+
     # Tag
     if (defined $tag) {
 

That is:

  • < foo> treated as &lt; foo&gt; (invalid-first-character-of-tag-name)
  • </ foo> treated as <!-- foo--> (invalid-first-character-of-tag-name)
  • </> ignored (missing-end-tag-name)

The problem is all the tests in t/mojo/dom.t that rely on the previous behavior. There's so many, I'm not sure what to do about them.

mauke avatar Feb 01 '23 04:02 mauke

If tests are wrong then they should be fixed.

kraih avatar Feb 01 '23 10:02 kraih

The problem is stuff like this:

subtest 'XML name characters' => sub {
  my $dom = Mojo::DOM->new->xml(1)->parse('<Foo><1a>foo</1a></Foo>');
  is $dom->at('Foo')->text, '<1a>foo</1a>',                        'right text';
  is "$dom",                '<Foo>&lt;1a&gt;foo&lt;/1a&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><.a>foo</.a></Foo>');
  is $dom->at('Foo')->text, '<.a>foo</.a>',                        'right text';
  is "$dom",                '<Foo>&lt;.a&gt;foo&lt;/.a&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><.>foo</.></Foo>');
  is $dom->at('Foo')->text, '<.>foo</.>',                        'right text';
  is "$dom",                '<Foo>&lt;.&gt;foo&lt;/.&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><-a>foo</-a></Foo>');
  is $dom->at('Foo')->text, '<-a>foo</-a>',                        'right text';
  is "$dom",                '<Foo>&lt;-a&gt;foo&lt;/-a&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><a1>foo</a1></Foo>');
  is $dom->at('Foo a1')->text, 'foo',                     'right text';
  is "$dom",                   '<Foo><a1>foo</a1></Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><a .b -c 1>foo</a></Foo>');
  is $dom->at('Foo')->text, '<a .b -c 1>foo',                  'right text';
  is "$dom",                '<Foo>&lt;a .b -c 1&gt;foo</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<๐Ÿ˜„ ๐Ÿ˜„="๐Ÿ˜„">foo</๐Ÿ˜„>');
  is $dom->at('๐Ÿ˜„')->text, 'foo',              'right text';
  is "$dom",              '<๐Ÿ˜„ ๐Ÿ˜„="๐Ÿ˜„">foo</๐Ÿ˜„>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<ใ“ใ‚“ใซใกใฏ ใ“ใ‚“ใซใกใฏ="ใ“ใ‚“ใซใกใฏ">foo</ใ“ใ‚“ใซใกใฏ>');
  is $dom->at('ใ“ใ‚“ใซใกใฏ')->text, 'foo',                              'right text';
  is "$dom",                  '<ใ“ใ‚“ใซใกใฏ ใ“ใ‚“ใซใกใฏ="ใ“ใ‚“ใซใกใฏ">foo</ใ“ใ‚“ใซใกใฏ>', 'right result';
};

It specifically tests for "incorrect" (for HTML) behavior of the parser. I don't know enough about XML to say whether this is correct for XML, but if so, you might need different tokenizers for HTML and XML. :-/


PS: In HTML mode, the correct parse for

<๐Ÿ˜„ ๐Ÿ˜„="๐Ÿ˜„">foo</๐Ÿ˜„>

would be

lt;๐Ÿ˜„ ๐Ÿ˜„="๐Ÿ˜„"&gt;foo<!--๐Ÿ˜„-->

mauke avatar Feb 02 '23 10:02 mauke

I don't see the relation between < foo and <๐Ÿ˜„ ๐Ÿ˜„="๐Ÿ˜„">. And i'm pretty sure the latter is valid XML, since @mojojs/dom is more strict about what code points to allow for names (based on the XML spec). A separate tokenizer is out of the question, but i don't see the harm in being a bit more relaxed with names anyway. Using the same character ranges as @mojojs/dom would of course be an option.

kraih avatar Feb 02 '23 10:02 kraih

Recall:

After seeing <, we are in the tag open state. The semantically relevant characters that can follow are !, /, ASCII alpha, ?, and EOF. Anything else (including spaces) triggers an invalid-first-character-of-tag-name error.

Like space, ๐Ÿ˜„ is not !, /, ?, or ASCII alpha, so it is a parse error.

Consider this example:

<๐Ÿ˜„ title="<script>console.log('hi');</script>"></๐Ÿ˜„>

According to the HTML5 spec, this contains a script element because it parses like

&lt;๐Ÿ˜„ title=&quot;<script>console.log('hi');</script>&quot;&gt;<!--๐Ÿ˜„-->

But Mojo::DOM doesn't see the <script>.

This is essentially unfixable: Browsers will always see a different document structure than Mojo::DOM as long as tag names can start with non-ascii-alpha characters.

mauke avatar Feb 02 '23 10:02 mauke

That is certainly an interesting case. ๐Ÿค”

kraih avatar Feb 02 '23 11:02 kraih

I think the HTML/XML overlap is where we draw the line with correctness, and this will remain a case we handle like it was XML. But we can still make other cases that do not conflict with XML more strict.

kraih avatar Feb 02 '23 11:02 kraih