mojo icon indicating copy to clipboard operation
mojo copied to clipboard

Mojo::DOM doesn't detect end of script elements correctly

Open mauke opened this issue 2 years ago • 4 comments

  • Mojolicious version: 9.31
  • Perl version: v5.36.0
  • Operating system: Linux (Ubuntu 22.04)

Steps to reproduce the behavior

#!/usr/bin/env perl
use v5.12.0;
use warnings;
use Test::More;
use Mojo::DOM;

for my $fragment (
    '<script> console.log("<!--"); </script>',
    '<script> console.log("<!-- <script> -->"); </script>',
    '<script> console.log("<!-- <script> </script>"); </script>',
    '<script> console.log("<!-- <script> </script> -->"); </script>',
) {
    my $dom = Mojo::DOM->new("<!DOCTYPE html>\n$fragment");

    is_deeply
        $dom->find('script')->map(sub { $_->to_string })->to_array,
        [$fragment],
        "HTML fragment '$fragment' parses correctly";
}

done_testing;

Expected behavior

Test passes.

Actual behavior

ok 1 - HTML fragment '<script> console.log("<!--"); </script>' parses correctly
ok 2 - HTML fragment '<script> console.log("<!-- <script> -->"); </script>' parses correctly
not ok 3 - HTML fragment '<script> console.log("<!-- <script> </script>"); </script>' parses correctly
#   Failed test 'HTML fragment '<script> console.log("<!-- <script> </script>"); </script>' parses correctly'
#   at mojo-dom-bug-6.pl line 16.
#     Structures begin differing at:
#          $got->[0] = '<script> console.log("<!-- <script> </script>'
#     $expected->[0] = '<script> console.log("<!-- <script> </script>"); </script>'
not ok 4 - HTML fragment '<script> console.log("<!-- <script> </script> -->"); </script>' parses correctly
#   Failed test 'HTML fragment '<script> console.log("<!-- <script> </script> -->"); </script>' parses correctly'
#   at mojo-dom-bug-6.pl line 16.
#     Structures begin differing at:
#          $got->[0] = '<script> console.log("<!-- <script> </script>'
#     $expected->[0] = '<script> console.log("<!-- <script> </script> -->"); </script>'
1..4
# Looks like you failed 2 tests of 4.

mauke avatar Jan 23 '23 01:01 mauke

For the future, please link to the relevant spec sections, it saves us a lot of time and you must have looked it up already anyway before opening the issue.

What is going on here is that for legacy reasons, "<!--" and "<script" strings in script elements in HTML need to be balanced in order for the parser to consider closing the block.

The full explanation can be found here.

kraih avatar Jan 29 '23 14:01 kraih

Given the possible security implications, i'd say this is a high priority bug.

kraih avatar Jan 29 '23 14:01 kraih

Consider: https://www.rfc-editor.org/rfc/rfc1866

3.2.5. Comments

   To include comments in an HTML document, use a comment declaration. A
   comment declaration consists of `<!' followed by zero or more
   comments followed by `>'. Each comment starts with `--' and includes
   all text up to and including the next occurrence of `--'. In a
   comment declaration, white space is allowed after each comment, but
   not before the first comment.  The entire comment declaration is
   ignored.

      NOTE - Some historical HTML implementations incorrectly consider
      any `>' character to be the termination of a comment.

   For example:

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <HEAD>
    <TITLE>HTML Comment Example</TITLE>
    <!-- Id: html-sgml.sgm,v 1.5 1995/05/26 21:29:50 connolly Exp  -->
    <!-- another -- -- comment -->
    <!>
    </HEAD>
    <BODY>
    <p> <!- not a comment, just regular old data characters ->

poti1 avatar Jun 28 '23 16:06 poti1

See here for which specs we follow.

kraih avatar Jun 28 '23 17:06 kraih