mojo
mojo copied to clipboard
Mojo::DOM doesn't detect end of script elements correctly
- Mojolicious version: 9.31
- Perl version: v5.36.0
- Operating system: Linux (Ubuntu 22.04)
Steps to reproduce the behavior
#!/usr/bin/env perl
use v5.12.0;
use warnings;
use Test::More;
use Mojo::DOM;
for my $fragment (
'<script> console.log("<!--"); </script>',
'<script> console.log("<!-- <script> -->"); </script>',
'<script> console.log("<!-- <script> </script>"); </script>',
'<script> console.log("<!-- <script> </script> -->"); </script>',
) {
my $dom = Mojo::DOM->new("<!DOCTYPE html>\n$fragment");
is_deeply
$dom->find('script')->map(sub { $_->to_string })->to_array,
[$fragment],
"HTML fragment '$fragment' parses correctly";
}
done_testing;
Expected behavior
Test passes.
Actual behavior
ok 1 - HTML fragment '<script> console.log("<!--"); </script>' parses correctly
ok 2 - HTML fragment '<script> console.log("<!-- <script> -->"); </script>' parses correctly
not ok 3 - HTML fragment '<script> console.log("<!-- <script> </script>"); </script>' parses correctly
# Failed test 'HTML fragment '<script> console.log("<!-- <script> </script>"); </script>' parses correctly'
# at mojo-dom-bug-6.pl line 16.
# Structures begin differing at:
# $got->[0] = '<script> console.log("<!-- <script> </script>'
# $expected->[0] = '<script> console.log("<!-- <script> </script>"); </script>'
not ok 4 - HTML fragment '<script> console.log("<!-- <script> </script> -->"); </script>' parses correctly
# Failed test 'HTML fragment '<script> console.log("<!-- <script> </script> -->"); </script>' parses correctly'
# at mojo-dom-bug-6.pl line 16.
# Structures begin differing at:
# $got->[0] = '<script> console.log("<!-- <script> </script>'
# $expected->[0] = '<script> console.log("<!-- <script> </script> -->"); </script>'
1..4
# Looks like you failed 2 tests of 4.
For the future, please link to the relevant spec sections, it saves us a lot of time and you must have looked it up already anyway before opening the issue.
What is going on here is that for legacy reasons, "<!--" and "<script" strings in script elements in HTML need to be balanced in order for the parser to consider closing the block.
The full explanation can be found here.
Given the possible security implications, i'd say this is a high priority bug.
Consider: https://www.rfc-editor.org/rfc/rfc1866
3.2.5. Comments
To include comments in an HTML document, use a comment declaration. A
comment declaration consists of `<!' followed by zero or more
comments followed by `>'. Each comment starts with `--' and includes
all text up to and including the next occurrence of `--'. In a
comment declaration, white space is allowed after each comment, but
not before the first comment. The entire comment declaration is
ignored.
NOTE - Some historical HTML implementations incorrectly consider
any `>' character to be the termination of a comment.
For example:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HEAD>
<TITLE>HTML Comment Example</TITLE>
<!-- Id: html-sgml.sgm,v 1.5 1995/05/26 21:29:50 connolly Exp -->
<!-- another -- -- comment -->
<!>
</HEAD>
<BODY>
<p> <!- not a comment, just regular old data characters ->
See here for which specs we follow.