Mojo::DOM doesn't handle nested descendant selectors correctly
- Mojolicious version: 9.31
- Perl version: v5.36.0
- Operating system: Linux (Ubuntu 22.04)
Steps to reproduce the behavior
#!/usr/bin/env perl
use v5.12.0;
use warnings;
use Test::More;
use Mojo::DOM;
my $dom = Mojo::DOM->new('<div> ' x 100 . '</div> ' x 100);
is_deeply
$dom->find('#gobbledygook * * * *')->map(sub { $_->to_string })->to_array,
[],
'non-existent elements have no descendants';
done_testing;
Expected behavior
Test passes.
Since the document does not contain an element with id="gobbledygook", the search should fail immediately.
Actual behavior
Test hangs; doesn't finish.
Selectors match right to left, so the slow matching behaviour makes sense. While the performance could be better, i'm not gonna call this a bug.
I call it a bug for two reasons:
- It's counterintuitive. As a user, I would expect a selector that starts with
#footo only examine elements underid="foo"and do no work otherwise. In Mojo::DOM, the situation is reversed: If the ID is found, the code stops quickly; it only hangs if there is no such ID in the document. - There is a potential for DoS in software like NewsExtractor: https://metacpan.org/release/GUGOD/NewsExtractor-v0.45.0/source/lib/NewsExtractor/GenericExtractor.pm#L110
If someone were to (accidentally or on purpose) feed that code a document structure likeelsif ($guess = $dom->at('#details_block .left .date, .article_header > .author > span:last-child')) {'<div class="left date">' x 1000 . '</div>' x 1000, it would make the code spin for a long time (at least several minutes), burning CPU.
So: If you take seemingly harmless code using a selector like #details_block .left .date, you can't run it on untrusted input because it could cause a denial of service. I feel that should be at least pointed out in the documentation.
Right to left matching is actually standard, and used just the same in browsers. If you get better results there it merely means that they have additional optimisations for the case.
I mean, I get better results both right-to-left and left-to-right: https://github.com/mauke/HTML-Blitz/blob/9f0496ac28260675f8bbaabb68ddad477c09231a/t/selector-combinators.t#L155-L169
(The "optimisation" here is not to use backtracking; cf. https://research.swtch.com/glob.)
PRs with backtracking optimisations would of course be very welcome.