mojo icon indicating copy to clipboard operation
mojo copied to clipboard

Mojo::DOM doesn't handle nested descendant selectors correctly

Open mauke opened this issue 2 years ago • 7 comments

  • Mojolicious version: 9.31
  • Perl version: v5.36.0
  • Operating system: Linux (Ubuntu 22.04)

Steps to reproduce the behavior

#!/usr/bin/env perl
use v5.12.0;
use warnings;
use Test::More;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<div> ' x 100 . '</div> ' x 100);

is_deeply
    $dom->find('#gobbledygook * * * *')->map(sub { $_->to_string })->to_array,
    [],
    'non-existent elements have no descendants';

done_testing;

Expected behavior

Test passes.

Since the document does not contain an element with id="gobbledygook", the search should fail immediately.

Actual behavior

Test hangs; doesn't finish.

mauke avatar Jan 23 '23 00:01 mauke

Selectors match right to left, so the slow matching behaviour makes sense. While the performance could be better, i'm not gonna call this a bug.

kraih avatar Jan 29 '23 14:01 kraih

I call it a bug for two reasons:

  1. It's counterintuitive. As a user, I would expect a selector that starts with #foo to only examine elements under id="foo" and do no work otherwise. In Mojo::DOM, the situation is reversed: If the ID is found, the code stops quickly; it only hangs if there is no such ID in the document.
  2. There is a potential for DoS in software like NewsExtractor: https://metacpan.org/release/GUGOD/NewsExtractor-v0.45.0/source/lib/NewsExtractor/GenericExtractor.pm#L110
        elsif ($guess = $dom->at('#details_block .left .date, .article_header > .author > span:last-child')) {
    
    If someone were to (accidentally or on purpose) feed that code a document structure like '<div class="left date">' x 1000 . '</div>' x 1000, it would make the code spin for a long time (at least several minutes), burning CPU.

So: If you take seemingly harmless code using a selector like #details_block .left .date, you can't run it on untrusted input because it could cause a denial of service. I feel that should be at least pointed out in the documentation.

mauke avatar Jan 29 '23 15:01 mauke

Right to left matching is actually standard, and used just the same in browsers. If you get better results there it merely means that they have additional optimisations for the case.

kraih avatar Jan 29 '23 15:01 kraih

I mean, I get better results both right-to-left and left-to-right: https://github.com/mauke/HTML-Blitz/blob/9f0496ac28260675f8bbaabb68ddad477c09231a/t/selector-combinators.t#L155-L169

(The "optimisation" here is not to use backtracking; cf. https://research.swtch.com/glob.)

mauke avatar Jan 29 '23 15:01 mauke

PRs with backtracking optimisations would of course be very welcome.

kraih avatar Jan 29 '23 15:01 kraih