scraper icon indicating copy to clipboard operation
scraper copied to clipboard

Memory corruption (probably since 0.21.0)

Open tamasfe opened this issue 8 months ago • 28 comments

We have a service at my company that uses scraper for extracting data from HTML that recently started crashing. I've tracked down the issue that appears after bumping scraper from 0.19.0 to the latest version.

The crash happens once in a blue moon with the system allocator, however with Jemalloc on linux, it happens almost immediately.

I have extracted some of our logic that consistently reproduces the issue and uploaded it here: https://github.com/tamasfe/scraper-segfault . Simply run cargo run and observe the segmentation fault message.

I did not yet dive into the causes and attempts to fix it as tracking it down took me several hours already, I hope the repro is minimal enough for people who are familiar with the project internals.

tamasfe avatar May 16 '25 15:05 tamasfe

Still reproduces using the latest code from master. Valgrind has

==512451== Use of uninitialised value of size 8
==512451==    at 0x5A64A4: selectors::matching::matches_simple_selector (matching.rs:1191)
==512451==    by 0x5A7525: selectors::matching::matches_compound_selector::{{closure}} (matching.rs:1172)
==512451==    by 0x5A73BB: any_value<&selectors::parser::Component<scraper::selector::Simple>, &mut selectors::parser::SelectorIter<scraper::selector::Simple>, selectors::kleene_value::{impl#1}::any_false::{closure_env#0}<&selectors::parser::Component<scraper::selector::Simple>, &mut selectors::parser::SelectorIter<scraper::selector::Simple>, selectors::matching::matches_compound_selector::{closure_env#0}<scraper::element_ref::ElementRef>>, selectors::matching::matches_compound_selector::{closure_env#0}<scraper::element_ref::ElementRef>> (kleene_value.rs:65)
==512451==    by 0x5A73BB: any_false<&selectors::parser::Component<scraper::selector::Simple>, &mut selectors::parser::SelectorIter<scraper::selector::Simple>, selectors::matching::matches_compound_selector::{closure_env#0}<scraper::element_ref::ElementRef>> (kleene_value.rs:52)
==512451==    by 0x5A73BB: selectors::matching::matches_compound_selector (matching.rs:1170)
==512451==    by 0x5A9A8A: selectors::matching::matches_complex_selector_internal (matching.rs:825)
==512451==    by 0x5A411C: matches_complex_selector<scraper::element_ref::ElementRef> (matching.rs:445)
==512451==    by 0x5A411C: matches_selector_kleene<scraper::element_ref::ElementRef> (matching.rs:295)
==512451==    by 0x5A411C: matches_selector<scraper::element_ref::ElementRef> (matching.rs:265)
==512451==    by 0x5A411C: scraper::selector::Selector::matches_with_scope_and_cache::{{closure}} (selector.rs:73)
==512451==    by 0x5ADF30: <core::slice::iter::Iter<T> as core::iter::traits::iterator::Iterator>::any (macros.rs:308)
==512451==    by 0x529F8E: scraper::selector::Selector::matches_with_scope_and_cache (selector.rs:70)
==512451==    by 0x5646D0: <scraper::html::Select as core::iter::traits::iterator::Iterator>::next (mod.rs:163)
==512451==    by 0x227EEB: scraper_broken::extract_redacted_from_html (main.rs:37)
==512451==    by 0x2278F5: scraper_broken::main (main.rs:15)
==512451==    by 0x22696A: core::ops::function::FnOnce::call_once (function.rs:250)
==512451==    by 0x229C0D: std::sys::backtrace::__rust_begin_short_backtrace (backtrace.rs:152)
==512451== 
==512451== Use of uninitialised value of size 8
==512451==    at 0x5A66C0: to_unconditional_case_sensitivity<scraper::element_ref::ElementRef> (matching.rs:1322)
==512451==    by 0x5A66C0: selectors::matching::matches_simple_selector (matching.rs:1213)
==512451==    by 0x5A7525: selectors::matching::matches_compound_selector::{{closure}} (matching.rs:1172)
==512451==    by 0x5A73BB: any_value<&selectors::parser::Component<scraper::selector::Simple>, &mut selectors::parser::SelectorIter<scraper::selector::Simple>, selectors::kleene_value::{impl#1}::any_false::{closure_env#0}<&selectors::parser::Component<scraper::selector::Simple>, &mut selectors::parser::SelectorIter<scraper::selector::Simple>, selectors::matching::matches_compound_selector::{closure_env#0}<scraper::element_ref::ElementRef>>, selectors::matching::matches_compound_selector::{closure_env#0}<scraper::element_ref::ElementRef>> (kleene_value.rs:65)
==512451==    by 0x5A73BB: any_false<&selectors::parser::Component<scraper::selector::Simple>, &mut selectors::parser::SelectorIter<scraper::selector::Simple>, selectors::matching::matches_compound_selector::{closure_env#0}<scraper::element_ref::ElementRef>> (kleene_value.rs:52)
==512451==    by 0x5A73BB: selectors::matching::matches_compound_selector (matching.rs:1170)
==512451==    by 0x5A9A8A: selectors::matching::matches_complex_selector_internal (matching.rs:825)
==512451==    by 0x5A411C: matches_complex_selector<scraper::element_ref::ElementRef> (matching.rs:445)
==512451==    by 0x5A411C: matches_selector_kleene<scraper::element_ref::ElementRef> (matching.rs:295)
==512451==    by 0x5A411C: matches_selector<scraper::element_ref::ElementRef> (matching.rs:265)
==512451==    by 0x5A411C: scraper::selector::Selector::matches_with_scope_and_cache::{{closure}} (selector.rs:73)
==512451==    by 0x5ADF30: <core::slice::iter::Iter<T> as core::iter::traits::iterator::Iterator>::any (macros.rs:308)
==512451==    by 0x529F8E: scraper::selector::Selector::matches_with_scope_and_cache (selector.rs:70)
==512451==    by 0x5646D0: <scraper::html::Select as core::iter::traits::iterator::Iterator>::next (mod.rs:163)
==512451==    by 0x227EEB: scraper_broken::extract_redacted_from_html (main.rs:37)
==512451==    by 0x2278F5: scraper_broken::main (main.rs:15)
==512451==    by 0x22696A: core::ops::function::FnOnce::call_once (function.rs:250)
==512451==    by 0x229C0D: std::sys::backtrace::__rust_begin_short_backtrace (backtrace.rs:152)
==512451== 
==512451== 
==512451== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==512451==  Bad permissions for mapped region at address 0x7A1CE5
==512451==    at 0x7A1CE5: ???
==512451==    by 0x22BB26: std::rt::lang_start (rt.rs:198)
==512451==    by 0x228C8D: main (in /home/adam/scraper-segfault/target/debug/scraper-broken)
==512451== 
==512451== HEAP SUMMARY:
==512451==     in use at exit: 0 bytes in 0 blocks
==512451==   total heap usage: 7 allocs, 7 frees, 2,072 bytes allocated
==512451== 
==512451== All heap blocks were freed -- no leaks are possible
==512451== 
==512451== Use --track-origins=yes to see where uninitialised values come from
==512451== For lists of detected and suppressed errors, rerun with: -s
==512451== ERROR SUMMARY: 8 errors from 8 contexts (suppressed: 0 from 0)

to say about it.

adamreichold avatar May 16 '25 15:05 adamreichold

Weirdly enough, the uninitialized memory is stack allocations created by the parsers in selectors and cssparser.

adamreichold avatar May 16 '25 15:05 adamreichold

I minimized the test code further to

use scraper::{Html, Selector};
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static ALLOC: Jemalloc = Jemalloc;

fn main() {
    let test_data = include_str!("../repro.html");

    loop {
        extract_redacted_from_html(test_data);
    }
}

fn extract_redacted_from_html(html: &str) {
    let fragment = Html::parse_fragment(html);
    let row_selector = Selector::parse("tr").unwrap();

    fragment.select(&row_selector).next();
}

adamreichold avatar May 16 '25 16:05 adamreichold

Weirdly enough, the uninitialized memory is stack allocations created by the parsers in selectors and cssparser.

This also fits with keeping the selector itself alive working around the crash:

diff --git a/src/main.rs b/src/main.rs
index 97dfb96..6cf9a95 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -13,8 +13,8 @@ fn main() {
 }
 
 fn extract_redacted_from_html(html: &str) {
-    let fragment = Html::parse_fragment(html);
     let row_selector = Selector::parse("tr").unwrap();
+    let fragment = Html::parse_fragment(html);
 
     fragment.select(&row_selector).next();
 }

adamreichold avatar May 16 '25 16:05 adamreichold

This also explain why we never saw this in production: Our code is basically always structure to parse/allocate selectors one in the beginning and then re-use them for all documents, i.e. the selectors will always outlive the documents in our usage.

adamreichold avatar May 16 '25 16:05 adamreichold

So this appears to happen when the document is dropped after the selector via an invalid string_cache::atom::Atom<markup5ever::LocalNameStaticSet>
thread 'main' panicked at /home/adam/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/string_cache-0.8.9/src/atom.rs:262:25:
null pointer dereference occurred
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread caused non-unwinding panic. aborting.

Program received signal SIGABRT, Aborted.
Downloading 4.48 K source file /usr/src/debug/glibc-2.41/nptl/pthread_kill.c
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44                                                                                                                                                     
44            return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
(gdb) bt full
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
        tid = <optimized out>
        ret = 0
        pd = <optimized out>
        old_mask = {__val = {17037417024}}
        ret = <optimized out>
#1  0x00007ffff7c9b453 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:89
No locals.
#2  0x00007ffff7c41cb6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#3  0x00007ffff7c2938b in __GI_abort () at abort.c:73
        act = {__sigaction_handler = {sa_handler = 0x7fffffffca60, sa_sigaction = 0x7fffffffca60}, sa_mask = {__val = {93824998144200, 3758096416, 140737488341608, 93824998147272, 93824997890192, 0, 140737488341880, 140737488341648, 93824997673827, 
              18446744065119640128, 0, 0, 140737488341520, 140737488341504, 140737488341615, 140737488341815}}, sa_flags = 0, sa_restorer = 0x0}
#4  0x0000555555a89a5a in std::sys::pal::unix::abort_internal () at library/std/src/sys/pal/unix/mod.rs:366
No locals.
#5  0x0000555555a88913 in std::panicking::rust_panic_with_hook () at library/std/src/panicking.rs:37
No locals.
#6  0x0000555555a884f6 in std::panicking::begin_panic_handler::{closure#0} () at library/std/src/panicking.rs:699
No locals.
#7  0x0000555555a875a9 in std::sys::backtrace::__rust_end_short_backtrace<std::panicking::begin_panic_handler::{closure_env#0}, !> () at library/std/src/sys/backtrace.rs:168
No locals.
#8  0x0000555555a881bd in std::panicking::begin_panic_handler () at library/std/src/panicking.rs:697
No locals.
#9  0x0000555555aa3ccd in core::panicking::panic_nounwind_fmt::runtime () at library/core/src/panicking.rs:117
No locals.
#10 core::panicking::panic_nounwind_fmt () at library/core/src/intrinsics/mod.rs:3175
No locals.
#11 0x0000555555aa3ead in core::panicking::panic_null_pointer_dereference () at library/core/src/panicking.rs:311
No locals.
#12 0x0000555555a4ca5a in string_cache::atom::{impl#6}::drop<markup5ever::LocalNameStaticSet> (self=0x7ffff780e340) at /home/adam/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/string_cache-0.8.9/src/atom.rs:262
        entry = 0x0
#13 0x0000555555a48bdb in core::ptr::drop_in_place<string_cache::atom::Atom<markup5ever::LocalNameStaticSet>> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#14 0x0000555555a489c0 in core::ptr::drop_in_place<markup5ever::interface::QualName> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#15 0x00005555559ad537 in core::ptr::drop_in_place<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> ()
    at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#16 0x00005555559916f1 in core::ptr::mut_ptr::{impl#0}::drop_in_place<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> (self=0x7ffff780e338)
    at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mut_ptr.rs:1491
No locals.
#17 hashbrown::raw::Bucket<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)>::drop<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> (
    self=0x7fffffffcda8) at /rust/deps/hashbrown-0.15.3/src/raw/mod.rs:505
No locals.
#18 hashbrown::raw::RawTableInner::drop_elements<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> (self=0x7ffff78b6568) at /rust/deps/hashbrown-0.15.3/src/raw/mod.rs:2106
        item = hashbrown::raw::Bucket<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> {ptr: core::ptr::non_null::NonNull<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> {pointer: 0x7ffff780e360}}
        iter = hashbrown::raw::RawIter<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> {iter: hashbrown::raw::RawIterRange<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> {current_group: hashbrown::control::bitmask::BitMaskIter (hashbrown::control::bitmask::BitMask (16384)), data: hashbrown::raw::Bucket<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> {ptr: core::ptr::non_null::NonNull<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> {pointer: 0x7ffff780e540}}, next_ctrl: 0x7ffff780e550, end: 0x7ffff780e550}, items: 1}
#19 0x00005555559922ad in hashbrown::raw::RawTableInner::drop_inner_table<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>), alloc::alloc::Global> (self=0x7ffff78b6568, alloc=0x7ffff78b6588, table_layout=...)
    at /rust/deps/hashbrown-0.15.3/src/raw/mod.rs:2161
No locals.
#20 0x000055555599102c in hashbrown::raw::{impl#18}::drop<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>), alloc::alloc::Global> (self=0x7ffff78b6568) at /rust/deps/hashbrown-0.15.3/src/raw/mod.rs:3348
No locals.
#21 0x00005555559ade2b in core::ptr::drop_in_place<hashbrown::raw::RawTable<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>), alloc::alloc::Global>> ()
    at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#22 0x00005555559ae3fb in core::ptr::drop_in_place<hashbrown::map::HashMap<markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>, ahash::random_state::RandomState, alloc::alloc::Global>> ()
    at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#23 0x00005555559ae4cb in core::ptr::drop_in_place<std::collections::hash::map::HashMap<markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>, ahash::random_state::RandomState>> ()
    at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#24 0x00005555559add4b in core::ptr::drop_in_place<ahash::hash_map::AHashMap<markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>, ahash::random_state::RandomState>> ()
    at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#25 0x00005555559ae9ef in core::ptr::drop_in_place<scraper::node::Element> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#26 0x00005555559ae8c1 in core::ptr::drop_in_place<scraper::node::Node> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#27 0x00005555559af00b in core::ptr::drop_in_place<ego_tree::Node<scraper::node::Node>> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#28 0x00005555559af540 in core::ptr::drop_in_place<[ego_tree::Node<scraper::node::Node>]> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
--Type <RET> for more, q to quit, c to continue without paging--
No locals.
#29 0x0000555555a1ca83 in alloc::vec::{impl#25}::drop<ego_tree::Node<scraper::node::Node>, alloc::alloc::Global> (self=0x7fffffffd0b8) at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/mod.rs:3804
No locals.
#30 0x00005555559afb77 in core::ptr::drop_in_place<alloc::vec::Vec<ego_tree::Node<scraper::node::Node>, alloc::alloc::Global>> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#31 0x00005555559af01b in core::ptr::drop_in_place<ego_tree::Tree<scraper::node::Node>> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#32 0x00005555559ae81c in core::ptr::drop_in_place<scraper::html::Html> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#33 0x0000555555668f5c in scraper_broken::extract_redacted_from_html (
    html="\r\n\r\n\r\n<!DOCTYPE HTML>\r\n<html class=\"no-js\">\r\n<head>\r\n    <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />\r\n    <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge,chrome=1\">\r\n    "...) at src/main.rs:20
No locals.
#34 0x0000555555668d5a in scraper_broken::main () at src/main.rs:11
        test_data = "\r\n\r\n\r\n<!DOCTYPE HTML>\r\n<html class=\"no-js\">\r\n<head>\r\n    <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />\r\n    <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge,chrome=1\">\r\n    "...

adamreichold avatar May 16 '25 16:05 adamreichold

The only interaction between selectors and string_cache I can see is our CssLocalName which was not introduced in 0.21 but did gain an implementation of PrecomputedHash.

adamreichold avatar May 16 '25 16:05 adamreichold

Hhhmmm, replacing CssLocalName(LocalName) by CssLocalName(String) does not help, so this seems unrelated after all.

The example does minimize further though, i.e.

use scraper::{Html, Selector};
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static ALLOC: Jemalloc = Jemalloc;

fn main() {
    let test_data = include_str!("../repro.html");

    loop {
        let fragment = Html::parse_fragment(test_data);
        let row_selector = Selector::parse("tr").unwrap();
    }
}

adamreichold avatar May 16 '25 16:05 adamreichold

The selector value itself seems irrelevant, but the HTML test data is not... So a bug in html5ever seems possible?

adamreichold avatar May 16 '25 16:05 adamreichold

So I tried to reduce the HTML input and this does indeed move around the crash site.
<!DOCTYPE HTML>
<html class="no-js">
<body class="culture--en-GB">

    <table class="table"  uiid="0-5-0-0-0-0-1" data-sort-parameter="order-by">
        <tbody class="table__body">
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted 2016
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">


                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted 2017
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">

                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted 2018
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">


                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted 2019
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">


                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted 2020
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">

                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted 2021
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">

                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted 2022
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">

                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted redacted
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">

                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted Today
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">

                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted Yesterday
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">


<button name="run-redacted" ui-ajax="ui-ajax" value="0876d065-95f2-4ce6-9ec4-ced1e2b48e7e" uiid="0-5-0-0-0-0-1-34" id="0-5-0-0-0-0-1-34" class=" button--secondary" type="button">
    Run
</button>

                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted redacted
                    </td>
                    <td class="table__cell table__cell--grow">

                        Excel (unformatted .xlsx)
                    </td>
                    <td class="table__cell">


<button name="run-redacted" ui-ajax="ui-ajax" value="7bf8a530-7890-4134-a87f-ed8b6be6f944" uiid="0-5-0-0-0-0-1-37" id="0-5-0-0-0-0-1-37" class=" button--secondary" type="button">
    Run
</button>

                    </td>
            </tr>
        </tbody>
    </table>

</body>
</html>

@tamasfe I am sorry to have nothing more definitive, but could you try to get someone from html5ever involved? scraper itself does not have unsafe code and the unsafe code in ego-tree does not seem too likely even though it is possible (downgrading to ego-tree 0.6.2 as in scraper 0.19 also did not fix the issue).

adamreichold avatar May 16 '25 17:05 adamreichold

But then again, doing

use html5ever::tendril::TendrilSink;
use html5ever::{parse_fragment, ns, QualName, local_name};
use markup5ever_rcdom::{RcDom};
let dom = parse_fragment(RcDom::default(), Default::default(), QualName::new(None, ns!(html), local_name!("body")), Vec::new()).one(test_data);

instead of

let fragment = Html::parse_fragment(test_data);

does not crash which would point at our TreeSink implementation in some way...

adamreichold avatar May 16 '25 17:05 adamreichold

does not crash which would point at our TreeSink implementation in some way...

Oh well, but after removing all unsafe code from ego-tree, it still segfaults, i.e. even if we use no unsafe code at all via scraper or ego-tree.

adamreichold avatar May 16 '25 17:05 adamreichold

Thank you for looking into this, I'll try to bisect it and/or run it with miri over the weekend to see if I find anything.

Of course if it's not a scraper issue, I can report it upstream.

tamasfe avatar May 16 '25 18:05 tamasfe

miri over the weekend

I tried Miri, but stacked borrows fails somewhere inside tendril and tree borrows does not handle the pointer-to-integer casts in string-cache.

adamreichold avatar May 16 '25 18:05 adamreichold

I reduced the test data a bit further, the issue seems to be a combination of things in it, the html itself is well-formed, and adding/removing nodes randomly will trigger the issue.

I also looked into downgrading html5ever itself, but there are many breaking changes and I'm not familiar with the code so I gave up on that approach.

tamasfe avatar May 16 '25 19:05 tamasfe

Another possibility besides Miri and Valgrind would be Address Sanitizer, e.g.

> RUSTFLAGS=-Zsanitizer=address cargo +nightly run -Zbuild-std --target x86_64-unknown-linux-gnu

but I also did not get an actually useful stack trace out of that.

I wonder if the aim should be to run the non-crashing variant and get one of those tools to emit something useful instead of crashing into invalid stack frames etc.

adamreichold avatar May 16 '25 19:05 adamreichold

Address Sanitizer

Oddly enough the bug does not seem to trigger it.

Also I've noticed that I was using Html::parse_fragment, changing it to Html::parse_document works correctly, which is fair since we are parsing an entire document, I don't know what difference that makes though.

tamasfe avatar May 16 '25 19:05 tamasfe

Oddly enough the bug does not seem to trigger it.

I get a panic in our tree sink implementation, i.e. html5ever wants to know the name of an element which does not exist. And then Rust's panic handler segfaults trying to create a backtrace which suggests stack-based memory corruption (which jives with Valgrind's report).

adamreichold avatar May 16 '25 19:05 adamreichold

Also I've noticed that I was using Html::parse_fragment, changing it to Html::parse_document works correctly, which is fair since we are parsing an entire document, I don't know what difference that makes though.

Sorry, it works on the reduced test data, but running it with the original test data, it triggers a panic, which may or may not be related to this:

thread 'main' panicked at scraper/scraper/src/html/tree_sink.rs:64:18:
called `Option::unwrap()` on a `None` value

tamasfe avatar May 16 '25 19:05 tamasfe

Also I've noticed that I was using Html::parse_fragment, changing it to Html::parse_document works correctly, which is fair since we are parsing an entire document, I don't know what difference that makes though.

The main difference is that we tell html5ever to assume that everything is wrapped within a body element:

pub fn parse_document(document: &str) -> Self {
    let parser =
        driver::parse_document(HtmlTreeSink::new(Self::new_document()), Default::default());
    parser.one(document)
}

pub fn parse_fragment(fragment: &str) -> Self {
    let parser = driver::parse_fragment(
        HtmlTreeSink::new(Self::new_fragment()),
        Default::default(),
        QualName::new(None, ns!(html), local_name!("body")),
        Vec::new(),
    );
    parser.one(fragment)
}

Both are safe functions though, so it should not be possible to trigger a segmentation fault by one or the other.

adamreichold avatar May 16 '25 19:05 adamreichold

Also I've noticed that I was using Html::parse_fragment, changing it to Html::parse_document works correctly, which is fair since we are parsing an entire document, I don't know what difference that makes though.

Sorry, it works on the reduced test data, but running it with the original test data, it triggers a panic, which may or may not be related to this:

thread 'main' panicked at scraper/scraper/src/html/tree_sink.rs:64:18:
called `Option::unwrap()` on a `None` value

Running the initial reproduction but using parse_document repeatedly, I sometimes get just a segfault and sometimes this panic with a follow-up segfault. As written above, this really smells of stack-based memory corruption which of course makes the backtraces and such dubious.

adamreichold avatar May 16 '25 19:05 adamreichold

I used this crate v0.20.0 with wasm-bindgen, which used to compile and run just fine. But since v0.21.0, there seems to be some issue with allocator. With console error hook, I identified that the dlmalloc code panics.

panicked at /rust/deps/dlmalloc-0.2.7/src/dlmalloc.rs:1198:13:
assertion failed: psize <= size + max_overhead

I suspect that this issue is related to my wasm issue, and might be related to https://github.com/rustwasm/wasm-pack/issues/1389. (I'm on cargo 1.86.0 btw)

dvishal485 avatar May 18 '25 21:05 dvishal485

I had never seen this before! Will start debugging tomorrow

cfvescovo avatar May 18 '25 22:05 cfvescovo

Miri is complaining about Undefined Behavior: attempting a write access using <385749> at alloc158816[0x18], but that tag does not exist in the borrow stack for this location at servo_arc-0.4.0/lib.rs:806:21. I don't know if it's related but someone working on servo might want to look at it

cfvescovo avatar May 21 '25 12:05 cfvescovo

The problem is that there are valid patterns which stacked borrows (Miri's default) does not cover. One can usually get a grip on this by switching to tree borrows which however for technical reasons requires strict provenance which sadly is not compatible with our dependencies (they use integer-to-pointer cast which are not illegal per se).

So think that particular Miri error is a bit questionable, especially for a heavily used crate like servo_arc which for example is used by Firefox.

adamreichold avatar May 21 '25 16:05 adamreichold

Thanks for the explanation, I am not really used to debugging with Miri

cfvescovo avatar May 21 '25 20:05 cfvescovo

I used this crate v0.20.0 with wasm-bindgen, which used to compile and run just fine. But since v0.21.0, there seems to be some issue with allocator. With console error hook, I identified that the dlmalloc code panics.

panicked at /rust/deps/dlmalloc-0.2.7/src/dlmalloc.rs:1198:13:
assertion failed: psize <= size + max_overhead

I suspect that this issue is related to my wasm issue, and might be related to drager/wasm-pack#1389. (I'm on cargo 1.86.0 btw)

@dvishal485 Hey were you able to resolve this? I am having the same issue

har1kaush1k avatar Jul 29 '25 22:07 har1kaush1k

@dvishal485 Hey were you able to resolve this? I am having the same issue

No, I wasn't able to solve the issue on my own. Didn't try out any new release in recent.

dvishal485 avatar Jul 30 '25 03:07 dvishal485