tracing icon indicating copy to clipboard operation
tracing copied to clipboard

Slow `register_callsite()` can cause other subscribers to not receive `register_callsite()` at all

Open hds opened this issue 10 months ago • 1 comments

Bug Report

There is code below that reproduces this issue consistently (increase the sleep time to make it 100%).

I've uploaded a branch containing the test that reproduces the issue and my first attempt at a fix (which doesn't appear to fix anything). There's a draft PR open against v0.1.x to show the different #2744.

Version

  • tracing-core: 0.1.30
  • tracing: 0.1.37 (not necessary, can also be reproduced from just tracing-core)

Also confirmed present on the v0.1.x branch.

This error is not reproducible on master.

Platform

Linux and macOS confirmed. Have not tested on Windows.

  • Linux <name> 5.19.0-46-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
  • Darwin <name> 23.4.0 Darwin Kernel Version 23.4.0: Wed Feb 21 21:44:31 PST 2024; root:xnu-10063.101.15~2/RELEASE_X86_64 x86_64

Crates

tracing-core

Description

If a Subscriber's register_callsite() function takes too long to complete, it can cause other subscribers to not have register_callsite called at all. Even if the second subscriber is the one which is then recording the tracing event at that callsite.

  1. We start two threads, each thread: a. Registers a local Subscriber (via tracing::subscriber::set_default) b. Records a single event.

When the first subscriber takes too long inside register_callsite(), the second subscriber doesn't receive a call to register_callsite() at all, instead it receives a call to event() directly.

What is too long?

By ensuring the order in which the threads start executing, a thread::sleep of 50 microseconds will

I can reliably reproduce this issue with the code below with a thread::sleep of 50 microseconds, although it will sometimes succeed. I have seen it reproduce with a sleep of only 20 microseconds, although that was less frequent (3 failures out of 10 tries just now). The machine is a 6 core 3.xGHz Dell desktop that's about 6 years old - so it's not top of the line, but also not underpowered. Keep in mind that the test is calling println!, so that may be adding additional delay.

Here is a full reproduction of the issue as an integration test is below. Note that the assert that would actually make this test fail has been commented out so that it can be observed that subscriber-2 doesn't receive a call to register_callsite() at all (it is not that it receives it after the call to event() which I had originally suspected).

use std::{
    ptr,
    sync::atomic::{AtomicPtr, Ordering},
    thread::{self, JoinHandle},
    time::Duration,
};

use tracing_core::{
    callsite::{Callsite as _, DefaultCallsite},
    dispatcher::set_default,
    field::{FieldSet, Value},
    span, Dispatch, Event, Kind, Level, Metadata, Subscriber,
};

struct TestSubscriber {
    sleep: Duration,
    callsite: AtomicPtr<Metadata<'static>>,
}

impl TestSubscriber {
    fn new(sleep_micros: u64) -> Self {
        Self {
            sleep: Duration::from_micros(sleep_micros),
            callsite: AtomicPtr::new(ptr::null_mut()),
        }
    }
}

impl Subscriber for TestSubscriber {
    fn register_callsite(&self, metadata: &'static Metadata<'static>) -> tracing_core::Interest {
        if !self.sleep.is_zero() {
            thread::sleep(self.sleep);
        }

        self.callsite
            .store(metadata as *const _ as *mut _, Ordering::SeqCst);

        tracing_core::Interest::always()
    }

    fn event(&self, event: &tracing_core::Event<'_>) {
        let stored_callsite = self.callsite.load(Ordering::SeqCst);
        let event_callsite: *mut Metadata<'static> = event.metadata() as *const _ as *mut _;

        // This assert is the actual test.
        assert_eq!(
            stored_callsite, event_callsite,
            "stored callsite: {stored_callsite:#?} does not match event \
            callsite: {event_callsite:#?}. Was `event` called before \
            `register_callsite`?"
        );
    }

    fn enabled(&self, _metadata: &Metadata<'_>) -> bool {
        true
    }
    fn new_span(&self, _span: &span::Attributes<'_>) -> span::Id {
        span::Id::from_u64(0)
    }
    fn record(&self, _span: &span::Id, _values: &span::Record<'_>) {}
    fn record_follows_from(&self, _span: &span::Id, _follows: &span::Id) {}
    fn enter(&self, _span: &tracing_core::span::Id) {}
    fn exit(&self, _span: &tracing_core::span::Id) {}
}

fn subscriber_thread(idx: usize, register_sleep_micros: u64) -> JoinHandle<()> {
    thread::Builder::new()
        .name(format!("subscriber-{idx}"))
        .spawn(move || {
            // We use a sleep to ensure the starting order of the 2 threads.
            let subscriber = TestSubscriber::new(register_sleep_micros);
            let _dispatch_guard = set_default(&Dispatch::new(subscriber));

            static CALLSITE: DefaultCallsite = {
                // The values of the metadata are unimportant
                static META: Metadata<'static> = Metadata::new(
                    "event ",
                    "module::path",
                    Level::INFO,
                    None,
                    None,
                    None,
                    FieldSet::new(&["message"], tracing_core::callsite::Identifier(&CALLSITE)),
                    Kind::EVENT,
                );
                DefaultCallsite::new(&META)
            };
            let _interest = CALLSITE.interest();

            let meta = CALLSITE.metadata();
            let field = meta.fields().field("message").unwrap();
            let message = format!("event-from-{idx}", idx = idx);
            let values = [(&field, Some(&message as &dyn Value))];
            let value_set = CALLSITE.metadata().fields().value_set(&values);

            Event::dispatch(meta, &value_set);

            // Wait a bit for everything to end (we don't want to remove the subscriber
            // immediately because that will influence the test).
            thread::sleep(Duration::from_millis(10));
        })
        .expect("failed to spawn thread")
}

/// Regression test for missing register_callsite call (#2743)
///
/// This test provokes the race condition which causes the second subscriber to not receive a
/// call to `register_callsite` before it receives a call to `event`.
///
/// Because the test depends on the interaction of multiple dispatchers in different threads,
/// it needs to be in a test file by itself.
#[test]
fn event_before_register() {
    let subscriber_1_register_sleep_micros = 30;
    let subscriber_2_register_sleep_micros = 0;

    let jh1 = subscriber_thread(1, subscriber_1_register_sleep_micros);

    // This delay ensures that the event callsite has interest() called first.
    thread::sleep(Duration::from_micros(10));
    let jh2 = subscriber_thread(2, subscriber_2_register_sleep_micros);

    jh1.join().expect("failed to join thread");
    jh2.join().expect("failed to join thread");
}

If we add println! messages to the register_callsite and event methods, we get the following output:

subscriber-1: register_callsite: 0x000055555564bb88
subscriber-2: event (with callsite): 0x000055555564bb88 (stored callsite: 0x0000000000000000)
subscriber-1: event (with callsite): 0x000055555564bb88 (stored callsite: 0x000055555564bb88)

When the sleep in the first subscriber is set to 0, the expected behavior is observed:

subscriber-1: register_callsite: 0x000055555564bb88
subscriber-1: event (with callsite): 0x000055555564bb88 (stored callsite: 0x000055555564bb88)
subscriber-1: register_callsite: 0x000055555564bb88
subscriber-2: register_callsite: 0x000055555564bb88
subscriber-2: event (with callsite): 0x000055555564bb88 (stored callsite: 0x000055555564bb88)

My best guess is that the reregistration of the callsite (because there's a new subscriber subscriber-2 in the new thread) causes subscriber-2 to skip registering the callsite of the event when it executes it.

A description of the fix can be found in #2938.

hds avatar Oct 10 '23 14:10 hds

Thanks for the repro, I can look into the potential fix and see what else is necessary to make it pass!

hawkw avatar Oct 10 '23 16:10 hawkw