consider redesigning resource states to use `tracing` more idiomatically
Currently, state attributes for resources and async ops are represented by tracing events with a particular structure, emitted within the resource's span. This is somewhat complicated, both in the instrumentation and in console-subscriber. These events consist of a field with the attribute's name and a value, plus two additional optional fields for numeric values: a field `<attribute_name>.op`, which can be "add", "sub", or "override", and a field `<attribute_name>.unit`, which optionally attaches a unit such as `ms`. The `.op` field is used to record numeric values as deltas rather than as absolute values.
Meanwhile, tracing already has a mechanism for attaching field values to spans, and for recording new values for those fields later, via the `Span::record` method. I think we should change state attributes for resources/async ops to use this instead.
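For illustration, here is a minimal sketch of that existing mechanism (the field name and values here are just examples, not actual tokio instrumentation): a field is declared up front on the span, possibly empty, and a new value is recorded for it later with `Span::record`.

```rust
use tracing::field::Empty;

fn example() {
    // declare the field when the span is created; `Empty` means "no value yet"
    let resource_span = tracing::trace_span!(
        "runtime.resource",
        concrete_type = "Sleep",
        kind = "timer",
        duration_ms = Empty,
    );

    // later, when the state changes, record a new value for the declared field
    resource_span.record("duration_ms", &150_u64);
}
```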
Motivation
Here are some of the issues and questions I've noticed about the current approach.
- It's fairly complicated.
- When spans for resources and async ops are created, there's no way to know all of the state attributes that will exist for that particular resource. This means that as individual `state_update` events occur, the attributes are added to a hashmap one by one as the console layer becomes aware of them. This isn't very efficient, as we might reallocate the hashmap multiple times. Also, if these maps are moved out of the aggregator task and updated in the `ConsoleLayer`, as in #238, they would have to be mutated to insert new attribute names, which would require adding a lock, even if the values themselves can all be updated atomically (since they're all integers or booleans). If all these fields were declared up front on the span, however, we would know about all the fields that exist when the resource/async op is created, even if their values aren't populated yet. That way, we could create a single hashmap (or array) with all the attributes already present and insert it into the span's extensions. This would only need a single allocation. If the values of those attributes are updated atomically, we wouldn't need any kind of locking to update or read from them (a sketch of what this could look like follows this list).
- I don't understand the advantage of recording changes in numeric values as deltas rather than as absolute values (the `.op` field and "add"/"sub"). All the values will eventually be reported to the console as absolute values in the wire format anyway. If we want to use these to display the "direction" in which a value has changed since the last update, we would have to implement that logic in the UI anyway, because the delta value is lost when sending the message on the wire. What do we gain from recording these values as deltas? As far as I can tell, it just makes the implementation much more complex.
- I'm not sure if recording units as a structured field really buys us anything. In fact, taking the units at each state update just complicates the recording logic, because it allows the unit to differ each time the state is updated. For example, if the previous state update event for a field was `tracing::trace!(target: "runtime::resource::state_update", my_duration = 100, my_duration.op = "add", my_duration.unit = "ms");` and then another one is recorded like `tracing::trace!(target: "runtime::resource::state_update", my_duration = 360, my_duration.op = "add", my_duration.unit = "us");`, the console-subscriber recording code would have to implement the unit conversion to record the new value correctly. This seems unnecessarily complex to me. As far as I can tell, the units field exists only to tell the UI how to display the value. I think we should change this so that the unit is implied by the field name, using whatever lowest-granularity unit we'll record the resource's state in. That way, console-subscriber doesn't need to be aware of units at all, and the console UI can detect the base unit from the name of the field and use that to display the value however it wants.
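To make the preallocation point above concrete, here's a rough sketch of what the console-subscriber side could look like under this proposal. The layer, map key, and value types here are hypothetical stand-ins, not the real console-subscriber types:

```rust
use std::collections::HashMap;
use tracing::{span, Subscriber};
use tracing_subscriber::{layer::Context, registry::LookupSpan, Layer};

/// Hypothetical per-attribute storage; in practice this would hold atomics.
#[derive(Default)]
struct AttributeValue(u64);

/// Hypothetical per-span attribute map kept in the span's extensions.
struct StateAttributes {
    values: HashMap<String, AttributeValue>,
}

struct PreallocLayer;

impl<S> Layer<S> for PreallocLayer
where
    S: Subscriber + for<'a> LookupSpan<'a>,
{
    fn on_new_span(&self, attrs: &span::Attributes<'_>, id: &span::Id, ctx: Context<'_, S>) {
        // every state attribute is declared on the span up front, so the full set of
        // field names is available from the span's metadata at creation time, even if
        // some values haven't been recorded yet
        let values = attrs
            .metadata()
            .fields()
            .iter()
            .map(|field| (field.name().to_string(), AttributeValue::default()))
            .collect();

        // a single allocation, inserted into the span's extensions; later updates only
        // need to modify existing entries
        if let Some(span) = ctx.span(id) {
            span.extensions_mut().insert(StateAttributes { values });
        }
    }
}
```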
Proposal
Here's the change I think we should make: remove the `runtime::resource::state_update` and `runtime::async_op::state_update` events, and replace them with fields declared up front on the resource/async op's span. We should remove the `.op` fields and always record absolute values. Finally, rather than recording units as an additional field, we should just append a base unit to the field name when appropriate, and allow the console UI to handle formatting based on units.
As an example of how we'd change this, consider `tokio::time::Sleep`. Currently, the instrumentation looks like this:
```rust
impl Sleep {
    pub(crate) fn new_timeout(
        deadline: Instant,
        location: Option<&'static Location<'static>>,
    ) -> Sleep {
        // ...
        #[cfg(all(tokio_unstable, feature = "tracing"))]
        let inner = {
            // ...
            let resource_span = tracing::trace_span!(
                "runtime.resource",
                concrete_type = "Sleep",
                kind = "timer",
                loc.file = location.file(),
                loc.line = location.line(),
                loc.col = location.column(),
            );

            let async_op_span = resource_span.in_scope(|| {
                tracing::trace!(
                    target: "runtime::resource::state_update",
                    duration = duration,
                    duration.unit = "ms",
                    duration.op = "override",
                );
                // ...
            });
            // ...
        };
        // ...
    }

    fn reset_inner(self: Pin<&mut Self>, deadline: Instant) {
        // ...
        #[cfg(all(tokio_unstable, feature = "tracing"))]
        {
            let _resource_enter = me.inner.ctx.resource_span.enter();
            // ...
            let duration = /* ... */;
            tracing::trace!(
                target: "runtime::resource::state_update",
                duration = duration,
                duration.unit = "ms",
                duration.op = "override",
            );
        }
    }

    // ... and so on ...
}
```
After my proposed change, it would look like this instead, which I think is much simpler:
```rust
impl Sleep {
    pub(crate) fn new_timeout(
        deadline: Instant,
        location: Option<&'static Location<'static>>,
    ) -> Sleep {
        // ...
        #[cfg(all(tokio_unstable, feature = "tracing"))]
        let inner = {
            // ...
            let resource_span = tracing::trace_span!(
                "runtime.resource",
                concrete_type = "Sleep",
                kind = "timer",
                loc.file = location.file(),
                loc.line = location.line(),
                loc.col = location.column(),
                duration_ms = duration,
            );

            let async_op_span = resource_span.in_scope(|| {
                // ... we would still create the async op span, and it would now also
                // declare its fields up front on the span ...
            });
            // ...
        };
        // ...
    }

    fn reset_inner(self: Pin<&mut Self>, deadline: Instant) {
        // ...
        #[cfg(all(tokio_unstable, feature = "tracing"))]
        {
            let _resource_enter = me.inner.ctx.resource_span.enter();
            // ...
            let duration = /* ... */;
            // call `Span::record` with the new value
            me.inner.ctx.resource_span.record("duration_ms", &duration);
        }
    }

    // ... and so on ...
}
```
Potential Issues
We would probably still want to keep handling the current `state_update` events as well, so that resources/async ops from tokio 1.15.x will still work with the console. This might limit some of the optimizations we can do with the new system, but I think we can at least make it so that most of those costs are only paid when running against the older versions.
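As a rough sketch of what that compatibility handling might look like (the `ConsoleLayer` stand-in and both method bodies here are hypothetical, not the actual console-subscriber implementation):

```rust
use tracing::{span, Event, Subscriber};
use tracing_subscriber::{layer::Context, registry::LookupSpan, Layer};

/// Stand-in for the real console layer, just to show where each path would hook in.
struct ConsoleLayer;

impl<S> Layer<S> for ConsoleLayer
where
    S: Subscriber + for<'a> LookupSpan<'a>,
{
    fn on_event(&self, event: &Event<'_>, _ctx: Context<'_, S>) {
        // legacy path: tokio 1.15.x still emits state updates as events
        if event.metadata().target() == "runtime::resource::state_update" {
            // ... visit the event's fields and update the stored attributes ...
        }
    }

    fn on_record(&self, _id: &span::Id, _values: &span::Record<'_>, _ctx: Context<'_, S>) {
        // new path: state updates arrive as `Span::record` calls on fields that were
        // declared up front when the resource/async op span was created
        // ... look up the span's attribute storage and update it ...
    }
}
```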
cc @zaharidichev what do you think of this proposal?
@hawkw Most of this sounds like very nice improvements. The only thing I am unsure about is the deltas part. Part of the reason I went with recording deltas (and that might very well not be such a good reason) is that it seemed to make instrumenting things more straightforward. Looking at some of the instrumentation code, it seems that it is cheaper to record deltas than to record the absolute state. Take this piece for example:
```rust
impl<'a, T: ?Sized> Drop for RwLockReadGuard<'a, T> {
    fn drop(&mut self) {
        self.s.release(1);
        #[cfg(all(tokio_unstable, feature = "tracing"))]
        self.resource_span.in_scope(|| {
            tracing::trace!(
                target: "runtime::resource::state_update",
                current_readers = 1,
                current_readers.op = "sub",
            )
        });
    }
}
```
Being able to emit the current readers from this point in the code would involve an additional `Acquire` load. Would that be a problem? Is it perhaps easier to mess things up and introduce strange data races if one is not careful when instrumenting? Anyway, these are my only concerns; happy to work on that otherwise. I will only be able to pick it up in the first week of the new year, though, as I do not have much access to my computer right now.
Hmm, I do see why deltas are easier on the instrumentation side in cases like `RwLock`/`Mutex`, especially since the actual count lives inside the semaphore and isn't exposed publicly. That is a potential concern I hadn't really thought of.
Internally, the semaphore performs a `fetch_add` when dropping a read guard: https://github.com/tokio-rs/tokio/blob/47feaa7a8957b47dc1c4b91fc9085ddebd11e5ed/tokio/src/sync/batch_semaphore.rs#L312 So we have already loaded the number of readers in that case, and could record the absolute value as (value returned by `fetch_add`) + 1. But the actual recording of the absolute values could be racy if multiple threads are recording at the same time: if one thread records a lower absolute value than another, but the lower value is recorded after the higher one, we would miss the higher value.
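To illustrate the race concretely (this is just a sketch with hypothetical names, not the real tokio internals):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Two threads drop read guards concurrently and each records the *absolute*
// reader count it computed.
fn drop_read_guard(readers: &AtomicUsize, recorded_readers: &AtomicUsize) {
    // `fetch_sub` returns the previous count, so this thread's view of the new
    // count is `prev - 1`
    let new_count = readers.fetch_sub(1, Ordering::AcqRel) - 1;

    // If the thread holding the staler `new_count` performs this store *after* the
    // thread holding the more recent count, the stale value overwrites the newer one
    // and an update is lost. Recording deltas with `fetch_add`/`fetch_sub` on the
    // recorded value instead avoids this problem.
    recorded_readers.store(new_count, Ordering::Release);
}
```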
So, I think there is a good argument for deltas based on that, hmm. I'll have to keep thinking about this.
Even if we are going to continue recording values as deltas, though, it would be nice to have a way to preregister which attributes a resource or async op will have when it's created. Possibly, we could add that on top of the current system, if we stick with the current system to avoid racy count updates. :thinking:
One option might be to switch to a single field on the span and use `Span::record`, but to special-case fields whose names end in `_delta`, and record `i64`s to indicate add/sub. That way, we could still record deltas when necessary, but could simplify the recording to use span fields and `Span::record`?
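A rough sketch of how that special-casing could look on the recording side (the visitor type and the single-counter storage here are hypothetical, just to show the idea):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use tracing::field::{Field, Visit};

/// Hypothetical visitor that applies `*_delta` fields as signed deltas to a stored
/// counter, and treats everything else as an absolute value.
struct StateVisitor<'a> {
    current: &'a AtomicU64,
}

impl Visit for StateVisitor<'_> {
    fn record_i64(&mut self, field: &Field, value: i64) {
        if field.name().ends_with("_delta") {
            // deltas are recorded as signed integers and applied to the counter
            if value >= 0 {
                self.current.fetch_add(value as u64, Ordering::AcqRel);
            } else {
                self.current.fetch_sub(value.unsigned_abs(), Ordering::AcqRel);
            }
        } else {
            self.current.store(value as u64, Ordering::Release);
        }
    }

    fn record_u64(&mut self, _field: &Field, value: u64) {
        // deltas are expected to arrive as `i64`s, so `u64` values are absolute
        self.current.store(value, Ordering::Release);
    }

    fn record_debug(&mut self, _field: &Field, _value: &dyn std::fmt::Debug) {}
}
```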
Yes, something like that might work. So, apart from any potential simplifications, the main motivation for all this is to be able to know the set of updatable state fields up front, correct? I kind of think that having access to the span at every point in the code where you would want to fire a state update might require quite a lot of wiring of spans through structs. How about a different approach:
Each async op/resource span is created with a set of fields that represent its attributes and are either empty or have some default value for the type (because if they were empty, we would not know the attribute's type up front). Now we know what attributes to expect on these resources, and we can stick them into the span's extensions. Then, if we keep the state update logic similar, where we simply issue events, we only process state updates for attributes that are known to us up front. Correct me if I am wrong, but this mostly solves the problem with the mutable hashmap.
Or maybe there is a much better way to do this state updating altogether and we should completely revamp it.
> I kind of think that having access to the span at every point in the code where you would want to fire a state update might require quite a lot of wiring of spans through structs
Hmm, I'm not sure how much of an issue this actually is; as far as I can tell, we already have the span wired through to most of the places where state updates are recorded? Typically, the spans are owned by the struct representing the resource or async op, and it looks like all the state update events are currently associated with those spans by calling `span.in_scope(|| { ... })` to enter the span explicitly when recording the state update.
Incidentally, this is significantly less efficient than just using `Span::record`: the `in_scope` call requires three `Subscriber` callbacks (one to enter the span, one to emit the event, and one to exit the span), and when the event is recorded, we have to traverse the span stack to find the span to associate the event with, and then look up that span by its ID and record the state update. If we used `Span::record`, there would only be a single subscriber callback, `on_record`, which is called with the ID of the span, so all we would have to do is look up the span by ID and record the update.
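To make the comparison concrete, here's the same `Sleep` duration update expressed both ways (assuming the caller already has the resource span and the new value in hand):

```rust
fn update_duration(resource_span: &tracing::Span, duration: u64) {
    // current approach: entering the span, emitting the event, and exiting the span
    // are three separate subscriber callbacks, and the subscriber has to walk the
    // span stack to figure out which span the event belongs to
    resource_span.in_scope(|| {
        tracing::trace!(
            target: "runtime::resource::state_update",
            duration = duration,
            duration.unit = "ms",
            duration.op = "override",
        );
    });

    // proposed approach: a single `on_record` callback that already carries the
    // span's ID
    resource_span.record("duration_ms", &duration);
}
```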
That makes sense indeed. I will look into that this week and post a draft PR so we can discuss it. Thanks for thinking through that!
@hawkw Now that #238 is merged, I would like to tackle this, but I had a couple of questions. Assuming that we know the attributes up front, we can have, as you mentioned, a map of these attributes on the `AsyncOp` and `Resource` stats and make the fields atomic values, something like:
```rust
struct Attributes {
    attributes: HashMap<FieldKey, Attribute>,
}

struct Attribute {
    name: proto::field::Name,
    value: Option<Value>,
}

enum Value {
    U64Val(AtomicU64),
    I64Val(AtomicI64),
    BoolVal(AtomicBool),
    StrVal(AtomicPtr<String>),
    DebugVal(AtomicPtr<String>),
}
```
What I am not entirely sure about is that the `value` of the attribute might be initialized when we create the field with some known value (like the duration of a `Sleep`), or uninitialized when the value is not known initially and will later be provided by the first call to `Span::record`. So can we express that with a `OnceCell` in this case, or do we need a `Mutex`?
> What I am not entirely sure about is that the `value` of the attribute might be initialized when we create the field with some known value (like the duration of a `Sleep`), or uninitialized when the value is not known initially and will later be provided by the first call to `Span::record`. So can we express that with a `OnceCell` in this case, or do we need a `Mutex`?
I think you can probably do that with a `OnceCell`. However, I can think of a couple of approaches that avoid the use of both `OnceCell` and `Mutex`.
One thing we could do is:
- remove the `Option`
- initialize the atomics to zero/false/null pointer when they're not present
- add an `AtomicBool` to the `Attribute` struct that tracks whether an initial value has been recorded

That way, we can differentiate between 0 and false values that actually mean a 0 or a false was recorded and those that are just initial values, and skip recording a field when the `AtomicBool` is unset. I will note that this isn't strictly necessary for the `Str` and `Debug` values, because a null pointer is never a valid value to actually record.
This has the downside that we have to synchronize between two atomics, but I think the behavior here shouldn't actually be all that complex, so it's probably fine.
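A minimal sketch of that first approach for a numeric attribute (type and field names here are just illustrative):

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// A numeric attribute that starts at zero, plus a flag recording whether a real
/// value has ever been written.
struct NumericAttribute {
    value: AtomicU64,
    recorded: AtomicBool,
}

impl NumericAttribute {
    fn record(&self, new: u64) {
        self.value.store(new, Ordering::Release);
        self.recorded.store(true, Ordering::Release);
    }

    /// Returns `None` until a value has actually been recorded, so an initial zero is
    /// distinguishable from a recorded zero.
    fn get(&self) -> Option<u64> {
        if self.recorded.load(Ordering::Acquire) {
            Some(self.value.load(Ordering::Acquire))
        } else {
            None
        }
    }
}
```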
Another option is to select a sentinel value for the unrecorded state, like `u64::MAX` and `i64::MAX` for the numerics, null for the strings, and so on. To do this, we'd have to change the bool value from being represented as an `AtomicBool` to a number, and use 0 for false, 1 for true, and 2 for "unset", or something. This means we still only have a single atomic to synchronize on. However, it has the disadvantage that we can't record max values for integers, which seems not great...
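And a sketch of the sentinel-value alternative (again, illustrative names only):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// sentinel meaning "no value recorded yet"
const UNSET: u64 = u64::MAX;

struct SentinelAttribute {
    value: AtomicU64,
}

impl SentinelAttribute {
    fn new() -> Self {
        // numeric attributes start at the sentinel; a boolean attribute would instead
        // be stored as 0 = false, 1 = true, 2 = unset in the same `AtomicU64`
        Self { value: AtomicU64::new(UNSET) }
    }

    fn record(&self, value: u64) {
        self.value.store(value, Ordering::Release);
    }

    /// Only a single atomic to synchronize on, but `u64::MAX` itself can never be
    /// recorded as a real value.
    fn get(&self) -> Option<u64> {
        match self.value.load(Ordering::Acquire) {
            UNSET => None,
            v => Some(v),
        }
    }
}
```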
@hawkw I think the problem with this approach is that, at the point of span creation, if a field has been defined as empty, then we do not really know which variant of `Value` to create the attribute with. Am I failing to understand something?
Oh, yeah, I hadn't thought about that. Hmm. We could just store the value as an `AtomicU64`, and then have an additional atomic metadata field that records what those bits should actually be interpreted as?
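Something along these lines, perhaps (a sketch only; the tag values and type name are made up):

```rust
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};

// type tags telling us how to interpret the stored bits
const TY_UNSET: u8 = 0;
const TY_U64: u8 = 1;
const TY_I64: u8 = 2;
const TY_BOOL: u8 = 3;

/// The raw value bits plus an atomic tag describing what they currently mean, so the
/// variant doesn't need to be known when the span (and this slot) is created.
struct DynamicAttribute {
    bits: AtomicU64,
    ty: AtomicU8,
}

impl DynamicAttribute {
    fn unset() -> Self {
        Self {
            bits: AtomicU64::new(0),
            ty: AtomicU8::new(TY_UNSET),
        }
    }

    fn record_u64(&self, value: u64) {
        self.bits.store(value, Ordering::Release);
        self.ty.store(TY_U64, Ordering::Release);
    }

    fn record_i64(&self, value: i64) {
        self.bits.store(value as u64, Ordering::Release);
        self.ty.store(TY_I64, Ordering::Release);
    }

    fn record_bool(&self, value: bool) {
        self.bits.store(value as u64, Ordering::Release);
        self.ty.store(TY_BOOL, Ordering::Release);
    }
}
```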