rumqtt icon indicating copy to clipboard operation
rumqtt copied to clipboard

rumqttd: Changes remote start span from error to info.

Open nathansizemore opened this issue 1 year ago • 10 comments

This fixes initial connection span being treated as error instead of info.

Type of change

Miscellaneous (related to maintenance)

Checklist:

  • [x] Formatted with cargo fmt
  • [x] Make an entry to CHANGELOG.md if it's relevant to the users of the library. If it's not relevant mention why.

nathansizemore avatar Jan 06 '24 12:01 nathansizemore

it was set to error_span! because otherwise tracing won't capture / show additional fields with error log normally ( unless we set verbosity to -vvv )

see #510

swanandx avatar Jan 06 '24 13:01 swanandx

This causes false error types to show up in logs, through the logging facade tracing is built upon. So in the instance of the remote link being established, an error log is produced: image

This shows itself when logging blocks are built around the log facade directly (which we do for our network logger). This causes false alarming in our stuff, which we'd like to decrease.

nathansizemore avatar Jan 06 '24 14:01 nathansizemore

This also would bring in the same span type as websocket remote. It uses info, this uses error.

nathansizemore avatar Jan 06 '24 14:01 nathansizemore

This shows itself when logging blocks are built around the log facade directly

can you elaborate more? didn't get it :sweat_smile:

swanandx avatar Jan 06 '24 14:01 swanandx

This shows itself when logging blocks are built around the log facade directly

can you elaborate more? didn't get it 😅

Sure. If you look at the Rust ecosystem logging, after the dependency chain is done, you'll usually end with log. It doesn't do any logging itself, but provides the schemes to do so with the levels and macros. A common pattern is good, everyone can use it, and almost all crates do, this does, we do with ours, etc... Because it isn't a logger directly, just an interface, anyone can build a logger up around it to their own needs. Using us as an example, we have 3 impls in our stuff:

  • One that intercepts logs and sends them to stdout
  • One that intercepts logs and sends them to a file
  • One that intercepts logs and sends them to network store, where metrics, dashboards, alarming, etc... are centrailized.

Log is nice because each of those can have their own impl and format as needed. Log is also nice because you get all the other logs from applications that use it. However, this can also be a downside when people use logging differently. The case of rumqttd and us, it is great that we don't have to fork the broker and put in specialty logs because rumqttd uses log, but because rumqttd uses that error span, that shows up as error type on our end, because we log error types as error types. This causes a problem in the network logger because alarms and dashboards are built around it.

If for example, 100 clients connect at the same time, this produces 100 "error" logs of them connecting because of that error span. In alarming, if we have an alarm that is triggered when x error types reach a certain trreshold, this triggers that alarm, someone has to investigate, etc...

nathansizemore avatar Jan 06 '24 14:01 nathansizemore

because rumqttd uses that error span, that shows up as error type on our end

wdym by error type ?

btw, error_span! is used in other places, like for all events: https://github.com/bytebeamio/rumqtt/blob/8c043397fb55f40f037b7e7079e5b584d77095b7/rumqttd/src/router/routing.rs#L236 and other places in routing.rs so why only this error span trigger it?

maybe some issue with instrument()?

note: rumqttd uses tracing instead of log :smile:

swanandx avatar Jan 06 '24 15:01 swanandx

because rumqttd uses that error span, that shows up as error type on our end

wdym by error type ?

The log type comes in as "error", not info, debug, trace, etc...

btw, error_span! is used in other places, like for all events:

https://github.com/bytebeamio/rumqtt/blob/8c043397fb55f40f037b7e7079e5b584d77095b7/rumqttd/src/router/routing.rs#L236 and other places in routing.rs so why only this error span trigger it?

maybe some issue with instrument()?

Maybe it is just when instrument is called with it, this is the only spot it happens.

note: rumqttd uses tracing instead of log 😄

Yes, but if you run the dependency chain all the way down, you are using log :): https://github.com/tokio-rs/tracing/blob/master/tracing/Cargo.toml#L35

nathansizemore avatar Jan 06 '24 15:01 nathansizemore

Maybe it is just when instrument is called with it, this is the only spot it happens.

Might be issue with instrument then? 👀 We should investigate it further to figure out actual cause!

swanandx avatar Jan 06 '24 16:01 swanandx

Maybe it is just when instrument is called with it, this is the only spot it happens.

Might be issue with instrument then? 👀 We should investigate it further to figure out actual cause!

Not particularly a problem with instrumented. The problem is saying that events emitted inside this span are to be treated as errors, which I do not think is ultimately what is wanted.

When a thing is instrumented, anytime a poll or drop occurs, the span is entered:

impl<T: Future> Future for Instrumented<T> {
    type Output = T::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let (span, inner) = self.project().span_and_inner_pin_mut();
        let _enter = span.enter();
        inner.poll(cx)
    }
}

In our case, instrumenting begins, new connection happens, thing is polled, event is emitted, that event is an error event because the span it is in is an error span.

In regards to #510, it was changed from info to error not because the level needed changed for events that occurred, but because changing it to error was a way to get auto information logged. Keeping the span as info, and adding in the pieces of information to the log where needed is the way to go imo.

After checking, this does effect all other logs. I just didn't notice immediately because I forgot I muted all of routing because of this reason as weill:

    journal::init(
        env!("CARGO_BIN_NAME"),
        env!("CARGO_PKG_VERSION"),
        Some(vec!["rumqttd::router::routing".to_owned()]),
    );

(The last param is a vec of modules to blacklist from our logging). Currently it is blacklisted because everything reports as error type.

nathansizemore avatar Jan 06 '24 17:01 nathansizemore

After checking, this does effect all other logs. I just didn't notice immediately because I forgot I muted all of routing because of this reason as weill

oh, then it makes sense. ThenTtis means we should also change error_span! to appropriate span level in routing.rs as well right?

Keeping the span as info, and adding in the pieces of information to the log where needed is the way to go imo.

adding those pieces, like client_id/connection_id in every log defeats the purpose of span, we should think of some other way to tackle it.

I will see if there is something we can do at tracing_subscriber end :eyes:

swanandx avatar Jan 08 '24 09:01 swanandx