tracing-forest icon indicating copy to clipboard operation
tracing-forest copied to clipboard

Long running span detection / warning

Open Firstyear opened this issue 3 years ago • 3 comments

In non-release builds, if a span is held open for more than 5 seconds, it should log a warning. You should be able to check this by setting a SystemTime at the time the span is opened, and the next time a write occurs to the span, checking that time difference.

This way you can detect and warn about potential span-leaks for unbounded memory growth (think socket listener).

Firstyear avatar Jan 24 '22 00:01 Firstyear

Can you expand on what scenarios this could occur in (i.e. socket listeners)? I want to better understand why this would ever happen.

I'm thinking this is more application specific. However, it could be implemented as an opt-in feature using a span's fields. For example, we could have something like

let span = info_span!(warn_after = "5s", "socket-listener");

QnnOkabayashi avatar Mar 14 '22 04:03 QnnOkabayashi

So consider some code like this:

https://github.com/kanidm/kanidm/blob/master/kanidmd/score/src/ldaps.rs#L142

If we attached a span to this tcp acceptor, because the infinite loop exists it will never close the span and so it will continue to build a tree up that won't be flushed till server shutdown.

Similar for a connected client, https://github.com/kanidm/kanidm/blob/master/kanidmd/score/src/ldaps.rs#L47 this will exist for as long as a client is connected, which could be a very long time depending on various factors. I've seen operationally, ldap clients connect and remain connected for hours at a time and issue thousands if not millions of operations. (IE mail routers).

So there are a few issues here:

  • When we are in a span like in acceptor, and we create a new task, is the new task detacted and has it's own span context?
  • Can we detect when a span is open for a long time? This way we know that we potentially have a memory leak / high memory usage condition being created?

Some ways to potentially go about this:

  • If the span has existed for more than 60seconds, then log a crit level warning immediately. You can detect this by checking a time window between "time of span creatation" and "time of log message being written". Could be a debug build only thing. And also log the warning every time it's detected, be noisy as heck :)
  • Have a long-running span handling code which issues infrequent flushes of the state so a span can exist for a long time and periodically writes out it's log state. This mayyyyy not play too well with the tree nature of this library though.
  • Have a flag on spans of some time that turns them into immediate processors?

What do you think?

Firstyear avatar Mar 14 '22 23:03 Firstyear

  • If the span has existed for more than 60seconds, then log a crit level warning immediately. You can detect this by checking a time window between "time of span creatation" and "time of log message being written". Could be a debug build only thing. And also log the warning every time it's detected, be noisy as heck :)

If I'm understanding this correctly, you would need to be very careful that the generated warning log isn't triggering another warning to be emitted to avoid recursion.

Also, it seems that this behavior is more Kanidm specific, and should be implemented as it's own Layer.

QnnOkabayashi avatar Mar 27 '22 15:03 QnnOkabayashi