
Introduce structured, span-based observability through Logger interface

Open wvanlint opened this issue 8 months ago • 18 comments

This change allows users to create hierarchical span objects through the Logger interface for specific computations, such as the handling of HTLCs. These span objects will be held in LDK across the corresponding lifetimes before being dropped, providing insight into durations and latencies.

This API is designed to be compatible with https://docs.rs/opentelemetry/latest/opentelemetry/trace/trait.Tracer.html, but can also be used more directly to derive time-based metrics or to log durations.
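
For illustration, a span-creating extension to the Logger interface might look roughly like the following; the trait, method, and associated type names here are hypothetical, not the exact API added in this PR.

// Hypothetical sketch only; SpanLogger, Span, and start_span are illustrative names.
pub trait SpanLogger {
    /// User-defined span object; dropping it marks the end of the span.
    type Span;
    /// Starts a span for a computation, optionally nested under a parent span.
    /// LDK would hold the returned object for the computation's lifetime.
    fn start_span(&self, name: &str, parent: Option<&Self::Span>) -> Self::Span;
}

An OpenTelemetry-backed implementation could map start_span onto the Tracer trait linked above and end the underlying span when the returned object is dropped.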

Hierarchical RAII spans are currently added for:

  • HTLC lifetimes, HTLC state transitions, inbound to outbound HTLC forwarding (see functional_tests.rs).
  • Ping/pong request-response pairs.

wvanlint avatar May 02 '25 21:05 wvanlint

👋 I see @tnull was un-assigned. If you'd like another reviewer assignment, please click here.

ldk-reviews-bot avatar May 02 '25 21:05 ldk-reviews-bot

Codecov Report

:x: Patch coverage is 88.40749% with 99 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 88.99%. Comparing base (ed42300) to head (a7f103c).
:warning: Report is 71 commits behind head on main.

Files with missing lines Patch % Lines
lightning/src/ln/channel.rs 87.36% 28 Missing and 40 partials :warning:
lightning/src/ln/channelmanager.rs 71.69% 15 Missing :warning:
lightning/src/util/logger.rs 72.72% 6 Missing :warning:
lightning/src/chain/channelmonitor.rs 0.00% 3 Missing :warning:
lightning/src/ln/invoice_utils.rs 0.00% 3 Missing :warning:
lightning-dns-resolver/src/lib.rs 0.00% 1 Missing :warning:
lightning-net-tokio/src/lib.rs 0.00% 1 Missing :warning:
lightning/src/ln/onion_payment.rs 83.33% 0 Missing and 1 partial :warning:
lightning/src/util/test_utils.rs 95.65% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3766      +/-   ##
==========================================
+ Coverage   88.94%   88.99%   +0.05%     
==========================================
  Files         173      174       +1     
  Lines      123876   124808     +932     
==========================================
+ Hits       110177   111077     +900     
+ Misses      11248    11247       -1     
- Partials     2451     2484      +33     
Flag Coverage Δ
fuzzing 22.90% <44.85%> (+0.72%) :arrow_up:
tests 88.82% <88.40%> (+0.05%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.

codecov[bot] avatar May 02 '25 22:05 codecov[bot]

Haha no worries, I imagined it wasn't the same but wanted to write this up as a starting point.

I agree on the difference between function scope spans and higher-level state machine spans and how they answer different questions. Given the amount of logic that depends on messages, timers, and asynchronous persistence completions, I think we would need both complementing each other to have a complete understanding.

  • Function scope spans are clear and easy to optimize for, but don't cross certain boundaries and would not give a complete picture. User trait implementations are already easy to measure but it's difficult to decide when it's impactful to improve them.
  • Higher-level spans around the Lightning state machine would give an end-to-end view for Lightning payments and can identify rough areas of investigation.

If we're building higher-level state machine spans on top of messages outside of LDK as in https://discord.com/channels/915026692102316113/978829624635195422/1296877062015029258, would it lead to rebuilding/duplicating something similar to the Lightning state machine to get the full picture? It seems like channel.rs already has all the necessary information.

I'm also curious what we should optimize for. I'm assuming we would like to have maximal observability, while minimizing overhead. Is maintaining span objects in the Lightning state machine too invasive? Would there be a more minimal event API that can be added to channel.rs?

Having start(span: Span, parent: Option<&Span>) -> () and end(span: Span) -> () methods without the RAII concept could be less invasive, possibly with some helpers. We could also still verify in TestLogger whether all spans are ended correctly during tests.
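
As a rough sketch of that non-RAII shape (the names below are illustrative assumptions, not a concrete proposal):

// Hypothetical non-RAII span callbacks; LDK would construct and pass span identifiers
// instead of holding user objects across the state machine.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct SpanId(pub u64);

pub trait SpanEvents {
    /// Called when a tracked computation begins; `parent` nests it under another span.
    fn start(&self, span: SpanId, name: &str, parent: Option<SpanId>);
    /// Called when the computation completes; LDK retains nothing in between.
    fn end(&self, span: SpanId);
}

A TestLogger implementation could then record every span id passed to start and assert at the end of each test that it also saw a matching end.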

wvanlint avatar May 05 '25 22:05 wvanlint

Oops, was travelling last week.

Of course using only functional blocks as spans makes it less able to answer the types of questions I think you specifically want, e.g. "how long is it taking us to forward most HTLCs", but as it stands, it's hard to take that and answer "why is it taking us so long to forward some HTLCs".

I think this is somewhat the key point - we want to be able to answer questions like how long it's taking to send/receive/forward an HTLC, but also, importantly, why.

That may imply we need a bit of both - functional spans for CPU-blocked blocks (like process_pending_htlc_forwards) as well as "state step" notifications for HTLCs - i.e. something more like the following on the HTLC side:

enum HTLC {
    Inbound(InboundHTLCStateDetails),
    Outbound(OutboundHTLCStateDetails + AddedToOutboundQueue),
}
fn htlc_state_transitioned(channel_id, htlc_id: u64, payment_hash, new_state: HTLC);

This still wouldn't quite answer the "why is it taking so long to forward HTLCs", but maybe if we joined it with blame on wait-time inside a channel:

fn channel_waiting_on_peer_response(channel_id);
fn channel_peer_responded(channel_id);
fn channel_monitor_persist_start(channel_id, monitor_update_id);
fn channel_monitor_persist_finish(channel_id, monitor_update_id);
fn waiting_on_async_signer(channel_id);
fn async_signer_complete(channel_id);

From there we could answer the above question by looking at how long individual state transitions took - from InboundHTLCStateDetails::AwaitingRemoteRevokeToAdd to InboundHTLCStateDetails::Committed we can look at how much time was spent waiting on a peer vs waiting on a monitor persist or an async signing op. Then the latency from there to a corresponding HTLC::Outbound::AddedToOutboundQueue is on us (visible via monitor and async signer latency), and then the time from there to OutboundHTLCStateDetails::Committed can be seen in the ratio between waiting-on-peer and monitor/async-signing.

It's pretty gnarly to actually answer the why, here, sadly, though - you have to track the HTLC:channel relation and then track time in the channel "states" during the time the HTLC is in different states...I suppose we could break out the HTLC state transitions into channel sub-states like

enum ChanState {
    WaitingOnPeerResponse,
    WaitingOnMonitorPersist,
    WaitingOnAsyncSigner,
    NotWaiting,
}
fn htlc_state_transition(channel_id, htlc_id: u64, payment_hash, state: HTLC, channel_state: ChanState)

but that would require calling htlc_state_transition once for each HTLC in a channel every time the channel updates, which seems kinda ugly (I guess we could skip it if the HTLC is in the Committed state which would reduce it to only HTLCs that are changing) and it would still require tracking the HTLCs as they cross channel boundaries when forwarded.
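
For illustration, a consumer of such a callback could attribute per-HTLC latency by accumulating time per channel state between transitions. This is a rough sketch under the assumptions noted in the comments, not LDK code:

use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical consumer-side tracking. ChanState mirrors the enum above (with derives
// added so it can key a map), ChannelId is a stand-in for LDK's channel id type, and
// the callback signature is simplified (payment_hash and the HTLC state are omitted).
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum ChanState {
    WaitingOnPeerResponse,
    WaitingOnMonitorPersist,
    WaitingOnAsyncSigner,
    NotWaiting,
}

type ChannelId = [u8; 32];

#[derive(Default)]
struct HtlcTimings {
    current: Option<(ChanState, Instant)>,
    time_in_state: HashMap<ChanState, Duration>,
}

#[derive(Default)]
struct LatencyTracker {
    htlcs: HashMap<(ChannelId, u64), HtlcTimings>,
}

impl LatencyTracker {
    fn htlc_state_transition(&mut self, channel_id: ChannelId, htlc_id: u64, channel_state: ChanState) {
        let now = Instant::now();
        let timings = self.htlcs.entry((channel_id, htlc_id)).or_default();
        // Attribute the elapsed time since the last transition to the previous channel state.
        if let Some((prev_state, since)) = timings.current.take() {
            *timings.time_in_state.entry(prev_state).or_insert(Duration::ZERO) += now - since;
        }
        timings.current = Some((channel_state, now));
    }
}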

Of course this is all somewhat orthogonal to whether to do things RAII - either we require the user do the mapping on their side or we store the RAII objects and forward them around. Personally I prefer to not store user objects in our state machine, but I admit I don't really have a strong argument for not doing so - there's not really much difference in terms of runtime switching, but it does seem maybe-useful for the user to get a &self on the span closure. It's also less clear to me how to map the state transitions onto a Span, I guess we'd just be calling htlc_state_transition on the span object.

TheBlueMatt avatar May 13 '25 19:05 TheBlueMatt

🔔 1st Reminder

Hey @tnull! This PR has been waiting for your review. Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

ldk-reviews-bot avatar Jul 29 '25 09:07 ldk-reviews-bot

Apologies for the delay here, a project at work came up that needed to be prioritized.

I rebased the PR and added additional WaitingOn* spans as leaves in the HTLC span hierarchy. I still need to verify completeness there.

It's pretty gnarly to actually answer the why, here, sadly, though - you have to track the HTLC:channel relation and then track time in the channel "states" during the time the HTLC is in different states...I suppose we could break out the HTLC state transitions into channel sub-states

Right, the HTLC:channel relation can't be represented simply, but you can indeed break out the spans from the viewpoint of an HTLC, which feels like the right approach to me. If an HTLC needs to wait for a remote revocation, an HTLC-specific span can be started at that point and ended when the revocation arrives, even though other HTLCs might have started waiting for that revocation beforehand. These spans would be specifically from the viewpoint of each HTLC, but I believe they could explain the why. When complete, the WaitingOn* leaf spans should fully cover the lifecycle of an HTLC.
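
As a rough illustration of that per-HTLC viewpoint (the span names and the SpanLogger trait below are hypothetical, not the PR's actual API):

// Repeating the hypothetical SpanLogger sketch from the PR description so this stands alone.
pub trait SpanLogger {
    type Span;
    fn start_span(&self, name: &str, parent: Option<&Self::Span>) -> Self::Span;
}

// From a single HTLC's viewpoint, WaitingOn* leaf spans open whenever the HTLC is
// blocked on the remote peer, a monitor persist, or an async signer; together they
// should cover the HTLC's lifetime. Span names here are made up for illustration.
fn forward_htlc_example<L: SpanLogger>(logger: &L) {
    let htlc_span = logger.start_span("htlc", None);
    {
        // The HTLC is waiting on the counterparty's revoke_and_ack; the leaf span
        // ends (is dropped) when the revocation arrives.
        let _waiting_on_peer = logger.start_span("waiting_on_peer", Some(&htlc_span));
    }
    {
        // The HTLC is waiting on a ChannelMonitor persistence; the leaf span ends
        // when the persist completes.
        let _waiting_on_persist = logger.start_span("waiting_on_monitor_persist", Some(&htlc_span));
    }
    // htlc_span is dropped once the HTLC resolves (claimed or failed).
}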

The spans from the viewpoint of each HTLC might be complemented by other metrics/spans, such as measuring the duration of Persist calls more directly.

wvanlint avatar Aug 05 '25 01:08 wvanlint

We discussed this a bit at the meeting last Monday, and I suggested looking into a more event-based approach, which is simpler to track on the LDK end and, I think, done right, doesn't require much more tracking on the client end to get the same results, while also allowing other types of results for users who want them. I'm happy to discuss further here if you prefer the tracing-style approach we have here.

TheBlueMatt avatar Aug 20 '25 19:08 TheBlueMatt

Linking https://github.com/lightningdevkit/rust-lightning/pull/4211, where an attempt is made to get rid of logging boilerplate and also enable the use of spans through standard tracing.
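
For context, standard tracing spans of the kind that PR aims to enable look roughly like this (the function name and span fields below are made up for illustration):

use tracing::info_span;

// Illustrative only: a span around a monitor persist using the standard tracing crate.
fn persist_monitor_update(payment_hash_hex: &str) {
    let span = info_span!("monitor_persist", payment_hash = payment_hash_hex);
    let _guard = span.enter();
    // The update is written while the guard is held; the span closes when it is dropped.
}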

joostjager avatar Nov 06 '25 18:11 joostjager