rust-libp2p icon indicating copy to clipboard operation
rust-libp2p copied to clipboard

feat(transport): Metrics oriented connection upgrade

Open dhuseby opened this issue 2 years ago • 8 comments

Description

I am looking for a way to collect aggregate metrics for each stream/substream at the transport level. I think the best way to do this is to create a connection upgrade that just passes buffers through while also sending tuples of timestamp, direction (e.g. in or out), and buffer size over a user-provided bounded channel. I think this would allow upgrading of arbitrary/all streams/sub-streams with each upgraded connection sending their measurements over a bounded mpsc channel to logic that then calculates the aggregate metrics over time.

Motivation

Currently it is not simple/possible to collect metrics for all/some of the network activity at the transport level--below any protocols. These metrics are useful for analytics and debugging as well as nifty "stats for geeks" displays in rust-libp2p applications.

Requirements

  1. Upgrade a connection to add measuring the in/out bytes passing through it.
  2. Report the timestamped and annotated measurements over an mpsc channel if one has been supplied.
  3. Examples for upgrading single connections and all connections along with an example behavior that consumes the metrics and emits behavior events with aggregate metrics based on the config.
  4. Maybe the example network behavior speaks a protocol that allows for the aggregate metrics to be sent over the network to another peer.

Open questions

  1. Is a connection upgrade the correct/best way to do this?
  2. Is the consumption of the measurements and the generation of aggregate metrics best done in a network behavior?

Are you planning to do it yourself in a pull request?

Maybe

dhuseby avatar Aug 15 '23 19:08 dhuseby

Does https://docs.rs/libp2p/latest/libp2p/trait.TransportExt.html#method.with_bandwidth_logging help your usecase?

Perhaps the metrics module should offer a collector that exports these metrics in prometheus form?

One of the tricky things here is that we can't scope stuff like bandwidth to peers or protocols because that would lead to a possible cardinality explosion in prometheus.

thomaseizinger avatar Aug 16 '23 09:08 thomaseizinger

Duplicate of https://github.com/libp2p/rust-libp2p/issues/3262.

mxinden avatar Aug 17 '23 09:08 mxinden

#3180 and #3262 only give totals. The main idea for #4331 was to emit a stream of tuples containing a timestamp, direction and buffer size on every connection. The idea being that a "connection manager" receiving the stream of events could do instantaneous bandwidth calculations on a per-stream and aggregate basis as well as maintain totals. This is similar to the idea of a "connection manager" responding to ping behavior messages to implement connection policies:

https://github.com/libp2p/rust-libp2p/blob/cbdbaa836e1159fb4e8b20e76c32f1af5ec66926/protocols/ping/src/lib.rs#L32-L46

dhuseby avatar Aug 17 '23 23:08 dhuseby

Thanks for clarifying! I'd still see this as a duplicate of https://github.com/libp2p/rust-libp2p/issues/3262 by changing https://github.com/libp2p/rust-libp2p/issues/3262 into reporting these stats per connection. The comment on https://github.com/libp2p/rust-libp2p/issues/3262 also applies to this issue here because the connection being emitted does not yet have an identifier at the time it is created. The identifier is only assigned within Swarm and cannot be accessed from within the Transport implementation.

thomaseizinger avatar Aug 18 '23 09:08 thomaseizinger

So with #3180 giving total aggregate bytes at the transport level, it would be interesting to also have the same thing on a per-stream basis. Mostly because I want to do some performance benchmarking of a single stream at the stream level while also benchmarking at the transport level to measure the loss caused by our stack.

Also the design for the transport metrics makes me wonder if the Arc + atomic counter is the fastest way to get the information out of the transport.

dhuseby avatar Aug 23 '23 00:08 dhuseby

So with #3180 giving total aggregate bytes at the transport level, it would be interesting to also have the same thing on a per-stream basis.

That would be great! Streams don't have a public identity (yet) so I wouldn't know how to implement this.

I think all muxers have some kind of internal ID so perhaps we can reuse that or we introduce our own tracking similar to listeners and connections.

The latter would have the advantage that we could scope the stream ID to the connection and do something like 1/1 (stream 1 on connection 1).

thomaseizinger avatar Aug 23 '23 05:08 thomaseizinger

Note that we now have metrics per transport protocol: https://github.com/libp2p/rust-libp2p/pull/4727

Adding more metrics to also track it per protocol is more difficult. One reason for that is that we shouldn't just use the protocol as a prometheus label. The number of protocols is technically unbounded and could thus lead to a cardinality explosion. Hence, to be safe, we'd have to limit at compile-time, which protocol we add as a label.

At that point, I am no longer sure how useful that is ..

thomaseizinger avatar Nov 23 '23 21:11 thomaseizinger

The number of protocols is technically unbounded and could thus lead to a cardinality explosion.

I argue that the number of locally supported protocols is likely bounded. Thus, whenever a protocol is negotiated (i.e. supported by the local node) it is fine to use it as a Prometheus label.

Either way, the problem described in https://github.com/libp2p/rust-libp2p/issues/3262#issuecomment-1682203245 still stands, i.e. how to get a hand on the protocol name in the first place.

mxinden avatar Nov 24 '23 09:11 mxinden