
Towards better visibility, debuggability and diagnostics

Open raulk opened this issue 6 years ago • 13 comments

The DHT is a pretty central element of the libp2p stack. As our adoption grows, users demand better visibility, debuggability and diagnostics. This issue pulls together ideas we've discussed.

Metrics

We need a way to collect and expose metrics on a per-query basis (e.g. by returning a stats object as a third return value from methods; a sketch follows the list below), as well as global moving aggregates/accumulators that can be queried at any time (or dumped periodically through an exporter like Prometheus).

  • When looking up a value, how many peers did I query? How many queries were answered with a value vs. with closer peers? What were the min/avg/max RTT times?
  • When storing a value, how many peers did I store it in?
  • When looking up a peer, how many peers did I have to ask?
  • min/avg/max RPC times per message per operation.
  • failure counting.
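
As a minimal sketch, this is roughly what such a per-query stats object could look like in Go. All type, field, and method names here are illustrative assumptions, not the actual go-libp2p-kad-dht API:

```go
package dhtstats

import "time"

// QueryStats is a hypothetical per-query stats object covering the
// questions above; every name in it is an assumption.
type QueryStats struct {
	PeersQueried    int           // peers contacted during the query
	ValueResponses  int           // responses that carried a value
	CloserResponses int           // responses that only carried closer peers
	Failures        int           // dials/RPCs that failed
	MinRTT          time.Duration // fastest per-message round trip
	AvgRTT          time.Duration // mean per-message round trip
	MaxRTT          time.Duration // slowest per-message round trip
}

// A lookup method could then return the stats as the proposed third
// return value, e.g.:
//
//	func (dht *IpfsDHT) GetValueWithStats(ctx context.Context, key string) ([]byte, *QueryStats, error)
```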

Debuggability/diagnostics

Introspective queries like the following would enable better management and diagnostics of the DHT.

  • What addresses of mine are stored in the DHT? Are they as expected?
  • What DHT records do I currently hold? Who have I served them to?
  • When did a record get created? Which peer ID stored the record? When was it last queried?
  • What provider records do I hold? When do they expire? Are the nodes I'm pointing to still alive?
  • Dump the routing table (a sketch follows below). Trace routing table changes.
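
For the routing-table dump in particular, go-libp2p-kbucket already exposes enough to sketch it, assuming the table is obtained through the DHT's RoutingTable() accessor:

```go
package dhtdebug

import (
	"fmt"

	kb "github.com/libp2p/go-libp2p-kbucket"
)

// dumpRoutingTable prints every peer currently held in the routing
// table, using the existing Size and ListPeers accessors.
func dumpRoutingTable(rt *kb.RoutingTable) {
	fmt.Printf("routing table holds %d peers:\n", rt.Size())
	for _, p := range rt.ListPeers() {
		fmt.Printf("  %s\n", p)
	}
}
```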

Some of these require additional bookkeeping. Some are too expensive/voluminous to track by default: they should be switched off out of the box, and users should opt in explicitly, knowing the implications.
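
One possible shape for that opt-in is the functional-options pattern the constructor already uses. Everything named below is a hypothetical sketch, not an existing dht option:

```go
package dhtdiag

// DiagConfig gathers the expensive, off-by-default bookkeeping knobs.
type DiagConfig struct {
	TraceRoutingTable bool // record every routing-table change
	TrackRecordReads  bool // remember which peers each record was served to
}

// Option is a functional option in the style of dht.New's options.
type Option func(*DiagConfig)

// WithDiagnostics opts in to the extra bookkeeping explicitly, so the
// cost is only paid by users who asked for it.
func WithDiagnostics(cfg DiagConfig) Option {
	return func(c *DiagConfig) { *c = cfg }
}
```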

Queriability

Collecting this wealth of information would be fruitless if we didn't expose it to the user via tooling. Unfortunately, libp2p lacks (for now) an instrumentation/monitoring/management subsystem to serve as a sink for all this data. A transitory, simple solution is to expose these metrics via a local gRPC endpoint or similar, and to develop a command-line tool (similar to ipfs dht) that serves as a frontend.
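
As a sketch of the "gRPC endpoint or similar" idea, here is a plain JSON-over-HTTP variant; the path, port, and Snapshot fields are all assumptions:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Snapshot is a hypothetical aggregate of the stats discussed above;
// a real integration would fill it from the DHT's accumulators.
type Snapshot struct {
	QueriesTotal     int `json:"queries_total"`
	FailuresTotal    int `json:"failures_total"`
	RoutingTableSize int `json:"routing_table_size"`
}

func main() {
	// Serve the snapshot on a local-only endpoint; a command-line tool
	// in the spirit of `ipfs dht` could use this URL as its backend.
	http.HandleFunc("/debug/dht", func(w http.ResponseWriter, r *http.Request) {
		snap := Snapshot{} // populate from the running DHT
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(snap)
	})
	log.Fatal(http.ListenAndServe("127.0.0.1:5030", nil))
}
```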

raulk, Feb 11 '19 15:02

I think this issue is too general.

  • What addresses of mine are stored in the DHT? Are they as expected?

This can be answered by querying the DHT.

  • What DHT records do I currently hold? Who have I served them to?

This can be determined by looking at the appropriate data store.

  • When did a record get created? Which peer ID stored the record? When was it last queried?

I'm not sure about this. Why? It's a lot of metadata.

  • What provider records do I hold? When do they expire?

Datastore again.

  • Are the nodes I'm pointing to still alive?

Routing table.

  • Dump the routing table.

Routing table.

  • Trace routing table changes.

This is an interesting one, and very useful for debugging. A logger subsystem, or a few callbacks that let users interpret the changes however they wish, would achieve this.
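
A sketch of the callback approach; the hook names and bucket argument are assumptions, not an existing go-libp2p-kbucket API:

```go
package dhttrace

import (
	"log"

	"github.com/libp2p/go-libp2p-core/peer"
)

// RTEvents sketches the callback approach: the routing table invokes
// these hooks on every change, and users interpret them as they wish.
type RTEvents struct {
	PeerAdded   func(p peer.ID, bucket int)
	PeerRemoved func(p peer.ID, bucket int)
}

// LoggingEvents wires the hooks to a plain logger: the simplest
// possible consumer of a routing-table change trace.
func LoggingEvents() RTEvents {
	return RTEvents{
		PeerAdded:   func(p peer.ID, b int) { log.Printf("rt: +%s (bucket %d)", p, b) },
		PeerRemoved: func(p peer.ID, b int) { log.Printf("rt: -%s (bucket %d)", p, b) },
	}
}
```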

Can we distill the specifics here into dedicated issues? We need to stay focused.

anacrolix, Feb 12 '19 00:02

@anacrolix Sure, go ahead. If you don't mind, just add backlinks from the children issues into this one, so we can treat it as an epic.

raulk, Feb 12 '19 00:02

One debug metric I've wanted for a long time is the number of items in each of the k-buckets, exported as a metric. This would let us debug/discover possible implementation errors.
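
A sketch of that export using a Prometheus GaugeVec labelled by bucket index; the metric name is made up:

```go
package dhtmetrics

import (
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

// bucketSize exports the occupancy of every k-bucket, labelled by
// bucket index.
var bucketSize = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "dht_kbucket_peers",
		Help: "Number of peers currently held in each k-bucket.",
	},
	[]string{"bucket"},
)

func init() { prometheus.MustRegister(bucketSize) }

// RecordBucketSizes should run whenever the routing table changes;
// sizes[i] is the current occupancy of bucket i.
func RecordBucketSizes(sizes []int) {
	for i, n := range sizes {
		bucketSize.WithLabelValues(strconv.Itoa(i)).Set(float64(n))
	}
}
```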

Kubuxu, Feb 12 '19 00:02

@Kubuxu On that subject, take a look at this: https://github.com/libp2p/go-libp2p-kad-dht/issues/194. I can tell you the answer already: the 7 furthest buckets are full, the 8th is half full, and the remaining 248 logical buckets are empty with extremely high likelihood.

P.S.: But yeah, that metric makes sense as a digest of the full routing table dump.

raulk, Feb 12 '19 00:02

Can we close this and create a metrics label? Super issues are too fluffy and conversation will be interleaved across different metrics.

anacrolix, Feb 19 '19 09:02

Let’s do both. Keep this one as an epic that serves as an entry point to the discussion for users and passers-by. Also open issues for the specific things we’ve decided to implement. I like the label.

raulk, Feb 19 '19 09:02

All the metrics stuff can be addressed by #252, #300, and #297.

anacrolix, Mar 18 '19 23:03

A list of metrics is tracked in #304.

anacrolix, Mar 21 '19 02:03

What's the overall state of metrics in libp2p? Right now I'm especially interested in two of them: https://discuss.libp2p.io/t/how-to-know-of-peers-dialed-of-dials-failed-per-each-find-peers-find-providers-query/341/4

daviddias, Nov 06 '19 21:11

For reference: here is the URL for the docs of the Stats API in js-libp2p that @pgte created a long time ago: https://github.com/libp2p/js-libp2p#switch-stats-api

daviddias, Nov 06 '19 21:11

Can we get the per-query metrics exported (https://github.com/libp2p/go-libp2p-kad-dht/blob/master/query.go#L106-L110)? It would help me understand the efficiency of our routing.

daviddias, Mar 10 '20 08:03

@daviddias those details would be part of a trace, because they are transactional metrics, i.e. they pertain to a particular transaction in the system. I don't think there's much value in calculating averages, counts and percentile distributions globally (which is what OpenCensus metrics are about -- runtime stats).
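
A sketch of how those per-query numbers could ride on an OpenCensus span; the span name, attribute names, and the doQuery stand-in are assumptions:

```go
package dhtquery

import (
	"context"

	"go.opencensus.io/trace"
)

// RunQuery attaches the per-query counters as span attributes, so they
// stay scoped to the transaction they describe.
func RunQuery(ctx context.Context, key string) {
	ctx, span := trace.StartSpan(ctx, "dht.Query")
	defer span.End()

	seen, queried, dialed := doQuery(ctx, key)
	span.AddAttributes(
		trace.Int64Attribute("peersSeen", int64(seen)),
		trace.Int64Attribute("peersQueried", int64(queried)),
		trace.Int64Attribute("peersDialed", int64(dialed)),
	)
}

// doQuery is a stand-in for the actual query loop in query.go.
func doQuery(ctx context.Context, key string) (seen, queried, dialed int) { return }
```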

raulk, Mar 10 '20 13:03

@daviddias those details would be part of a trace, because they are transactional metrics, i.e. they pertain to a particular transaction in the system.

That would work for the use cases I can think of 👍

Update: Ah! When I said export, I wasn't thinking in the "export from the Golang package" sense. I was just looking to have access to the information, hence a trace would be perfect!

daviddias, Mar 10 '20 13:03