
[logging] Add a span for the validator ID to help distributed logging

Open huitseeker opened this issue 1 year ago • 3 comments

We are starting to read federated logs from several validators, e.g. the ones run by our docker-compose setup.

Our aggregation techniques for this are not great: they mix the output of every validator into the same stdout / file and jumble the provenance of each log line.

One simple way of making "who sent which line of logs" easy to understand would be to instrument with a default tracing span that annotates each log line with an identifier of the logging validator (e.g. the first few bytes of its public key).
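
A minimal sketch of what that could look like with the `tracing` crate (the key encoding, function name, and field names here are assumptions for illustration, not existing code):

```rust
use tracing::{info, info_span};

// Run a validator's work inside a span that carries a short identifier derived
// from its public key, so every log line it emits is tagged with that ID.
fn run_validator(public_key_hex: &str) {
    // Use the first few bytes of the (hex-encoded) public key as a short ID.
    let short_id = &public_key_hex[..8.min(public_key_hex.len())];

    // All events emitted while this span is entered carry `validator=<id>`.
    let span = info_span!("validator", validator = %short_id);
    let _guard = span.enter();

    info!("primary started");
    // With tracing_subscriber's default fmt layer this renders roughly as:
    //   INFO validator{validator=1a2b3c4d}: primary started
}
```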

/cc @velvia @bmwill @allan-bailey who may have contrasting ideas. /cc @sadhansood @asonnino who are in the midst of these debugging efforts.

huitseeker avatar Jul 18 '22 19:07 huitseeker

+1 on this one; with the e2e tests we introduced recently (e.g. see here) I see the need for it more and more. A point to add is that we might want to avoid any approach that uses a static / thread-local, as this won't work well when we bootstrap multiple in-memory nodes (e.g. via Cluster).
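
One way to sidestep any static or thread-local state is to attach the span to each node's task, so several in-memory nodes can coexist in one process. A rough sketch, assuming a tokio runtime and illustrative names:

```rust
use tracing::{info, info_span, Instrument};

async fn run_node() {
    // Inherits the `node{id=..}` span attached by the caller below.
    info!("node started");
}

#[tokio::main]
async fn main() {
    let mut handles = Vec::new();
    for id in 0..4 {
        // The span travels with the future itself, so each in-memory node keeps
        // its own identity even when tasks migrate across worker threads.
        handles.push(tokio::spawn(run_node().instrument(info_span!("node", id = %id))));
    }
    for handle in handles {
        handle.await.unwrap();
    }
}
```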

akichidis avatar Jul 25 '22 12:07 akichidis

@huitseeker so here's the thing. You could do some global instrumentation for the node and that sort of thing in the app, which helps docker-compose, but it's usually not done because it's usually not necessary.

The reason is that usually, you deploy in the cloud, and cloud log aggregation infrastructure, even CloudWatch and simpler stuff like Loki, will automatically prepend container and host information to the log metadata.

So there are a couple of ways we could go with this that I can think of:

  1. Standardize on a setup closer to what is deployed (e.g. local Kubernetes instead of docker-compose), and use Loki etc. (i.e. the exact same setup as deployed), which automatically gives you per-node logs (and metrics). The problem with this is that it's much more resource-heavy for laptops.
  2. We can add some library functions, macros etc. to automatically add in the hostname and so on (see the sketch after this list). However, this would be duplicate work and unnecessary for anything deployed; it is strictly to make the docker-compose setup easier.
  3. We could add a script to automatically prepend and aggregate logs from multiple docker compose containers. A poor man's log aggregator.
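
For option 2, the helper could stay small. A hedged sketch: nothing like `node_span` exists in the codebase today, and the `HOSTNAME` fallback is an assumption about the docker-compose environment:

```rust
use tracing::{info_span, Span};

// Build a span carrying both the host/container name and the validator's short
// ID, so the same call works under docker-compose, in the cloud, and in tests.
pub fn node_span(validator_id: &str) -> Span {
    // docker-compose containers usually expose their container ID as HOSTNAME;
    // fall back to a placeholder when it is unset (e.g. plain `cargo test`).
    let host = std::env::var("HOSTNAME").unwrap_or_else(|_| "unknown-host".to_string());
    info_span!("node", host = %host, validator = %validator_id)
}
```

A Cluster-style test could then wrap each node's task with such a span, while deployed nodes keep the container metadata the cloud side already adds.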

velvia avatar Jul 26 '22 17:07 velvia

@velvia in practice, we do not have the ability to differentiate between "logical" hosts in Cluster tests (where we emulate multi-host with multi-processing or multi-tasking). We can and do filter logs per pod in Loki / CloudWatch, but that prevents seeing timing correlations across logs that happen to be timestamped. Sometimes it's useful to look at a merged view of all of them ... which makes zero sense without tracking provenance. I know, eyeballing distributed interactions from log timestamps is a hazardous exercise ... but not a useless one.

I'm really looking for a simple approach that works in all contexts here.

huitseeker avatar Jul 30 '22 20:07 huitseeker