lighthouse
lighthouse copied to clipboard
feat: Integrate tracing
Issue Addressed
Tracing Integration
Proposed Changes
- [x] replace slog & log with tracing throughout the codebase
- [x] implement custom crit log
- [x] make relevant changes in the formatter
- [x] replace sloggers
- [x] re-write SSE logging components
cc: @macladson @eserilev
Tracing
Slog
at the moment this is what a crit log looks like
its labeled as ERROR but with an error_type = crit. Im wondering if we can tweak the formatting here to display CRIT in purple instead of ERROR (as we have currently for our slog impl)
Thanks again for the feedback !!
Most of my suggested changes revolve around making sure we stick to display formatting vs debug formatting.
- Trying to push all the fixes asap
I think we have some options to simplify the crit! macro a bit For crit level errors we'll want to display the Crit log similarly to how they are currently displayed with slog. This will involve implementing a custom tracing layer that takes in those crit logs and formats them appropriately.
- For crit! macro,I took the error! macro's code from here(didn't tweak it much,so it's still bloated) , just included the
error_type = critfield there for now, planning to handle it's formatting while re-writing the custom-formatter
In situations where we use slog:o! I think we can replace those with a tracing span
- Can we maybe remove that thing totally? Not too sure about this but I think most of the places where
o!is used will get removed eventually
Can we maybe remove that thing totally? Not too sure about this but I think most of the places where o! is used will get removed eventually
We definitely want to remove slog:o!, but we still want the additional context that it provides. The alternative that tracing provides is called a span
https://docs.rs/tracing/latest/tracing/span/index.html
Tracing
Slog
Tweaked the formatter, here's what the logs look like now(as per the suggestions ,tried to keep the format somewhat similar to what we had before with slog)
at the moment this is what a crit log looks like
its labeled as
ERRORbut with anerror_type = crit. Im wondering if we can tweak the formatting here to displayCRITin purple instead ofERROR(as we have currently for ourslogimpl)![]()
Done!
⚠️ The sha of the head commit of this PR conflicts with #6772. Mergify cannot evaluate rules on this PR. ⚠️
I did a little bit of testing with this branch. Some of the incompatibilities I noticed. I'm not totally against having new behaviour, but would like to find solutions for these if possible:
- The discv5 logs are present in the logging by default. I think by default, its better to not pollute lighthouse logging with discv5 specially as debug logs in discv5 are really verbose. Having the discv5 logs being redirected to a separate file is good.
- The logs from stdout when written to file appear with colours which adds additional characters that make it hard to read in any editor. For e.g. this is what I see from vscode
I think we can have match the slog behaviour by initializing the tracing with
.with_ansi(false). I would really prefer for this to be the default behaviour, not sure if this is an issue for others as well - The service: XYZ suffix doesn't exist anymore. I think this is unavoidable as we aren't passing around a log object but wanted to just highlight the new behaviour. Not sure if it affects indexing in caribou? cc @antondlr
Gonna continue testing this tomorrow
@pawanjay176
The service: XYZ suffix doesn't exist anymore. I think this is unavoidable as we aren't passing around a log object but wanted to just highlight the new behaviour
We actually should be including the service name in logs. Tracing has this concept of "spans" which allows us to add context to logs. We've tried to init these spans everywhere a slog:o! macro used to exist. So if you're not seeing the service name in logs it could be that we either missed initializing a span for one of the services, or maybe the log formatting is excluding the service name entirely.
I like the coloured logging to file. I think that makes it nicer and easier to read. Except for those red thingy's in there, but I assume thats some vscode thingy
@ThreeHrSleep - Looks like there's still some todo's in here. I've gone through and fixed most of the stuff @macladson has flagged, there is a lot of comments still in one file.
When you get a sec, could you go through and try and make sure there are no more todo's for this tracing PR left before we merge it.
In the next day or so, I'll have a look over the SSE part, which is what @macladson has pointed out. I'll try and fix anything in there that needs to be done and remove any todo's in that part. But would be good to have another check over everything else to make sure we're not missing todo's anywhere.
I had a quick skim over the SEE logging and it looks like its implemented.
I'll try running some versions like pawan and see if I notice anything and will try the SSE logging.
I'll also give this another skim for todo's and some other things like what @macladson has discovered.
Apologies for the delay & for leaving that much mess in there. Thanks for fixing the stuff up!
Dealt with whatever tracing-related TODOs were left , Let me know if I missed/broke something (again) : )
introduce logfile-color flag to enable/disable coloured logging to file
oh thank you, as someone who has to parse logs I've got ... thoughts ... about ANSI control characters in plain text files 🙃
I'm going to resolve the conflicts caused by the redb merge
Resolved merge conflicts. Only thing of note is that I dropped the blobs_pruned from the Blob pruning complete log line. I'm planning to revisit that in a follow-up PR soon anyway, so just kept it simple for now.
Fixed Clippy too: https://github.com/sigp/lighthouse/pull/6339/commits/e5f016a7a7fc6a696a7b47a5747804e1e7371ccc
Doppelganger failure is exit code 137, which implies OOM. We should try:
- [ ] Running a node on our infra (Holesky is fine) and observing log volume + memory usage.
- [ ] Running the doppelganger tests locally to observe mem usage
Doppelganger failure is exit code 137, which implies OOM. We should try:
- [ ] Running a node on our infra (Holesky is fine) and observing log volume + memory usage.
two remarks;
- we have an OOM 'snitch', if an OOM happens on our infra we will get alerted.
- observing log volume may be non-trivial, as I expect there may be some work to be done wrt. parsing. this needs to happen anyway though, I'll see if anton or I can carve out some time.
This pull request has merge conflicts. Could you please resolve them @ThreeHrSleep? 🙏
This pull request has merge conflicts. Could you please resolve them @ThreeHrSleep? 🙏
This pull request has merge conflicts. Could you please resolve them @ThreeHrSleep? 🙏
This pull request has merge conflicts. Could you please resolve them @ThreeHrSleep? 🙏
Just want to point out,
As we currently have gossipsub integrated into Lighthouse (i.e its currently not strictly a dependency), we are showing tracing logs to std out for gossipsub.
Once we shift to the rust-libp2p fork, I'd imagine these will be sorted.
cc @jxs - Maybe you could link the PR that shifts us away from internal gossipsub.
Just want to point out,
As we currently have gossipsub integrated into Lighthouse (i.e its currently not strictly a dependency), we are showing tracing logs to std out for gossipsub.
Once we shift to the rust-libp2p fork, I'd imagine these will be sorted.
cc @jxs - Maybe you could link the PR that shifts us away from internal gossipsub.
thanks age, I am working on https://github.com/sigp/lighthouse/pull/6905 first, do we require that PR before merging this one?
Yeah, lets merge that before this.
I've been running this PR and finding quite a few issues.
For example:
Feb 05 13:45:59.120 DEBUG Requesting batch epoch=339076 batch=Start Slot: 10850432, End Slot: 10850464, State: Downloading batch_state=[P,p,d,E,E] service="sync"peer_id=PeerId("16Uiu2HAkytpf5xwHoRu13aCHPmqwFPF4YM5QT2jeGyiiKUc81c3F") chain_id=1 batch_id=Epoch(339075) request_id=4 blocks=[RpcBlock(0x688d084607e4cd606ee769f88b30d19754702162363f1848238ab449746c40a0), RpcBlock(0x56bec02431edc14287be8588de5c8941ce1a680a35fc65f6f170686b7a34ca83), RpcBlock(0x1d07ca51299b7242ba500186ea614642da7ec8910bb93757a42bcec6cbebbd7d), RpcBlock(0x067b11e6bcd6e5a48951cdc0aa34645ac7a8893e48fa5a18f2ea792ee86e2942), RpcBlock(0xc128107c0ad80b27ca8b8a957c71fed2362c5255f92e7f1d4ba7f81f0d3e47b2), RpcBlock(0x8fea4a27e4d3057cdf394f28b8f78d6292b9b6cf9ca2ac1facc8eff5f52eb07d), RpcBlock(0x93f05d2bd573fbc72e107f7c12f90a2c074cce30060da142d579c8216a681c3c), RpcBlock(0xca3f8a367c5518c817456740962e215f23de1e571ac669e62896913e05331a62), RpcBlock(0xcab07a9e3e55a7b349f081c6cc06517f0cc747e1e8aa3e42a3aafcd055aba02a), RpcBlock(0x797785e5c47d3669800eaeed1e75de8c9805d7a869b99aea7e69a47924d5a2bb), RpcBlock(0xadcd2baf629dd26b8e357e1ff5e218b4bc19720b89912272fd390c967cd5bd2d), RpcBlock(0x37bef40220e3e5672b4c597fb2501c7f4e514a9884a6b7f733ca20721f9f4aaa), RpcBlock(0x28640f2dfb3b5ca6b824e7d63b4f7501565f1ebfa55ec0f7c5fd57f0b2ba2dcd), RpcBlock(0x305de6ea03f2b09c1629003f05cd081332678af7d5793ce807183f158bab3583), RpcBlock(0xf6810cd4011adb30ae8ce0ae7807c8277fe6a914d35a6206418dfcce9dcdeb4d), RpcBlock(0x0dcecba0cdae37b11df92485b3e25f3d03be0a2581043c932372d07651c3278d), RpcBlock(0xa9dad92b5bd5ec2683f8ed5a69a0984f171a96bab4c0e18785e774add44df2ac), RpcBlock(0x33f4a75d4e758ecd1f00020aa3bacb5744ae52eebabf2e1ebc20caa99088491e), RpcBlock(0xd0064d581b36255b8a608d413afbedbf0d65445050825c7d1b67fced669d7f0d), RpcBlock(0x9ca8805ea636e24faa4c146e65fec0bd0f51c75c7e6cc11c36c2cdb5fba6eae2), RpcBlock(0x521794d05fcaed12195b29ef6127dba41e87ed2af7479e5d4ab5c084671f4bed), RpcBlock(0x860efc8b0cf97dddfde3c4cca1576bcb180281bb1d3e565088cb2e86bf84d1a7), RpcBlock(0x865c297746c845cac196042da24139f4c307fb4bcce82889be7943f0ff31eb92), RpcBlock(0xd79d4306724462651076e34310795ebee6905b9973fa1850aa2421a392ebb66f), RpcBlock(0xaee9184eb6ec43f9a2fd97e5bfbb8a626e95c71b456ca5e92f300e293cfc293b), RpcBlock(0xe1174affef10a13dbd357deae738ac71836c50b37d49260ba12cdfde7a5693b0), RpcBlock(0xc6d2a7a6dcb6df2bed7674e1f4b60eef48a2cc5791bd56573f3b6455ed641834), RpcBlock(0x2079be6e78a4f89905370ca1dc565c8647ef14395c8d1faf447ec846cfcf1535), RpcBlock(0x5abac96bcd863bd3072972b0ccf344faa26ff221d19bb7edbf066f57e3278fa8), RpcBlock(0xf15d771e85fbc68774d8927151cb1bf1af0363118dca6ac59c3c567acf310bed), RpcBlock(0x117e49da2567d47cca1b2fa2323d018581be9caa6c2631845b5448abb8079732), RpcBlock(0xb720743e6b5f55272cac3f95218ba0423f140bb384632c70c17091b91fb60c29)] component="range_sync"batch_id=Epoch(339075) peer_id=PeerId("16Uiu2HAkytpf5xwHoRu13aCHPmqwFPF4YM5QT2jeGyiiKUc81c3F") request_id=4 blocks=[RpcBlock(0x688d084607e4cd606ee769f88b30d19754702162363f1848238ab449746c40a0), RpcBlock(0x56bec02431edc14287be8588de5c8941ce1a680a35fc65f6f170686b7a34ca83), RpcBlock(0x1d07ca51299b7242ba500186ea614642da7ec8910bb93757a42bcec6cbebbd7d), RpcBlock(0x067b11e6bcd6e5a48951cdc0aa34645ac7a8893e48fa5a18f2ea792ee86e2942), RpcBlock(0xc128107c0ad80b27ca8b8a957c71fed2362c5255f92e7f1d4ba7f81f0d3e47b2), RpcBlock(0x8fea4a27e4d3057cdf394f28b8f78d6292b9b6cf9ca2ac1facc8eff5f52eb07d), RpcBlock(0x93f05d2bd573fbc72e107f7c12f90a2c074cce30060da142d579c8216a681c3c), RpcBlock(0xca3f8a367c5518c817456740962e215f23de1e571ac669e62896913e05331a62), RpcBlock(0xcab07a9e3e55a7b349f081c6cc06517f0cc747e1e8aa3e42a3aafcd055aba02a), RpcBlock(0x797785e5c47d3669800eaeed1e75de8c9805d7a869b99aea7e69a47924d5a2bb), RpcBlock(0xadcd2baf629dd26b8e357e1ff5e218b4bc19720b89912272fd390c967cd5bd2d), RpcBlock(0x37bef40220e3e5672b4c597fb2501c7f4e514a9884a6b7f733ca20721f9f4aaa), RpcBlock(0x28640f2dfb3b5ca6b824e7d63b4f7501565f1ebfa55ec0f7c5fd57f0b2ba2dcd), RpcBlock(0x305de6ea03f2b09c1629003f05cd081332678af7d5793ce807183f158bab3583), RpcBlock(0xf6810cd4011adb30ae8ce0ae7807c8277fe6a914d35a6206418dfcce9dcdeb4d), RpcBlock(0x0dcecba0cdae37b11df92485b3e25f3d03be0a2581043c932372d07651c3278d), RpcBlock(0xa9dad92b5bd5ec2683f8ed5a69a0984f171a96bab4c0e18785e774add44df2ac), RpcBlock(0x33f4a75d4e758ecd1f00020aa3bacb5744ae52eebabf2e1ebc20caa99088491e), RpcBlock(0xd0064d581b36255b8a608d413afbedbf0d65445050825c7d1b67fced669d7f0d), RpcBlock(0x9ca8805ea636e24faa4c146e65fec0bd0f51c75c7e6cc11c36c2cdb5fba6eae2), RpcBlock(0x521794d05fcaed12195b29ef6127dba41e87ed2af7479e5d4ab5c084671f4bed), RpcBlock(0x860efc8b0cf97dddfde3c4cca1576bcb180281bb1d3e565088cb2e86bf84d1a7), RpcBlock(0x865c297746c845cac196042da24139f4c307fb4bcce82889be7943f0ff31eb92), RpcBlock(0xd79d4306724462651076e34310795ebee6905b9973fa1850aa2421a392ebb66f), RpcBlock(0xaee9184eb6ec43f9a2fd97e5bfbb8a626e95c71b456ca5e92f300e293cfc293b), RpcBlock(0xe1174affef10a13dbd357deae738ac71836c50b37d49260ba12cdfde7a5693b0), RpcBlock(0xc6d2a7a6dcb6df2bed7674e1f4b60eef48a2cc5791bd56573f3b6455ed641834), RpcBlock(0x2079be6e78a4f89905370ca1dc565c8647ef14395c8d1faf447ec846cfcf1535), RpcBlock(0x5abac96bcd863bd3072972b0ccf344faa26ff221d19bb7edbf066f57e3278fa8), RpcBlock(0xf15d771e85fbc68774d8927151cb1bf1af0363118dca6ac59c3c567acf310bed), RpcBlock(0x117e49da2567d47cca1b2fa2323d018581be9caa6c2631845b5448abb8079732), RpcBlock(0xb720743e6b5f55272cac3f95218ba0423f140bb384632c70c17091b91fb60c29)] chain=1chain=1batch_id=Epoch(339076) peer=PeerId("16Uiu2HAkytpf5xwHoRu13aCHPmqwFPF4YM5QT2jeGyiiKUc81c3F") chain=1
This is one log inside sync.
The log itself seems fine, but there is a span somewhere that is tagging on the peer-id and then every single block we are requesting. I can't seem to find that span. It's not obvious where it is. There are a few #instrument macros that could be adding fields.
I'm also seeing a log of this form:
Feb 05 14:02:55.408 DEBUG 26cc4449d18b2b484e091f5029e47582c1ec51f7 peer=16Uiu2HAkytpf5xwHoRu13aCHPmqwFPF4YM5QT2jeGyiiKUc81c3F service="network"
Where it looks like a hash is in the message. I'm not sure of its origin.
There is also a case where the std output fills my terminal with:
NonZeroUsizeOption(Some(486250)), NonZeroUsizeOption(Some(766546)), NonZeroUsizeOption(Some(241449)), NonZeroUsizeOption(Some(616444)), NonZeroUsizeOption(Some(868436)), NonZeroUsizeOption(Some(92619)), NonZeroUsizeOption(Some(63582)), NonZeroUsizeOption(Some(807922)), NonZeroUsizeOption(Some(273358)), NonZeroUsizeOption(Some(137276)), NonZeroUsizeOption(Some(108637)), NonZeroUsizeOption(Some(655503)), NonZeroUsizeOption(Some(788179)), NonZeroUsizeOption(Some(354018)), NonZeroUsizeOption(Some(458525)), NonZeroUsizeOption(Some(145836)), NonZeroUsizeOption(Some(881393)), NonZeroUsizeOption(Some(199209)), NonZeroUsizeOption(Some(650785)), NonZeroUsizeOption(Some(1041258)), NonZeroUsizeOption(Some(569716)), NonZeroUsizeOption(Some(687090)), NonZeroUsizeOption(Some(340013)), NonZeroUsizeOption(Some(115777)), NonZeroUsizeOption(Some(625378)), NonZeroUsizeOption(Some(941691)), NonZeroUsizeOption(Some(412690)), NonZeroUsizeOption(Some(826919)), NonZeroUsizeOption(Some(659581)), NonZeroUsizeOption(Some(176106)), NonZeroUsizeOption(Some(85779)), NonZeroUsizeOption(Some(852305)), NonZeroUsizeOption(Some(752133)), NonZeroUsizeOption(Some(340310)), NonZeroUsizeOption(Some(891332)), NonZeroUsizeOption(Some(275368)), NonZeroUsizeOption(Some(71986)), NonZeroUsizeOption(Some(203945)), NonZeroUsizeOption(Some(338833)), NonZeroUsizeOption(Some(98139)), NonZeroUsizeOption(Some(177744)), NonZeroUsizeOption(Some(324729)), NonZeroUsizeOption(Some(936444)), NonZeroUsizeOption(Some(542080)), NonZeroUsizeOption(Some(346708)), NonZeroUsizeOption(Some(252409)), NonZeroUsizeOption(Some(562131)), NonZeroUsizeOption(Some(1008023)), NonZeroUsizeOption(Some(745158)), NonZeroUsizeOption(Some(285852)), NonZeroUsizeOption(Some(1000753)), NonZeroUsizeOption(Some(621902)), NonZeroUsizeOption(Some(753793)), NonZeroUsizeOption(Some(214243)), NonZeroUsizeOption(Some(669579)), NonZeroUsizeOption(Some(520235)), NonZeroUsizeOption(Some(252082)), NonZeroUsizeOption(Some(522888)), NonZeroUsizeOption(Some(511596)), NonZeroUsizeOption(Some(440712)), NonZeroUsizeOption(Some(895612)), NonZeroUsizeOption(Some(89093)), NonZeroUsizeOption(Some(27047)), NonZeroUsizeOption(Some(697106)), NonZeroUsizeOption(Some(169235)), NonZeroUsizeOption(Some(753832)), NonZeroUsizeOption(Some(76558)), NonZeroUsizeOption(Some(477715)), NonZeroUsizeOption(Some(316352)), NonZeroUsizeOption(Some(26361)), NonZeroUsizeOption(Some(448121)), NonZeroUsizeOption(Some(992932)), NonZeroUsizeOption(Some(747399)), NonZeroUsizeOption(Some(313402)), NonZeroUsizeOption(Some(727439)), NonZeroUsizeOption(Some(602914)), NonZeroUsizeOption(Some(440477)), NonZeroUsizeOption(Some(165095)), NonZeroUsizeOption(Some(160094)), NonZeroUsizeOption(Some(847750)),
There's too many to find the original log message. It happens when starting and restoring from the db.
@ThreeHrSleep - There are many such cases of broken logs from what I can see. I've only run this for < 10 seconds.
The way I'm testing is by following this: https://github.com/sigp/lighthouse/tree/stable/testing/network_testing
You can test on your local computer. I'm running on mainnet. The majority of issues appear when you enable debug logging.
i.e lighthouse --debug-level debug.
You might be able to find some of these things and debug faster than me.
This pull request has merge conflicts. Could you please resolve them @ThreeHrSleep? 🙏
This pull request has merge conflicts. Could you please resolve them @ThreeHrSleep? 🙏
It seems that when I had introduced #[instrument] macro usage in a previous commit, I broke some things. Sorry for all the headaches!
- I've tweaked the #[instrument] attribute macro to skip recording all arguments of a calling function/method via skip_all. We may have been recording complex/large data types would could have led to increased memory usage
- I've tweaked the #[instrument] macro with parent = None. Previously, active spans with the same name would reference each other and stack in a "hierarchy of events" tree that tracing manages under the hood. With parent = None, spans with the same name wont reference each other anymore, and are instead treated as root spans. I think we may have had some deeply nested spans that may have potentially contributed to some of our OOM issues.
Testing locally things seem okay now! Going to deploy to one of our holesky nodes soon

