
Improve Logging

mbr opened this issue 5 years ago • 4 comments

When running tests to track down bugs, or in production, it would be very beneficial to collect traces, networking logs or simple status messages for analysis. Currently, a mix of logging techniques is employed: for example, the new-style tests mainly use println! debugging and manual log output, while the crate itself already employs logging macros like debug!.

Unifying test logs

The VirtualNet struct currently writes a copy of every networking message into a text file if trace() is enabled, aiding the search for bugs in case of failure. Additional information, such as the random seed used to instantiate the test (for reproducibility) or the internal state of the hbbft algorithm being tested, is something we want to be able to record as well.

If we are to add additional logging, it should all be unified in one facility. Getting rid of the custom packet dump and instead logging each packet through the same facilities that status messages use will hopefully reduce the amount of custom implementation. To retain all the functionality currently in use, the resulting logfile should be easily greppable for the networking logs. As an added benefit, it would be possible to see a packet's path side-by-side with the node's information.

Complexity (println/log/slog)

Three approaches are available, which shall be named println, log and slog:

  • println is the most common way of outputting "logs" in tests, using the println! and eprintln! macros. The advantages are sheer simplicity and good integration with the unit testing framework that ships with Rust; by default, each test collects all output to stdout and stderr and only displays it if the test failed.

  • log uses the log facade together with a suitable logger implementation; env_logger is the typical choice in applications, but file-based logging solutions are also available. The crate uses global macros to record messages, e.g. info!("This is a sample log message"), and most implementations require a small setup function to be called for initialization.

  • slog stands for "structured logging", using the slog crate. It mainly adds scope and structure to the model of log, resulting in different macros with the same naming convention being called: info!(some_log, "Something happened"; "node_id" => self.id). Here, some_log is a specific logger instance (as opposed to the info! macro of log, which knows no such context) and node_id is available as structured data in the logs.

Complexity increases with each; println! does not require any external crates or setup. The log crate requires one additional crate dependency for the library and another for the implementation using it, if it wants to process log messages in any way, as well as a setup function.

slog is the most complex beast, typically requiring not only multiple additional crates and a setup function but also code-level changes, since logger instances should be passed around to create context (as opposed to the log variant, which has a single global logger "instance"). A workaround exists in the form of slog_scope, but it is not advised for use in libraries.
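To make the difference in setup and usage concrete, here is a minimal sketch of both the log and slog variants (assuming recent versions of the log, env_logger, slog and slog-term crates; the drain choice, keys and messages are purely illustrative):

```rust
mod with_log {
    // The `log` facade: a single global logger, free-standing macros.
    pub fn run() {
        // One-time setup; env_logger reads the RUST_LOG environment variable.
        env_logger::init();
        log::info!("node started");
    }
}

mod with_slog {
    use slog::{info, o, Drain};

    // `slog`: an explicit drain and logger instances that carry context.
    pub fn run() {
        let decorator = slog_term::PlainSyncDecorator::new(std::io::stderr());
        let root = slog::Logger::root(
            slog_term::FullFormat::new(decorator).build().fuse(),
            o!("suite" => "example"),
        );

        // Child loggers inherit and extend the structured context.
        let node_log = root.new(o!("node_id" => 3));
        info!(node_log, "node started");
    }
}
```

The log variant only needs the one init call, while the slog variant forces us to decide where the root logger lives and how child loggers reach the code that wants to log.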

Performance considerations

Right now, one successful net_dynamic_honey_badger test produces between 0.5 and 6 megabytes of network log output over a runtime of 0.5 to 2 minutes. This does not include log messages from the DistAlgorithm itself, so an upper bound of 5 megabytes per minute is probably optimistic.

The change to mocktography will yield a speed-up of up to 200x, turning the log output into a potential 1 GB per core (roughly: 5 MB/min × 200 ≈ 1 GB/min), and it will finally be possible to run tests multiple times with different input parameters. Additionally, running on multiple threads (the average hardware thread count among the developers on the project is more than 20 threads per person) can multiply this figure.

All of the logging solutions are somewhat lightweight, although it's possible that, due to the low CPU cost of mocktography, the relative overhead of logging (or even of VirtualNet) might be quite high, even with otherwise fast solutions. In the case of slog, some parts will need to be restructured to pass in, or at least cache, Logger instances, as creating them otherwise involves multiple .clone()s.

Output format

While the output format can easily be changed, we have two options with slog: either outputting a traditional text-based format or outputting structured information. The latter might be handy if there are specialized tools to navigate or filter the output (suggestions welcome), but with JSON we are likely to lose readability and grep'ability, so "traditional" should be the way to go for now.

Possibilities for optimization

Implementing custom drains should be simple and allow for some optimization. One core idea is a "spillover" or spooled log drain, similar to Python's SpooledTemporaryFile. With a buffer size of maybe 500 to 1000 megabytes, this would put our RAM requirements at roughly 1 GB per core, but it could eliminate almost all disk writes if only failed tests' logs are kept. Ideally, a very large BufWriter would suffice.
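A minimal sketch of such a spillover sink, in case we roll it ourselves (SpooledWriter, threshold and discard are made-up names, not an existing API); the resulting io::Write could then back whichever drain or writer the logging setup ends up using:

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::PathBuf;

/// Keeps log output in memory and only spills to `path` once the buffer
/// exceeds `threshold` bytes.
struct SpooledWriter {
    buf: Vec<u8>,
    threshold: usize,
    path: PathBuf,
    file: Option<File>,
}

impl SpooledWriter {
    fn new(path: PathBuf, threshold: usize) -> Self {
        SpooledWriter { buf: Vec::new(), threshold, path, file: None }
    }

    /// Called when a test passes: drop the buffer and remove the on-disk
    /// file, if the log ever spilled.
    fn discard(self) -> io::Result<()> {
        if self.file.is_some() {
            fs::remove_file(&self.path)?;
        }
        Ok(())
    }
}

impl Write for SpooledWriter {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        if let Some(file) = self.file.as_mut() {
            return file.write(data);
        }
        self.buf.extend_from_slice(data);
        if self.buf.len() > self.threshold {
            // Spill: create the file, flush the buffered bytes, switch modes.
            let mut file = File::create(&self.path)?;
            file.write_all(&self.buf)?;
            self.buf.clear();
            self.file = Some(file);
        }
        Ok(data.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        match self.file.as_mut() {
            Some(file) => file.flush(),
            None => Ok(()),
        }
    }
}
```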

Data could also be compressed transparently with a fast algorithm like gzip through a simple stream decorator, likely saving a lot of space (compressing the current networking logs saves almost 97%). Compressing in memory as well would drastically reduce the memory requirements too.
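For the compression part, a stream decorator such as flate2's GzEncoder can wrap the log sink directly; a sketch (the compression level and the BufWriter are arbitrary choices):

```rust
use std::fs::File;
use std::io::{self, BufWriter};

use flate2::write::GzEncoder;
use flate2::Compression;

/// Open a log file whose contents are gzip-compressed on the fly.
/// Call `.finish()` on the returned encoder when logging ends so the
/// gzip trailer is written out.
fn open_compressed_log(path: &str) -> io::Result<GzEncoder<BufWriter<File>>> {
    let file = BufWriter::new(File::create(path)?);
    Ok(GzEncoder::new(file, Compression::fast()))
}
```

The same decorator could wrap the in-memory buffer of the spooled writer sketched above to get the reduced memory footprint.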

Log rotation

Log rotation, that is, filling a log file until it reaches a certain size, then renaming it and continuing in another file, is a method of limiting disk usage, since older logs can periodically be removed.

It is likely not a good fit for us here. For one, we almost always want to look at log messages from a specific test only, so one file per test is much more reasonable. Rotation then only prevents an overflow, and a hard limit should suffice for that. Rotation is much more beneficial when all logs for all tests are collected in a central place, but this would result in interleaved log files that quickly become hard to handle (tests generate upwards of a million entries).

Logs would therefore be removed automatically, unless a test failed. Since compressed logs are a few megabytes each, I think it's reasonable to force the user to manually delete them.

Requirements & the plan

The following features are desirable (if there is something missing here, feel free to chime in):

  1. Persist logging in the case of failed tests, but not successful ones, unless instructed to do so. This is to save disk space when running multiple tests.
  2. If possible, reduce disk space used by logs.
  3. Save each network message of VirtualNet through tracing.
  4. Save each logged message from hbbft, with context (e.g. node ID).
  5. Store all initialization parameters for a test in the logs; in case of a crash/infinite loop, this allows reproducing the test.
  6. (optional) Migrate hbbft to slog.

For now, I would suggest implementing everything on top of slog, realizing the features as follows:

  1. Use a spooled log drain. Upon successful test completion, the resulting buffer is dropped and the (potentially existing) file removed. That way, successful tests do not accumulate on the HD.
  2. Compress everything immediately.
  3. Replace the current packet logging implementation with trace! calls.
  4. Use slog-stdlog and slog_scope to capture hbbft log messages (the scope will have to be changed before each call into hbbft to ensure the context is correct, possibly using slog_atomic); a sketch follows after this list.
  5. Slightly trickier: to ensure these logs are always written, we could either write this info and flush immediately, causing a file to be created (but not necessarily filled) for each test, or append to an additional, smaller log that collects only this kind of information.
  6. Some API changes are likely required for full slog support, mainly passing loggers around. This would make 4. obsolete and could be done earlier to avoid implementing the workaround, but it's also probably the most complicated step in the whole process.
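A rough sketch of items 1 and 4, with the test seed recorded as root-logger context (requirement 5); the sink parameter and keys are placeholders, and in practice the sink would be the spooled, compressing writer sketched earlier:

```rust
use slog::{o, Drain};

/// Install a slog root logger for one test and route the `log` macros used
/// inside hbbft into it. The returned guard must be kept alive for as long
/// as output should be captured.
fn init_test_logging(sink: std::fs::File, seed: u64) -> slog_scope::GlobalLoggerGuard {
    // In the real implementation `sink` would be the spooled, compressing
    // writer rather than a plain file.
    let decorator = slog_term::PlainSyncDecorator::new(sink);
    let drain = slog_term::FullFormat::new(decorator).build().fuse();

    // Record the test's initialization parameters (requirement 5) as context
    // on the root logger, so every entry carries them.
    let root = slog::Logger::root(drain, o!("seed" => seed));

    let guard = slog_scope::set_global_logger(root);
    // Forward messages emitted through the `log` crate (as hbbft does today)
    // into the slog hierarchy.
    slog_stdlog::init().expect("log->slog bridge could not be initialized");
    guard
}
```

Before each call into hbbft, the test could then narrow the scope, e.g. with slog_scope::scope(&slog_scope::logger().new(o!("node_id" => id)), || ...), so that forwarded log messages carry the node's context.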

mbr, Sep 27 '18, 13:09

I think these plans are fine but I can't help digressing just a bit...

Let me just reiterate my position that I think logging within a library (not including the library's tests!) should be done sparingly, if at all. I realize my stance seems a bit severe (though I'm not alone). The thing is that we library authors tend to be lazy. When some non-critical event occurs, our first instinct is simply to throw it somewhere so that we don't have to think about it or deal with it. Logging becomes a catch-all grab bag for these kinds of events.

It's much more important to minimize the amount of noise we generate and instead seriously consider the implications of the various exceptions that we would otherwise log, ideally doing something intelligent about them (handling them or redesigning something). Only if that is absolutely not possible should they be passed back to the caller to be handled (as we do with FaultLog). Any messages other than that should be eliminated.

I'm not saying let's remove logging from hbbft tomorrow but I am saying let's at least start considering moving in that direction. As we begin integration into Parity it's going to become more and more important that we reduce or eliminate our 'environmental impact'.

c0gent, Sep 27 '18, 17:09

All that being said @mbr, logging properly within tests is important, so please don't interpret my post as a criticism of what you're planning to do. The only thing I'd say there is to keep it as simple as possible for others to write tests using whichever logging framework you decide on (though this is probably an argument against slog). The caching plan is great and I fully support that.

c0gent, Sep 27 '18, 17:09

A path to overcoming our (mine in particular) dependence on excessive logging might be to embrace proptest regressions. The problem that logs easily "solve" (modulo your ability to construct succinct regexps) is the availability of intermediate states. It is easy to build tests from the outputs of DistAlgorithm::handle_input and DistAlgorithm::handle_message. However, if some internal bug does not show up in that output, such a test would not find any useful regression. Tests don't have access to the Step while it is being constructed and cannot find bugs in that construction process, which is the bulk of the inner workings of a DistAlgorithm: step.extend or step.extend_with. I still think regressions are applicable in this wider setting, but in order to construct useful regressions we need to leave an execution trace from the functions that mutate steps. When those functions return their output and the trace to the test, the test can treat the trace as a kind of output and perform regressions on the trace as well.
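A rough illustration of that idea, with placeholder names that are not hbbft types: the functions that mutate a step also push onto a trace, which is returned to the test and treated like any other output.

```rust
/// Placeholder for whatever intermediate events we decide to record
/// while a step is being constructed.
#[derive(Debug, Clone, PartialEq)]
enum TraceEvent {
    MessageQueued { target: usize },
    OutputProduced,
    FaultLogged { node: usize },
}

/// A step together with the execution trace left by the functions that
/// built it (step.extend, step.extend_with, ...).
struct Traced<S> {
    step: S,
    trace: Vec<TraceEvent>,
}

impl<S> Traced<S> {
    /// Every mutation of the step leaves a breadcrumb; the test can later
    /// assert on the trace exactly as it asserts on the step's outputs.
    fn record(&mut self, event: TraceEvent) {
        self.trace.push(event);
    }
}
```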

vkomenda, Sep 27 '18, 18:09

That's not a bad idea. It may also be wise to consider redesigning things into small, independently testable components as much as possible, as well as writing tests that properly isolate those components (using test ~~harnesses~~ beds, etc.). Small tests could then more precisely and accurately determine if and where a regression has occurred.

In other words, if there is some potentially fragile process happening within a method or component, figuring out a way to break it down to expose that process by creating a more composable design is probably a good design principle to apply as much as possible.

c0gent, Sep 27 '18, 19:09