spark
spark copied to clipboard
[SPARK-39970][CORE] Introduce ThrottledLogger to prevent log message flooding caused by network issues
What changes were proposed in this pull request?
This PR implemented a ThrottledLogger, a logger with RateLimiters, to prevent log message flooding caused by network issues. In our ThrottledLogger, we enable users to register prefixes. Each prefix has its RateLimiter, and each prefix has its rate limit strategy (by setting different value of throttlingSeconds). Each RateLimiter will create a permit per second, and printing a message needs to acquire throttlingSeconds permits. For example,
ThrottledLogger tlogger = new ThrottledLogger("ThrottledLogger");
tlogger.registerPrefix("msg", 2); // Printing a message with the prefix "msg" consumes 2 permits.
// will be printed
tlogger.info("msg1");
// The RateLimiter of prefix "msg" does not have enough permits. => will not be printed
tlogger.info("msg2");
// The message which does not match with any registered prefix will always be printed.
tlogger.info("abc");
// Sleep two seconds, and the RateLimiter will get 2 permits.
Thread.sleep(2000);
// The RateLimiter of the prefix "msg" has enough permits. => print
tlogger.info("msg3");
Why are the changes needed?
When transient network error occurs, Spark may write out a large volume of error logs related to the network errors. The excessive logging can create further problems downstream, e.g. blow up our log storage. If we can combine / batch the network errors when they come in a burst, we only print out periodic summaries of errors that are happening repeatedly in a short time window.
Does this PR introduce any user-facing change?
No
How was this patch tested?
build/sbt "network-common/testOnly *ThrottledLoggerSuite"
cc. @Ngone51 @jiangxb1987
Can one of the admins verify this patch?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!