spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-39970][CORE] Introduce ThrottledLogger to prevent log message flooding caused by network issues

Open kevin85421 opened this issue 3 years ago • 2 comments
trafficstars

What changes were proposed in this pull request?

This PR implemented a ThrottledLogger, a logger with RateLimiters, to prevent log message flooding caused by network issues. In our ThrottledLogger, we enable users to register prefixes. Each prefix has its RateLimiter, and each prefix has its rate limit strategy (by setting different value of throttlingSeconds). Each RateLimiter will create a permit per second, and printing a message needs to acquire throttlingSeconds permits. For example,

ThrottledLogger tlogger = new ThrottledLogger("ThrottledLogger");
tlogger.registerPrefix("msg", 2); // Printing a message with the prefix "msg" consumes 2 permits.

// will be printed
tlogger.info("msg1"); 

// The RateLimiter of prefix "msg" does not have enough permits. => will not be printed
tlogger.info("msg2"); 

// The message which does not match with any registered prefix will always be printed.
tlogger.info("abc"); 

// Sleep two seconds, and the RateLimiter will get 2 permits.
Thread.sleep(2000);

// The RateLimiter of the prefix "msg" has enough permits. => print
tlogger.info("msg3");

Why are the changes needed?

When transient network error occurs, Spark may write out a large volume of error logs related to the network errors. The excessive logging can create further problems downstream, e.g. blow up our log storage. If we can combine / batch the network errors when they come in a burst, we only print out periodic summaries of errors that are happening repeatedly in a short time window.

Does this PR introduce any user-facing change?

No

How was this patch tested?

build/sbt "network-common/testOnly *ThrottledLoggerSuite"

kevin85421 avatar Aug 04 '22 22:08 kevin85421

cc. @Ngone51 @jiangxb1987

kevin85421 avatar Aug 04 '22 22:08 kevin85421

Can one of the admins verify this patch?

AmplabJenkins avatar Aug 06 '22 02:08 AmplabJenkins

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Nov 15 '22 00:11 github-actions[bot]