[SPARK-48592][INFRA] Add scala style check for logging message inline variables
What changes were proposed in this pull request?
This PR adds a Scala style check that bans logInfo, logWarning, and logError calls whose messages contain inline variables without an MDC wrapper.
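For illustration, a hedged sketch of the pattern the check is meant to flag, written against the structured-logging `log` interpolator and `MDC` wrapper from `org.apache.spark.internal` (the `LogKeys.TASK_ID` key and the exact rule wording are assumptions for this example; the rule itself lives in the scalastyle config):

```scala
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKeys.TASK_ID

class MyTask extends Logging {
  def finish(taskId: Long): Unit = {
    // Flagged by the new style check: an inlined variable without an MDC wrapper.
    // logInfo(s"Task $taskId finished")

    // Compliant: the variable is explicitly named via an MDC log key.
    logInfo(log"Task ${MDC(TASK_ID, taskId)} finished")
  }
}
```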
Why are the changes needed?
This makes development and PR review of the structured logging migration easier.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manual test; verified that it throws errors on invalid logging messages.
Was this patch authored or co-authored using generative AI tooling?
No
cc @gengliangwang https://issues.apache.org/jira/browse/SPARK-48592
@gengliangwang It might be aggressive to ban all log{Info|Warning|Error} calls with interpolated strings at the current stage, especially without SPARK-47688.
I suppose that the values defined in LogKeys are public APIs, and they should not be removed after 4.0.0 is released. That said, the current list might need polishing; for example, ALPHA, BOOT, and INIT might be too general to represent a concrete concept. They were only exposed because the current implementation requires that every substitutable variable MUST be a LogKey.
Values of LogKeys like *_TIME are also inconsistent: some names carry the unit, but others do not.
I strongly suggest reconsidering SPARK-47688, being conservative about defining LogKeys in the first GA version that delivers this awesome feature, and evaluating each new LogKey carefully in subsequent versions to avoid breaking the public API as much as possible.
also cc @cloud-fan @LuciferYang
@pan3793 I appreciate the feedback and concerns raised regarding the migration of variables in our new structured logging framework for Apache Spark. I'd like to address these concerns and provide some clarity on our approach.
Why Migrate All Variables?
The primary reason for migrating all variables in our log messages is to avoid ongoing debates about which specific keys should be included as log keys. By disallowing the inlining of variables in log messages and ensuring all variables are explicitly named and migrated, we achieve a consistent and comprehensive logging structure. This approach ensures uniformity across all log entries.
Current Status of Migration
The migration process is nearly complete, which means we've already invested significant effort into ensuring that our log entries adhere to the new framework. Moving forward, as long as we follow the new logging rules, we maintain control over all the variables in our logs. This control is crucial for effective log analysis and troubleshooting.
Managing Variable Overload
To address concerns about the potential overload of variables for users, we have a couple of solutions:
- Configuration Allow List: We can build a configuration option to pass an allow list of log keys, enabling users to specify which variables should appear in their logs.
- Partitioned Log Keys: We can partition the log keys so that only essential ones appear by default, while others show up in verbose logging mode. This way, users can choose the level of detail they want to see based on their needs.
Our goal with these changes is to enhance the clarity, consistency, and utility of our logs.
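To make the allow-list idea above concrete, a minimal sketch of what a consumer-side filter could look like. This is hypothetical: the `Map`-based entry shape, the `"context"` field name, and `allowedKeys` are assumptions about the structured JSON layout, not an existing Spark config or API:

```scala
// Hypothetical post-filter over a parsed structured log entry: keep only the
// MDC context entries whose keys appear in a user-supplied allow list.
object LogKeyAllowList {
  def filterContext(
      entry: Map[String, Any],
      allowedKeys: Set[String]): Map[String, Any] =
    entry.get("context") match {
      case Some(ctx: Map[_, _]) =>
        // Retain only approved keys; everything else is dropped from "context".
        val kept = ctx.collect {
          case (k: String, v) if allowedKeys.contains(k) => k -> v
        }
        entry.updated("context", kept)
      case _ => entry
    }
}
```

Whether such filtering happens in the layout, in a log4j filter, or in downstream tooling is a separate design question; the sketch only shows the key-level granularity the allow list would give users.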
@gengliangwang thanks for the summary.
To address concerns about the potential overload of variables for users ...
Major concerns come from public API exposure. Spark is conservative about deleting deprecated public APIs; for example, HiveContext has been marked as deprecated since 2.0 and still lives on. Nearly a thousand LogKeys were added in a short time and exposed as public API.
The primary reason for migrating all variables in our log messages is to avoid ongoing debates about which specific keys should be included as log keys.
IMO the debates are valuable, I do think NOT all variables are suitable for the LogKey concept. Additionally, we may need to tune the original logs to adapt to the new structured logging framework. For example, exposing the TASK_ID to all task-specific logs makes it easy to filter out each task's logs.
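A hedged sketch of the payoff of the TASK_ID example: once task-specific entries carry the key in their MDC context, pulling one task's full log out of the JSON output is a trivial string match. The `"task_id"` field name assumes the JSON layout lower-cases LogKeys; treat it as an assumption, not a stable contract:

```scala
import scala.io.Source

// Hypothetical utility: print only the structured-log lines that carry the
// given task_id in their MDC context.
object TaskLogFilter {
  def main(args: Array[String]): Unit = {
    val Array(logFile, taskId) = args
    Source.fromFile(logFile).getLines()
      .filter(_.contains(s""""task_id":"$taskId""""))
      .foreach(println)
  }
}
```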
Nearly a thousand LogKeys were added in a short time, and were exposed as public API.
LogKey is totally internal. It is under the package org.apache.spark.internal
Additionally, we may need to tune the original logs to adapt to the new structured logging framework. For example, exposing the TASK_ID to all task-specific logs makes it easy to filter out each task's logs.
Yes, we should improve this, but it is orthogonal to what this PR tries to do. With the enforcement of wrapping variables as MDC, we can remind developers to put variables in MDC or verbose MDC in new log entries. Otherwise, new log entries may not use the new framework, and the migration project eventually becomes useless.
LogKey is totally internal.
If so, can we scope LogKeys to private[spark]? That way Mima won't complain if we change keys in the future.
- object LogKeys {
+ private[spark] object LogKeys {
LogKey is totally internal.
If so, can we scope LogKeys to private[spark]? That way Mima won't complain if we change keys in the future.
- object LogKeys {
+ private[spark] object LogKeys {
+1
@pan3793 @LuciferYang sure, I created https://github.com/apache/spark/pull/46983
I am closing this one since we already merged https://github.com/apache/spark/pull/47239