dolphinscheduler icon indicating copy to clipboard operation
dolphinscheduler copied to clipboard

[Feature-10498] Mask the password in the log of sqoop task

Open rickchengx opened this issue 2 years ago • 10 comments

Purpose of the pull request

Mask the password in the log of sqoop task.

Currently, there are 2 positions that the log of sqoop task will output the password of mysql: https://github.com/apache/dolphinscheduler/blob/17a9dd25fa0e80b048394f79db130f56eb8ef72f/dolphinscheduler-task-plugin/dolphinscheduler-task-sqoop/src/main/java/org/apache/dolphinscheduler/plugin/task/sqoop/SqoopTask.java#L83

https://github.com/apache/dolphinscheduler/blob/17a9dd25fa0e80b048394f79db130f56eb8ef72f/dolphinscheduler-task-plugin/dolphinscheduler-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/ShellCommandExecutor.java#L116

Brief change log

Verify this pull request

Manually tested.

截屏2022-08-22 15 30 38

rickchengx avatar Aug 22 '22 07:08 rickchengx

related: #10498

EricGao888 avatar Aug 22 '22 08:08 EricGao888

This is a good feature. Could you please add a UT for it? Thanks.

EricGao888 avatar Aug 24 '22 11:08 EricGao888

This is a good feature. Could you please add a UT for it? Thanks.

Hi, @EricGao888, Sorry for the late response. I've added a UT.

rickchengx avatar Aug 31 '22 10:08 rickchengx

Codecov Report

Merging #11589 (6708c75) into dev (64a29c6) will increase coverage by 0.00%. The diff coverage is 44.44%.

@@            Coverage Diff            @@
##                dev   #11589   +/-   ##
=========================================
  Coverage     39.35%   39.36%           
  Complexity     4270     4270           
=========================================
  Files          1067     1067           
  Lines         40118    40121    +3     
  Branches       4606     4605    -1     
=========================================
+ Hits          15790    15794    +4     
+ Misses        22552    22547    -5     
- Partials       1776     1780    +4     
Impacted Files Coverage Δ
.../dolphinscheduler/plugin/task/sqoop/SqoopTask.java 0.00% <0.00%> (ø)
...inscheduler/common/log/SensitiveDataConverter.java 63.15% <50.00%> (ø)
...erver/master/processor/queue/TaskEventService.java 69.64% <0.00%> (-10.72%) :arrow_down:
...dolphinscheduler/remote/future/ResponseFuture.java 81.96% <0.00%> (-1.64%) :arrow_down:
...r/plugin/registry/zookeeper/ZookeeperRegistry.java 50.00% <0.00%> (+6.45%) :arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov-commenter avatar Sep 01 '22 02:09 codecov-commenter

May I ask whether we could do this with some other ways in sqoop task plugin instead of add a specific condition here for sqoop? ShellCommandExecutor is shared across different task plugins, it might not be a good practice and uneasy to maintain if we need to change it every time we need to mask something for a task plugin. The same for line 55-56.

Hi, @EricGao888. Thanks for your suggestions. I have moved the masking logic to a common util class.

Background

There are 2 possible locations that the log may contain the sensitive data (e.g., password). Taking the sqoop task as an example:

  1. The log output in the buildCommand() in SqoopTask.

https://github.com/apache/dolphinscheduler/blob/28469fc9b23094a74cbef83b9990d6fbced20d79/dolphinscheduler-task-plugin/dolphinscheduler-task-sqoop/src/main/java/org/apache/dolphinscheduler/plugin/task/sqoop/SqoopTask.java#L82-L84

  1. The log output in the createCommandFileIfNotExists(String execCommand, String commandFile) in ShellCommandExecutor. We need to mask the sensitive data of execCommand here since the execCommand will be output in the log. But this class is shared by multiple task plugins and how to mask the sensitive data needs to be discussed.

https://github.com/apache/dolphinscheduler/blob/17a9dd25fa0e80b048394f79db130f56eb8ef72f/dolphinscheduler-task-plugin/dolphinscheduler-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/ShellCommandExecutor.java#L115-L116

As for me, there are 2 ways:

1. Mask the sensitive data at the point of outputting the task log and every task plugin applys the same masking logic

  • Encapsulate the logger of AbstractTaskExecutor. Every time the task log is output, the output statement will be matched to see if there is any sensitive data.

https://github.com/apache/dolphinscheduler/blob/ebcffb04aad9db8ec6df1105e4770b187088e701/dolphinscheduler-task-plugin/dolphinscheduler-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/AbstractTaskExecutor.java#L37

  • By default, common sensitive data will be matched, and users can customize the sensitive information that needs to be masked.
image

image

Note that the command after masking will only used to output in the log, so this method will not affect the actual command executed.

Reference: Sensitive data masking of Airflow

2. Each task plugin implements its own masking logic

Refactor the createCommandFileIfNotExists() and add a param String execCommandMasking.

https://github.com/apache/dolphinscheduler/blob/28469fc9b23094a74cbef83b9990d6fbced20d79/dolphinscheduler-task-plugin/dolphinscheduler-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/ShellCommandExecutor.java#L84-L87

Also AbstractCommandExecutor.run(execCommand) needs to be refactored to AbstractCommandExecutor.run(execCommand, execCommandMasking).

https://github.com/apache/dolphinscheduler/blob/28469fc9b23094a74cbef83b9990d6fbced20d79/dolphinscheduler-task-plugin/dolphinscheduler-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/AbstractCommandExecutor.java#L176

By doing this, the process logic of how to mask the sensitive data can be done in each task plugin. But this will modify AbstractCommandExecutor and ShellCommandExecutor, and many task plugins need to be modified and implement their own masking logic.

Below is the list of affected classes:

image

Conclusion

I personally prefer the first way to mask the sensitive data. And I also sent a related discussion email to the mailing list.

So which way do you think is better? Or there are other better ways to do so. Any comments or suggestions are welcome!

rickchengx avatar Sep 05 '22 07:09 rickchengx

We have use SensitiveDataConverter for the whole log to hide the password.

ruanwenjun avatar Sep 07 '22 12:09 ruanwenjun

We have use SensitiveDataConverter for the whole log to hide the password.

Thanks for your suggestion! I'll look into it.

rickchengx avatar Sep 08 '22 03:09 rickchengx

We have use SensitiveDataConverter for the whole log to hide the password.

Hi, @ruanwenjun , It seems that there is a bug in SensitiveDataConverter which will be fixed in #11459

After #11459 is fixed, maybe we can add a pattern in SensitiveDataConverter to mask the password in sqoop task.

rickchengx avatar Sep 13 '22 10:09 rickchengx

Hi, @ruanwenjun , @EricGao888 , I've made changes to my PR as below:

  1. Use SensitiveDataConverter to uniformly mask sensitive information in task logs.
  2. Each task plugin can add its own regular match expressions to SensitiveDataConverter through addMaskPattern(). E.g., Sqoop task adds its own regular expression in init():
SensitiveDataConverter.addMaskPattern(SqoopConstants.SQOOP_PASSWORD_REGEX);

This PR solves the problem of the mysql password in the sqoop task log. If others find that other task types will output sensitive information in the log, they only need to add their own regular expression through addMaskPattern() in the init() of the task plugin.

rickchengx avatar Sep 19 '22 10:09 rickchengx

  1. Use SensitiveDataConverter to uniformly mask sensitive information in task logs.
  2. Each task plugin can add its own regular match expressions to SensitiveDataConverter through addMaskPattern(). E.g., Sqoop task adds its own regular expression in init():
SensitiveDataConverter.addMaskPattern(SqoopConstants.SQOOP_PASSWORD_REGEX);

This PR solves the problem of the mysql password in the sqoop task log. If others find that other task types will output sensitive information in the log, they only need to add their own regular expression through addMaskPattern() in the init() of the task plugin.

Hi, @ruanwenjun , could you please help review this? I've rebased to remove conflicts.

rickchengx avatar Oct 13 '22 07:10 rickchengx

Hi, @ruanwenjun , @EricGao888 , I've made changes to my PR as below:

  1. Use SensitiveDataConverter to uniformly mask sensitive information in task logs.
  2. Each task plugin can add its own regular match expressions to SensitiveDataConverter through addMaskPattern(). E.g., Sqoop task adds its own regular expression in init():
SensitiveDataConverter.addMaskPattern(SqoopConstants.SQOOP_PASSWORD_REGEX);

This PR solves the problem of the mysql password in the sqoop task log. If others find that other task types will output sensitive information in the log, they only need to add their own regular expression through addMaskPattern() in the init() of the task plugin.

Hi, @EricGao888 @caishunfeng , could you please help review this?

rickchengx avatar Nov 07 '22 03:11 rickchengx

Hi, @caishunfeng , I've moved SensitiveDataConverter to dolphinscheduler-common module.

cc @zhongjiajie

rickchengx avatar Nov 24 '22 06:11 rickchengx