alluxio icon indicating copy to clipboard operation
alluxio copied to clipboard

Add worker decommission phase1

Open bzheng888 opened this issue 3 years ago • 8 comments

What changes are proposed in this pull request?

This pr is related to #13758, mainly complete phase1.

Why are the changes needed?

Offline workers more flexible, no need to wait master detection lost worker interval. For instance, We may encounter situations like we want to remove worker nodes, or we want to offline some worker nodes to save costs, this command is useful.

Does this PR introduce any user facing changes?

Yes

bzheng888 avatar Aug 27 '21 14:08 bzheng888

Codecov Report

Merging #14000 (6a4f2f7) into master (d148ad0) will decrease coverage by 0.24%. The diff coverage is 1.71%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #14000      +/-   ##
============================================
- Coverage     43.22%   42.97%   -0.25%     
+ Complexity     9320     9294      -26     
============================================
  Files          1437     1448      +11     
  Lines         83863    84207     +344     
  Branches      10148    10181      +33     
============================================
- Hits          36247    36192      -55     
- Misses        44626    45004     +378     
- Partials       2990     3011      +21     
Impacted Files Coverage Δ
...o/client/block/RetryHandlingBlockMasterClient.java 21.42% <ø> (ø)
...va/alluxio/client/file/FileSystemMasterClient.java 100.00% <ø> (ø)
...ient/file/RetryHandlingFileSystemMasterClient.java 9.55% <0.00%> (-0.22%) :arrow_down:
...src/main/java/alluxio/util/ConfigurationUtils.java 72.14% <0.00%> (-0.34%) :arrow_down:
...a/alluxio/master/file/DefaultFileSystemMaster.java 63.82% <0.00%> (-0.34%) :arrow_down:
...ter/file/FileSystemMasterClientServiceHandler.java 8.98% <0.00%> (-0.21%) :arrow_down:
...li/fsadmin/command/DecommissionWorkersCommand.java 0.00% <0.00%> (ø)
.../decommissionworkers/StartDecommissionCommand.java 0.00% <0.00%> (ø)
...io/stress/common/AbstractMaxThroughputSummary.java 0.00% <0.00%> (ø)
...ava/alluxio/stress/common/GeneralBenchSummary.java 0.00% <0.00%> (ø)
... and 51 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 0355f12...6a4f2f7. Read the comment docs.

codecov-commenter avatar Aug 27 '21 14:08 codecov-commenter

I left some comments on the issue. @jiacheliu3 @bzheng888 @maobaolong , let's have a separate design review meeting first on the worker decommission feature which is quite core and non-trivial.

apc999 avatar Sep 02 '21 19:09 apc999

https://docs.google.com/document/d/1ErVdZdDF3_qt7yFJiZAQvYoGBHr4uR-FdEOAkOvW4mI/edit?usp=sharing @jiacheliu3 Update the design doc, PTAL!

bzheng888 avatar Oct 19 '21 11:10 bzheng888

https://docs.google.com/document/d/1ErVdZdDF3_qt7yFJiZAQvYoGBHr4uR-FdEOAkOvW4mI/edit?usp=sharing @jiacheliu3 Update the design doc, PTAL!

@bzheng888 Going forward please include the design doc in the description area of your PR, much easier for others to see. Thx.

jiacheliu3 avatar Oct 19 '21 12:10 jiacheliu3

https://docs.google.com/document/d/1ErVdZdDF3_qt7yFJiZAQvYoGBHr4uR-FdEOAkOvW4mI/edit?usp=sharing @jiacheliu3 Update the design doc, PTAL!

@bzheng888 Going forward please include the design doc in the description area of your PR, much easier for others to see. Thx.

Will do

bzheng888 avatar Oct 20 '21 01:10 bzheng888

@maobaolong @jiacheliu3 Please review this pr again.

bzheng888 avatar Oct 22 '21 02:10 bzheng888

As per our discussion today, I'd like to propose several changes to the design implemented in this PR:

  1. config file and persistent excluded_worker list in block master IMO, we should not use a config file to specify which workers to decommission. Instead, what we need is a "args file" to contain the worker hostnames when invoking the command, e.g.
$ alluxio fsadmin decommissionWorkers WORKER1 WORKER2

is equivalent to

$ cat ./to_decommission_workers
WORKER1
WORKER2
$ alluxio fsadmin decommissionWorkers --input-file ./to_decommission_workers

the --input-file option is analogous to --arg-file found in xargs, etc.

  1. decommission status

    We should add a status command to see which workers are in the process of being decommissioned, and which have been decommissioned. These require two additional variables tracking the workers' states. The status command can possibly go to alluxio fsadmin report.

dbw9580 avatar Sep 05 '22 08:09 dbw9580

@dbw9580 Thanks for reviewing this, for you suggestion, I will submit a new pr today.

bzheng888 avatar Sep 06 '22 02:09 bzheng888

Hi, @dbw9580 @Tonytan123 What else can I do with this PR? or this pr is no longer needed?

bzheng888 avatar Nov 02 '22 03:11 bzheng888

@bzheng888 @dbw9580 @jiacheliu3 Look forward to merge this PR, after that we can improve more base on this.

maobaolong avatar Nov 03 '22 11:11 maobaolong

@bzheng888 @maobaolong Sorry, recently we are busy in adding new features and other trifles. This pr 14000 is needed. We have almost implemented the base decommissionWorker cmd in PR 16134, and this PR may be a CLI for the decommissionWorker cmd. I have published the design doc of decommissionWorker Command in pr 16134, comments are welcome.

YichuanSun avatar Nov 10 '22 05:11 YichuanSun

@YichuanSun Thanks for current work, I will continue to complete this pr and try to adapt your design doc and codes, hope we can communicate more in future.

bzheng888 avatar Jan 04 '23 03:01 bzheng888

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Feb 04 '23 15:02 github-actions[bot]

This function is used to remove a live worker with #16134, do not stale it.

YichuanSun avatar Feb 05 '23 01:02 YichuanSun

@YichuanSun @dbw9580 Update this pr base on your design and implementation, can we complete the decommission feature base on this pr as the phase1? Please review this thanks!

bzheng888 avatar Feb 16 '23 13:02 bzheng888

Thanks Bing, I will leave some comments.

YichuanSun avatar Feb 25 '23 01:02 YichuanSun

@jiacheliu3 Hi jiacheng, can you take a look again plz, thx!

bzheng888 avatar Mar 26 '23 14:03 bzheng888

@jiacheliu3 Update, PTAL!

bzheng888 avatar Mar 29 '23 10:03 bzheng888

image

bzheng888 avatar Mar 29 '23 12:03 bzheng888

alluxio-bot, merge this please

jiacheliu3 avatar Mar 30 '23 05:03 jiacheliu3

alluxio-bot, cherry-pick this to branch-2.8 please

jiacheliu3 avatar Apr 26 '23 12:04 jiacheliu3

Auto cherry-pick unsuccessful: Failed to finish cmd "git cherry-pick 50c4c29279c53225d33b02d077964a430314ac19" stderr: <error: could not apply 50c4c29279... Add worker decommission step 1 hint: after resolving the conflicts, mark the corrected paths hint: with 'git add ' or 'git rm ' hint: and commit the result with 'git commit'

stdout: <> err: exit status 1

alluxio-bot avatar Apr 26 '23 12:04 alluxio-bot