alluxio
alluxio copied to clipboard
Add worker decommission phase1
What changes are proposed in this pull request?
This pr is related to #13758, mainly complete phase1.
Why are the changes needed?
Offline workers more flexible, no need to wait master detection lost worker interval. For instance, We may encounter situations like we want to remove worker nodes, or we want to offline some worker nodes to save costs, this command is useful.
Does this PR introduce any user facing changes?
Yes
Codecov Report
Merging #14000 (6a4f2f7) into master (d148ad0) will decrease coverage by
0.24%
. The diff coverage is1.71%
.
@@ Coverage Diff @@
## master #14000 +/- ##
============================================
- Coverage 43.22% 42.97% -0.25%
+ Complexity 9320 9294 -26
============================================
Files 1437 1448 +11
Lines 83863 84207 +344
Branches 10148 10181 +33
============================================
- Hits 36247 36192 -55
- Misses 44626 45004 +378
- Partials 2990 3011 +21
Impacted Files | Coverage Δ | |
---|---|---|
...o/client/block/RetryHandlingBlockMasterClient.java | 21.42% <ø> (ø) |
|
...va/alluxio/client/file/FileSystemMasterClient.java | 100.00% <ø> (ø) |
|
...ient/file/RetryHandlingFileSystemMasterClient.java | 9.55% <0.00%> (-0.22%) |
:arrow_down: |
...src/main/java/alluxio/util/ConfigurationUtils.java | 72.14% <0.00%> (-0.34%) |
:arrow_down: |
...a/alluxio/master/file/DefaultFileSystemMaster.java | 63.82% <0.00%> (-0.34%) |
:arrow_down: |
...ter/file/FileSystemMasterClientServiceHandler.java | 8.98% <0.00%> (-0.21%) |
:arrow_down: |
...li/fsadmin/command/DecommissionWorkersCommand.java | 0.00% <0.00%> (ø) |
|
.../decommissionworkers/StartDecommissionCommand.java | 0.00% <0.00%> (ø) |
|
...io/stress/common/AbstractMaxThroughputSummary.java | 0.00% <0.00%> (ø) |
|
...ava/alluxio/stress/common/GeneralBenchSummary.java | 0.00% <0.00%> (ø) |
|
... and 51 more |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 0355f12...6a4f2f7. Read the comment docs.
I left some comments on the issue. @jiacheliu3 @bzheng888 @maobaolong , let's have a separate design review meeting first on the worker decommission feature which is quite core and non-trivial.
https://docs.google.com/document/d/1ErVdZdDF3_qt7yFJiZAQvYoGBHr4uR-FdEOAkOvW4mI/edit?usp=sharing @jiacheliu3 Update the design doc, PTAL!
https://docs.google.com/document/d/1ErVdZdDF3_qt7yFJiZAQvYoGBHr4uR-FdEOAkOvW4mI/edit?usp=sharing @jiacheliu3 Update the design doc, PTAL!
@bzheng888 Going forward please include the design doc in the description
area of your PR, much easier for others to see. Thx.
https://docs.google.com/document/d/1ErVdZdDF3_qt7yFJiZAQvYoGBHr4uR-FdEOAkOvW4mI/edit?usp=sharing @jiacheliu3 Update the design doc, PTAL!
@bzheng888 Going forward please include the design doc in the
description
area of your PR, much easier for others to see. Thx.
Will do
@maobaolong @jiacheliu3 Please review this pr again.
As per our discussion today, I'd like to propose several changes to the design implemented in this PR:
- config file and persistent
excluded_worker
list in block master IMO, we should not use a config file to specify which workers to decommission. Instead, what we need is a "args file" to contain the worker hostnames when invoking the command, e.g.
$ alluxio fsadmin decommissionWorkers WORKER1 WORKER2
is equivalent to
$ cat ./to_decommission_workers
WORKER1
WORKER2
$ alluxio fsadmin decommissionWorkers --input-file ./to_decommission_workers
the --input-file
option is analogous to --arg-file
found in xargs
, etc.
-
decommission status
We should add a status command to see which workers are in the process of being decommissioned, and which have been decommissioned. These require two additional variables tracking the workers' states. The status command can possibly go to
alluxio fsadmin report
.
@dbw9580 Thanks for reviewing this, for you suggestion, I will submit a new pr today.
Hi, @dbw9580 @Tonytan123 What else can I do with this PR? or this pr is no longer needed?
@bzheng888 @dbw9580 @jiacheliu3 Look forward to merge this PR, after that we can improve more base on this.
@bzheng888 @maobaolong Sorry, recently we are busy in adding new features and other trifles. This pr 14000 is needed. We have almost implemented the base decommissionWorker cmd in PR 16134, and this PR may be a CLI for the decommissionWorker cmd. I have published the design doc of decommissionWorker Command in pr 16134, comments are welcome.
@YichuanSun Thanks for current work, I will continue to complete this pr and try to adapt your design doc and codes, hope we can communicate more in future.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.
This function is used to remove a live worker with #16134, do not stale it.
@YichuanSun @dbw9580 Update this pr base on your design and implementation, can we complete the decommission feature base on this pr as the phase1? Please review this thanks!
Thanks Bing, I will leave some comments.
@jiacheliu3 Hi jiacheng, can you take a look again plz, thx!
@jiacheliu3 Update, PTAL!

alluxio-bot, merge this please
alluxio-bot, cherry-pick this to branch-2.8 please
Auto cherry-pick unsuccessful:
Failed to finish cmd "git cherry-pick 50c4c29279c53225d33b02d077964a430314ac19"
stderr: <error: could not apply 50c4c29279... Add worker decommission step 1
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add
stdout: <> err: exit status 1