yunikorn-k8shim icon indicating copy to clipboard operation
yunikorn-k8shim copied to clipboard

[YUNIKORN-802] Supports to assign nodes to non-default partition

Open chia7712 opened this issue 3 years ago • 10 comments

What is this PR for?

see comment (https://issues.apache.org/jira/browse/YUNIKORN-22?focusedCommentId=17398860&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17398860)

Currently, all nodes are hardcode to be assigned to "default" partition. That brings two disadvantages.

  1. we can't select specify nodes, which are used to execute spark job only, from a cluster
  2. multi-partitions does not work since non-default partition can't get nodes

What type of PR is it?

  • [ ] - Bug Fix
  • [ ] - Improvement
  • [x] - Feature
  • [ ] - Documentation
  • [ ] - Hot Fix
  • [ ] - Refactoring

Todos

  1. support to change partition assignment of existent node (in this PR, the update request will be skipped)
  2. support to remove existent node which had been reassigned (in this PR, removing such node cause error message "Failed to update non existing node ...")

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-802

How should this be tested?

Screenshots (if appropriate)

Questions:

  • [ ] - The licenses files need update.
  • [ ] - There is breaking changes for older versions.
  • [ ] - It needs documentation.

chia7712 avatar Aug 17 '21 14:08 chia7712

Codecov Report

Merging #293 (ce9e427) into master (c47ed51) will increase coverage by 3.51%. The diff coverage is 78.36%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #293      +/-   ##
==========================================
+ Coverage   59.75%   63.27%   +3.51%     
==========================================
  Files          35       37       +2     
  Lines        3133     3455     +322     
==========================================
+ Hits         1872     2186     +314     
- Misses       1180     1182       +2     
- Partials       81       87       +6     
Impacted Files Coverage Δ
pkg/appmgmt/appmgmt_recovery.go 67.50% <0.00%> (-8.18%) :arrow_down:
pkg/cache/amprotocol_mock.go 0.00% <0.00%> (ø)
pkg/cache/context_recovery.go 45.23% <0.00%> (-1.11%) :arrow_down:
pkg/cache/external/scheduler_cache.go 34.56% <0.00%> (+0.40%) :arrow_up:
pkg/common/events/recorder_mock.go 0.00% <0.00%> (ø)
pkg/common/events/states.go 0.00% <0.00%> (ø)
pkg/controller/application/app_controller.go 71.05% <ø> (-0.26%) :arrow_down:
...missioncontrollers/webhook/admission_controller.go 33.74% <0.00%> (+1.00%) :arrow_up:
pkg/shim/main.go 0.00% <ø> (ø)
pkg/common/events/recorder.go 33.33% <40.00%> (+3.33%) :arrow_up:
... and 28 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 0bd9366...ce9e427. Read the comment docs.

codecov[bot] avatar Aug 18 '21 02:08 codecov[bot]

Lots has changed since this was started. I still think this is a good idea to fix. There are other changes going in for YUNIKORN-1090 and YUNIKORN-1091, I think we need to clean that up before we restart this one.

wilfred-s avatar Feb 24 '22 06:02 wilfred-s

Hi. Is there any further progress by any chance? I'm waiting as this feature is very useful and needed.

heumsi avatar May 09 '23 06:05 heumsi

Hi. Is there any further progress by any chance? I'm waiting as this feature is very useful and needed.

hi @chia7712 are you still actively working on this ticket? If not, i can help the work.

lixmgl avatar May 09 '23 15:05 lixmgl

@lixmgl feel free to take over it. thanks :)

chia7712 avatar May 09 '23 15:05 chia7712

@lixmgl feel free to take over it. thanks :)

Will do :)

lixmgl avatar May 09 '23 16:05 lixmgl

This is a very stale PR at this point - I also think we need a design doc on this before proceeding. See comments on https://issues.apache.org/jira/browse/YUNIKORN-22.

craigcondit avatar May 09 '23 17:05 craigcondit

Any progress?

heumsi avatar Sep 20 '23 02:09 heumsi

@heumsi multiple partition support needs an extensive design doc before we can proceed as there are many complications, especially with Kubernetes. As far as I know, the design work has not been completed yet.

craigcondit avatar Sep 20 '23 03:09 craigcondit

Hi @heumsi, in YuniKorn, if we only use labels to group nodes into different partitions and use labels to assign applications to different nodes, I think this can be achieved by nodeSelector. Not sure whether you have other scenarios which must be done with different partitions in YuniKorn. If yes, could you share in the thread? Thank you!

FrankYang0529 avatar Oct 24 '23 12:10 FrankYang0529

Does it make sense to have this PR open? We don't even have SchedulerNode and nodes.go anymore. This change needs a complete rewrite, but as mentioned above, we first need a design doc, which doesn't exist.

I suggest closing this. cc @chia7712

pbacsko avatar May 13 '24 22:05 pbacsko

Does it make sense to have this PR open? We don't even have SchedulerNode and nodes.go anymore. This change needs a complete rewrite, but as mentioned above, we first need a design doc, which doesn't exist.

I suggest closing this. cc @chia7712

I would go so far as to close YUNIKORN-802 and the parent umbrella YUNIKORN-22. This has been open for years, and no one has yet put forth a coherent design for doing partitions in Kubernetes. I'm pretty convinced its not possible without breaking the whole thing.

craigcondit avatar May 13 '24 22:05 craigcondit

@pbacsko @craigcondit agreed. let's me cleanup related tickets :)

chia7712 avatar May 14 '24 06:05 chia7712