flyteadmin icon indicating copy to clipboard operation
flyteadmin copied to clipboard

Resource manager overhaul

Open wild-endeavor opened this issue 2 years ago • 3 comments

TL;DR

This change reworks the way task resources are handled in Admin. Currently the behavior:

  • For matchable task resources, settings at different levels of specificity completely wipe out each other. That is, if you have a default of CPU="1" set at the project level for project X and Memory="1Gi" set at the project/domain level (e.g. X-development), then the latter setting wipes out the first one entirely. With this change, both the CPU and Memory setting will be returned.
  • This merging affect both the matchable resources endpoints (e.g. http://localhost:8088/api/v1/project_attributes/flytesnacks?resource_type=0) and when workflows are launched with tasks/containers that do not have resources set.

Also,

  • Defaults that were previously set in code, have been removed. (Another PR will need to be put in to remove any defaults in the Helm charts/sandbox environment.)

What this PR does not change:

  • Limits are still set to requests for all resources. This was done because having memory (and possibly other resource types) defaults not equal to limits results in oversubscription and then possibly strange/inconsistent eviction logic. From K8s docs: "Guaranteed pods are guaranteed only when requests and limits are specified for all the containers and they are equal." Please see the rejected IDL change for a little more discussion.

Type

  • [ ] Bug Fix
  • [ ] Feature
  • [ ] Plugin

Are all requirements met?

  • [ ] Code completed
  • [ ] Smoke tested
  • [ ] Unit tests added
  • [ ] Code documentation added
  • [ ] Any pending items have an associated Issue

Complete description

Tracking Issue

https://github.com/flyteorg/flyte/issues/3574 https://github.com/flyteorg/flyte/issues/3467 https://github.com/flyteorg/flyte/issues/3399

wild-endeavor avatar Apr 14 '23 00:04 wild-endeavor

Codecov Report

Merging #552 (26ec924) into master (cca9e17) will increase coverage by 0.90%. The diff coverage is 50.11%.

:exclamation: Current head 26ec924 differs from pull request most recent head f6fa4ca. Consider uploading reports for the commit f6fa4ca to get more accurate results

@@            Coverage Diff             @@
##           master     #552      +/-   ##
==========================================
+ Coverage   58.78%   59.68%   +0.90%     
==========================================
  Files         170      170              
  Lines       16055    13413    -2642     
==========================================
- Hits         9438     8006    -1432     
+ Misses       5783     4557    -1226     
- Partials      834      850      +16     
Flag Coverage Δ
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/manager/impl/util/shared.go 69.07% <ø> (+3.55%) :arrow_up:
pkg/repositories/database.go 43.33% <ø> (+2.81%) :arrow_up:
pkg/repositories/gormimpl/resource_repo.go 56.20% <0.00%> (-7.67%) :arrow_down:
pkg/rpc/adminservice/base.go 3.88% <0.00%> (+0.49%) :arrow_up:
pkg/runtime/application_config_provider.go 18.18% <0.00%> (-15.16%) :arrow_down:
pkg/runtime/task_resource_provider.go 8.69% <0.00%> (-24.64%) :arrow_down:
pkg/manager/impl/util/resources.go 50.25% <47.87%> (-40.42%) :arrow_down:
pkg/manager/impl/resources/resource_manager.go 64.65% <56.39%> (+0.61%) :arrow_up:
pkg/workflowengine/impl/prepare_execution.go 85.88% <70.00%> (+0.88%) :arrow_up:
pkg/manager/impl/execution_manager.go 72.26% <92.30%> (+2.25%) :arrow_up:
... and 4 more

... and 144 files with indirect coverage changes

codecov[bot] avatar May 03 '23 22:05 codecov[bot]

Personally I'm not in favor of "merging" these settings across hierarchies and I prefer that those are applied as a whole from the most specific level. While I can also see the appeal of just overriding some settings in higher/more specific levels, IMHO this just makes it more implicit (hence more complicated and harder to understand) with only a small benefit. But in the end this is not so important to me.

Either way the important part is: it needs to be possible to unset specific settings (i.e. cpu limits), either by omission or by explicit null/0.

flixr avatar May 05 '23 10:05 flixr

Also I'm wondering if we should rather leverage k8s limit ranges instead of re-implementing all this in flyteadmin?

flixr avatar May 05 '23 12:05 flixr