flyteadmin
flyteadmin copied to clipboard
Resource manager overhaul
TL;DR
This change reworks the way task resources are handled in Admin. Currently the behavior:
- For matchable task resources, settings at different levels of specificity completely wipe out each other. That is, if you have a default of
CPU="1"set at the project level for project X andMemory="1Gi"set at the project/domain level (e.g. X-development), then the latter setting wipes out the first one entirely. With this change, both the CPU and Memory setting will be returned. - This merging affect both the matchable resources endpoints (e.g.
http://localhost:8088/api/v1/project_attributes/flytesnacks?resource_type=0) and when workflows are launched with tasks/containers that do not have resources set.
Also,
- Defaults that were previously set in code, have been removed. (Another PR will need to be put in to remove any defaults in the Helm charts/sandbox environment.)
What this PR does not change:
- Limits are still set to requests for all resources. This was done because having memory (and possibly other resource types) defaults not equal to limits results in oversubscription and then possibly strange/inconsistent eviction logic. From K8s docs: "
Guaranteedpods are guaranteed only when requests and limits are specified for all the containers and they are equal." Please see the rejected IDL change for a little more discussion.
Type
- [ ] Bug Fix
- [ ] Feature
- [ ] Plugin
Are all requirements met?
- [ ] Code completed
- [ ] Smoke tested
- [ ] Unit tests added
- [ ] Code documentation added
- [ ] Any pending items have an associated Issue
Complete description
Tracking Issue
https://github.com/flyteorg/flyte/issues/3574 https://github.com/flyteorg/flyte/issues/3467 https://github.com/flyteorg/flyte/issues/3399
Codecov Report
Merging #552 (26ec924) into master (cca9e17) will increase coverage by
0.90%. The diff coverage is50.11%.
:exclamation: Current head 26ec924 differs from pull request most recent head f6fa4ca. Consider uploading reports for the commit f6fa4ca to get more accurate results
@@ Coverage Diff @@
## master #552 +/- ##
==========================================
+ Coverage 58.78% 59.68% +0.90%
==========================================
Files 170 170
Lines 16055 13413 -2642
==========================================
- Hits 9438 8006 -1432
+ Misses 5783 4557 -1226
- Partials 834 850 +16
| Flag | Coverage Δ | |
|---|---|---|
| unittests | ? |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Impacted Files | Coverage Δ | |
|---|---|---|
| pkg/manager/impl/util/shared.go | 69.07% <ø> (+3.55%) |
:arrow_up: |
| pkg/repositories/database.go | 43.33% <ø> (+2.81%) |
:arrow_up: |
| pkg/repositories/gormimpl/resource_repo.go | 56.20% <0.00%> (-7.67%) |
:arrow_down: |
| pkg/rpc/adminservice/base.go | 3.88% <0.00%> (+0.49%) |
:arrow_up: |
| pkg/runtime/application_config_provider.go | 18.18% <0.00%> (-15.16%) |
:arrow_down: |
| pkg/runtime/task_resource_provider.go | 8.69% <0.00%> (-24.64%) |
:arrow_down: |
| pkg/manager/impl/util/resources.go | 50.25% <47.87%> (-40.42%) |
:arrow_down: |
| pkg/manager/impl/resources/resource_manager.go | 64.65% <56.39%> (+0.61%) |
:arrow_up: |
| pkg/workflowengine/impl/prepare_execution.go | 85.88% <70.00%> (+0.88%) |
:arrow_up: |
| pkg/manager/impl/execution_manager.go | 72.26% <92.30%> (+2.25%) |
:arrow_up: |
| ... and 4 more |
Personally I'm not in favor of "merging" these settings across hierarchies and I prefer that those are applied as a whole from the most specific level. While I can also see the appeal of just overriding some settings in higher/more specific levels, IMHO this just makes it more implicit (hence more complicated and harder to understand) with only a small benefit. But in the end this is not so important to me.
Either way the important part is: it needs to be possible to unset specific settings (i.e. cpu limits), either by omission or by explicit null/0.
Also I'm wondering if we should rather leverage k8s limit ranges instead of re-implementing all this in flyteadmin?