apis: add MetricPrediction crd
Ⅰ. Describe what this PR does
Define the MetricPrediction CRD for recommendation and prediction.

The following YAML defines a MetricPrediction called metricprediction-sample. The metricprediction-sample spec claims that it needs the resource prediction for a workload:

- the prediction is at container level for a Deployment called nginx
- the metric types are cpu and memory, collected from metrics-server
- it uses the distribution profiler, which computes statistics over the historical usage

The metricprediction-sample status returns the cpu and memory profiling results for all containers of the nginx workload.
```yaml
apiVersion: analysis.koordinator.sh/v1alpha1
kind: MetricPrediction
metadata:
  name: metricprediction-sample
  namespace: default
spec:
  target:
    type: workload
    workload:
      apiVersion: apps/v1
      kind: Deployment
      name: nginx
  hierarchy:
    level: container
  metric:
    source: metricServer
    metricServer:
      resources: [cpu, memory]
  profilers:
    - name: recommendation-sample
      model: distribution
      distribution:
        # args
status:
  results:
    - profilerName: recommendation-sample
      model: distribution
      distributionResult:
        items:
          - id:
              level: container
              name: nginx-container
            resources:
              - name: cpu
                avg: 6850m
                quantiles:
                  # ...
                  p95: 7950m
                  p99: 8900m
                stdDev: 759m
                firstSampleTime: 2024-01-29T07:15:56Z
                lastSampleTime: 2024-01-30T07:15:56Z
                totalSamplesCount: 10000
                updateTime: 2024-01-30T07:16:56Z
                conditions: []
              - name: memory
                avg: 1000Mi
                quantiles:
                  # ...
                  p95: 1100Mi
                  p99: 1200Mi
                stdDev: 100Mi
                firstSampleTime: 2024-01-29T07:15:56Z
                lastSampleTime: 2024-01-30T07:15:56Z
                totalSamplesCount: 10000
                updateTime: 2024-01-30T07:16:56Z
                conditions: []
```
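For orientation, here is a minimal Go sketch of the API shapes implied by the manifest above; the struct and field names are inferred from the YAML and are not necessarily the exact types added in this PR.

```go
// Sketch only: names inferred from the sample manifest, not authoritative.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// MetricPredictionSpec declares what to predict and how.
type MetricPredictionSpec struct {
	Target    PredictionTarget `json:"target"`
	Hierarchy Hierarchy        `json:"hierarchy,omitempty"`
	Metric    MetricSpec       `json:"metric"`
	Profilers []Profiler       `json:"profilers"`
}

// PredictionTarget points at the object whose metrics are profiled.
type PredictionTarget struct {
	Type     string             `json:"type"` // e.g. "workload"
	Workload *WorkloadReference `json:"workload,omitempty"`
}

type WorkloadReference struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	Name       string `json:"name"`
}

// Hierarchy selects the aggregation level, e.g. "container".
type Hierarchy struct {
	Level string `json:"level"`
}

type MetricSpec struct {
	Source       string            `json:"source"` // e.g. "metricServer"
	MetricServer *MetricServerSpec `json:"metricServer,omitempty"`
}

type MetricServerSpec struct {
	Resources []string `json:"resources"` // e.g. [cpu, memory]
}

type Profiler struct {
	Name         string            `json:"name"`
	Model        string            `json:"model"` // e.g. "distribution"
	Distribution map[string]string `json:"distribution,omitempty"` // model args
}

// ResourceDistribution is one resource entry of a distributionResult item
// in the status.
type ResourceDistribution struct {
	Name              string            `json:"name"` // cpu or memory
	Avg               string            `json:"avg"`
	Quantiles         map[string]string `json:"quantiles"` // e.g. p95, p99
	StdDev            string            `json:"stdDev"`
	FirstSampleTime   metav1.Time       `json:"firstSampleTime"`
	LastSampleTime    metav1.Time       `json:"lastSampleTime"`
	TotalSamplesCount int64             `json:"totalSamplesCount"`
	UpdateTime        metav1.Time       `json:"updateTime"`
}
```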
Ⅱ. Does this pull request fix one issue?
More information can be found in #1880.
Ⅲ. Describe how to verify it
Ⅳ. Special notes for reviews
**Integrating with the Metric Prediction Framework**

The Metric Prediction framework is a kind of "deep module", providing algorithms and prediction models in the backend. Multiple profilers can be built with Metric Prediction as a foundation. Here are some scenarios showing how the framework can be used.
- **Resource Recommender for Workload**: The spec of a Recommendation defines that it needs the recommended resources (CPU and memory) for a Deployment named nginx-sample, and the recommendResources field in the status shows the result for each container.
```yaml
apiVersion: analysis.koordinator.sh/v1alpha1
kind: Recommendation
metadata:
  name: recommendation-sample
  namespace: recommender-sample
spec:
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-sample
status:
  recommendResources:
    containerRecommendations:
      - containerName: nginx-container
        target:
          cpu: 4742m
          memory: 262144k
```
The recommendation is calculated from quantile values of historical metrics. Using Metric Prediction as the profiling model, the requirement of recommendation-sample can be expressed as a MetricPrediction. For different kinds of workloads, the recommender can select a specific quantile value from the MetricPrediction, for example p95 for a Deployment and the average for a Job, then increase it with a 10–15% margin for safety (a sketch of this step follows the manifest below).
```yaml
apiVersion: analysis.koordinator.sh/v1alpha1
kind: MetricPrediction
metadata:
  name: metricprediction-sample
  namespace: default
spec:
  target:
    type: workload
    workload:
      apiVersion: apps/v1
      kind: Deployment
      name: nginx-sample
  hierarchy:
    level: container
  metric:
    source: metricServer
    metricServer:
      resources: [cpu, memory]
  profilers:
    - name: recommendation-sample
      model: distribution
      distribution:
        # args
```
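A minimal Go sketch of the quantile-plus-margin step described above; the p95/average selection and the fixed 15% margin are illustrative assumptions, not values defined by this PR.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// recommendTarget picks a base quantile by workload kind and pads it with a
// safety margin. The quantile choice and the fixed 15% margin are
// illustrative assumptions, not values fixed by this PR.
func recommendTarget(kind string, avg, p95 resource.Quantity) resource.Quantity {
	base := p95 // long-running services: protect the tail
	if kind == "Job" {
		base = avg // batch jobs tolerate throttling: use the mean
	}
	// Scale by 1.15 in milli-units to stay in integer arithmetic.
	return *resource.NewMilliQuantity(base.MilliValue()*115/100, base.Format)
}

func main() {
	avg, p95 := resource.MustParse("4123m"), resource.MustParse("4742m")
	rec := recommendTarget("Deployment", avg, p95)
	fmt.Println(rec.String()) // 5453m
}
```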
- **Hotspot Prediction by Time-series Metrics**: Pod orchestration on a node varies over time, and each pod has its own cycle of resource usage. The NodeQoS CR below describes the usage prediction derived from workload metric predictions based on time series.
```yaml
apiVersion: analysis.koordinator.sh/v1alpha1
kind: NodeQoS
metadata:
  name: node-sample
spec:
  usagePredictionPolicy: workloadByTime
status:
  usageOverTime:
    - timeWindow: "0~1" # 0~1 hour
      max:
        cpu: 6039m
        memory: 18594k
      average:
        cpu: 4028m
        memory: 15782k
      p95:
        cpu: 5731m
        memory: 18043k
    - timeWindow: "1~2" # 1~2 hour
      max:
        cpu: 6039m
        memory: 18594k
      average:
        cpu: 4028m
        memory: 15782k
      p95:
        cpu: 5731m
        memory: 18043k
```
The usageOverTime result in NodeQoS is aggregated from the MetricPrediction of all workloads currently running on the node, so that the descheduler can check whether any nodes will be overloaded in the near future and rebalance some pods to other nodes (a sketch of this aggregation follows the manifest below).
```yaml
apiVersion: analysis.koordinator.sh/v1alpha1
kind: MetricPrediction
metadata:
  name: metricprediction-sample
  namespace: default
spec:
  target: # workload
  metric:
    source: metricServer
    metricServer:
      resources: [cpu, memory]
    prometheus:
      - resource: memoryBandwidth
        name: container_memory_bandwidth
  profilers:
    - name: timeseries-sample
      model: timeseries-trend
      timeseries-trend:
        # args
```
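A minimal Go sketch of the aggregation from workload predictions into per-window node usage; the plain summation of per-window p95 values is an illustrative assumption (a real aggregator would combine quantiles more carefully).

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// windowUsage maps an hourly time window (e.g. "0~1") to a workload's
// predicted CPU p95 in that window.
type windowUsage map[string]resource.Quantity

// aggregateNodeP95 sums the per-window p95 of every workload placed on the
// node. Summing quantiles is only a rough upper bound; it is used here just
// to show the data flow from MetricPrediction results into NodeQoS status.
func aggregateNodeP95(workloads []windowUsage) windowUsage {
	node := windowUsage{}
	for _, w := range workloads {
		for window, q := range w {
			total, ok := node[window]
			if !ok {
				total = *resource.NewMilliQuantity(0, q.Format)
			}
			total.Add(q)
			node[window] = total
		}
	}
	return node
}

func main() {
	a := windowUsage{"0~1": resource.MustParse("2000m"), "1~2": resource.MustParse("1500m")}
	b := windowUsage{"0~1": resource.MustParse("3731m"), "1~2": resource.MustParse("4231m")}
	node := aggregateNodeP95([]windowUsage{a, b})
	for _, w := range []string{"0~1", "1~2"} {
		q := node[w]
		fmt.Printf("%s hour: cpu p95 %s\n", w, q.String())
	}
}
```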
- **Interference Detection for Workload Outliers**: A Pod may suffer interference at runtime due to resource contention on the node, which can be analyzed through CPI, PSI, CPU scheduling latency, etc. Specify an algorithm such as OCSVM in the MetricPrediction, and the resulting model will be available in the status.
```yaml
apiVersion: analysis.koordinator.sh/v1alpha1
kind: MetricPrediction
metadata:
  name: metricprediction-sample
  namespace: default
spec:
  target: # workload
  metric:
    prometheus:
      - resource: cpi
        name: container_cpi
      - resource: psi_cpu
        name: container_psi_cpu
      - resource: csl
        name: container_cpu_scheduling_latency
  profilers:
    - name: interference-sample
      model: OCSVM
      ocsvm:
        # args
```
The Interference Manager will parse the corresponding model of each workload and send it to koordlet. koordlet will execute QoS strategies once it finds that a pod is an outlier according to its recent metrics.
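A minimal Go sketch of the koordlet-side outlier check described above, assuming the delivered model reduces to a scoring function plus a threshold (for an OCSVM, the score would be the signed distance to the learned boundary); everything here is an illustrative assumption.

```go
package main

import "fmt"

// interferenceModel is a stand-in for the profiler output sent to koordlet:
// a scoring function over recent metrics plus a decision threshold.
type interferenceModel struct {
	score     func(sample []float64) float64
	threshold float64
}

// isOutlier flags a pod whose recent metrics fall outside the model's
// learned "normal" region; koordlet would then apply QoS strategies.
func (m interferenceModel) isOutlier(sample []float64) bool {
	return m.score(sample) < m.threshold
}

func main() {
	// Toy model: score drops as CPI rises above a learned baseline of 1.2.
	m := interferenceModel{
		score:     func(s []float64) float64 { return 1.2 - s[0] },
		threshold: -0.5,
	}
	fmt.Println(m.isOutlier([]float64{1.3})) // false: mild CPI increase
	fmt.Println(m.isOutlier([]float64{2.0})) // true: likely interference
}
```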
Ⅴ. Checklist
- [ ] I have written necessary docs and comments
- [ ] I have added necessary unit tests and integration tests
- [ ] All checks passed in `make test`
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 67.54%. Comparing base (07e51fa) to head (4531674). Report is 117 commits behind head on main.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #1875      +/-   ##
==========================================
+ Coverage   67.23%   67.54%   +0.30%
==========================================
  Files         410      413       +3
  Lines       45662    46072     +410
==========================================
+ Hits        30702    31120     +418
+ Misses      12742    12696      -46
- Partials     2218     2256      +38
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 67.54% <ø> (+0.30%) | :arrow_up: |
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
To complete the pull request process, please assign hormes after the PR has been reviewed. You can assign the PR to them by writing `/assign @hormes` in a comment when ready.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.
typo: `mericprediction` -> `metricprediction`
Add some user stories to help understand how the API is used
corresonding
Updated with more user stories.
In the case where there is another layer in the usage scenario mentioned earlier, does MetricPrediction need to be a CRD?
> In the case where there is another layer in the usage scenario mentioned earlier, does MetricPrediction need to be a CRD?

@hormes The Recommendation controller in Koordinator does not need to create a MetricPrediction CR in the APIServer, which means MetricPrediction is an internal protocol in this scenario: the Recommendation CR is converted to an internal MetricPrediction object for the framework (a minimal sketch of this conversion follows).
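A minimal Go sketch of that conversion with hypothetical trimmed-down types; nothing here is the actual controller code, and the assumed defaults are only for illustration.

```go
package main

import "fmt"

// Hypothetical trimmed-down shapes of the two objects; the real types live
// in the analysis.koordinator.sh API group.
type WorkloadRef struct{ APIVersion, Kind, Name string }

type Recommendation struct {
	Namespace   string
	WorkloadRef WorkloadRef
}

type internalPrediction struct {
	Target    WorkloadRef
	Resources []string
	Model     string
}

// toInternalPrediction converts a Recommendation CR into the in-memory
// MetricPrediction the framework consumes, so no MetricPrediction CR is
// ever written to the APIServer in this scenario.
func toInternalPrediction(r Recommendation) internalPrediction {
	return internalPrediction{
		Target:    r.WorkloadRef,
		Resources: []string{"cpu", "memory"}, // assumed defaults
		Model:     "distribution",
	}
}

func main() {
	r := Recommendation{
		Namespace:   "recommender-sample",
		WorkloadRef: WorkloadRef{"apps/v1", "Deployment", "nginx-sample"},
	}
	fmt.Printf("%+v\n", toInternalPrediction(r))
}
```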
In the following scenarios a MetricPrediction CR will be created:
- An external controller wants to use the prediction module of Koordinator; the MetricPrediction CRD then acts as an API between the external controller and Koordinator.
- Before developing a new profiler controller, a MetricPrediction can be created for experiments and demos ahead of the implementation. For example, we may need to compare whether to use the ARIMA or Prophet algorithm in the NodeQoS controller.
First we will support the usage scenarios above, and the development will take two steps:
- MetricPrediction framework with the `Distribution` model, using the resource prediction scenario to verify that the framework works well. The framework can then be extended with more algorithm models such as interference detection.
- Recommendation controller based on the MetricPrediction framework, considering workload type (Job/Service), OOM events, etc.
New changes are detected. LGTM label has been removed.
/hold until we have implemented the first user story
This issue has been automatically marked as stale because it has not had recent activity. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Close this issue or PR with `/close`

Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, the issue is closed

You can:
- Reopen this PR with `/reopen`

Thank you for your contributions.