workload-collocation-agent
Prometheus rules based wss calculation (and wca/cadvisor specific rules refactor - own directories and better names)
Breaking change:
- All high-level and intermediate metrics, such as app_RESOURCE or app_req, now carry these three labels:
  - app: the app name
  - app_namespace: the app's namespace; apps are distinguished by namespace, so memcached from one namespace is profiled separately from memcached from namespace="bar"
  - source: the monitoring agent used, wca or cadvisor
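The effect of the three labels can be sketched in a few lines of Python. This is only an illustration, not part of the rules: the `aggregate` helper, the sample values, and the `foo` namespace are assumptions made up for the example. It mimics a PromQL `sum by (...)` to show that series which differ only in `app_namespace` or `source` stay separate.

```python
from collections import defaultdict

# Hypothetical samples as (labels, value) pairs; all values are invented.
samples = [
    ({"app": "memcached", "app_namespace": "foo", "source": "wca"}, 2.0),
    ({"app": "memcached", "app_namespace": "bar", "source": "wca"}, 3.0),
    ({"app": "memcached", "app_namespace": "bar", "source": "cadvisor"}, 3.5),
    ({"app": "memcached", "app_namespace": "bar", "source": "wca"}, 1.0),
]


def aggregate(samples, by):
    """Sum values per distinct combination of the `by` labels,
    similar in spirit to PromQL `sum by (...)`."""
    out = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels[name] for name in by)
        out[key] += value
    return dict(out)


# Grouping by all three labels keeps namespaces and agents apart.
per_app = aggregate(samples, by=("app", "app_namespace", "source"))

# Grouping by app alone merges everything into one series.
merged = aggregate(samples, by=("app",))
```

Any query or dashboard that previously selected only on the app name would now need to account for the two extra labels, which is why this is listed as a breaking change.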
Refactoring:
- all rules specific to wca or cadvisor were moved into their own directories: wca/cadvisor
- the rules now fall into the following generic types:
  - cadvisor/wca app: generates app_RESOURCE directly from metrics provided by wca/cadvisor: only cpu, mem and mbw_flat (wss moved to a separate file)
  - cadvisor/wca node: generates node_capacity
  - cadvisor/wca node-pmem: mocks the raw low-level data for virtual_pmem_node (to generate node_capacity for "virtual_node_pmem")
  - cadvisor/wca other: any other metrics used for debugging or just for visualization in grafana (not needed for scheduling/annotation or score)
  - NEW rules: cadvisor/wca app-wss: calculates per-app app_wss
  - generic score: uses data provided by "app" and "node" (and the node-pmem mocks) to calculate the score: app_profile
  - generic scheduler: uses data provided by "app" and "node" for scheduling or annotation purposes
  - generic apm: recalculates metrics provided by "fluentd" to generate apm_ metrics
TODO:
bug:
- [ ] node_mbw_write_weight, and thus pod_mbw_write/task_mbw_write, are calculated incorrectly:
  - https://github.com/ppalucki/owca/blob/ppalucki/prometheus-rules-based-wss/examples/kubernetes/monitoring/prometheus/wca/prometheusrules.wca-app.yaml#L70
  - https://github.com/ppalucki/owca/blob/ppalucki/prometheus-rules-based-wss/examples/kubernetes/monitoring/prometheus/cadvisor/prometheusrules.cadvisor-app.yaml#L106

  Grouping is done by (node, memory) for wca but by (memory) for cadvisor. For wca, node38 is not configured for 2lm, so we will not get task_mbw_write. On the other hand, pod_mbw_write will work for cadvisor because no node is configured as 2lm (or we get an m2m failure); it will match only for 2lm, against virtual_pmem_node. Both solutions are wrong:
  - in WCA, the rule only allows profiling applications that run on 2lm nodes (and we should also be able to profile applications running on DRAM-only systems),
  - in cAdvisor, the rule ignores whether the application was running on a 2lm node and always uses virtual_pmem_node (which makes no sense),
- [x] support for namespaces for cadvisor
- [x] support for namespaces for wca
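The grouping mismatch in the first bug item can be illustrated with a small Python model of label matching. This is a simplified sketch under assumptions, not the actual rules: the `match_on` helper, the label sets, and the idea that only virtual_pmem_node carries a 2lm weight are invented to mirror the description above.

```python
def match_on(left, right, on):
    """Pair each left sample with the right sample that has equal `on` labels.

    Left samples without a partner are dropped, loosely modelling
    PromQL vector matching. Assumes at most one right sample per key.
    """
    index = {tuple(r[name] for name in on): r for r in right}
    joined = []
    for sample in left:
        key = tuple(sample[name] for name in on)
        if key in index:
            # Record which node's weight the app ended up matched with.
            joined.append((sample["app"], index[key]["node"]))
    return joined


# Hypothetical inputs: the only 2lm weight belongs to the mocked node.
weights = [{"node": "virtual_pmem_node", "memory": "2lm"}]
apps = [{"app": "memcached", "node": "node38", "memory": "2lm"}]

# wca-style grouping by (node, memory): node38 has no weight, so the app drops out.
wca_result = match_on(apps, weights, on=("node", "memory"))

# cadvisor-style grouping by (memory): the app always lands on virtual_pmem_node.
cadvisor_result = match_on(apps, weights, on=("memory",))
```

In this model the first grouping silently loses the application, while the second attaches it to a node it never ran on, which corresponds to the two wrong outcomes listed in the bug.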
optional
- [ ] try not to use kube-state-exporter for cadvisor labels (it is still required for resource requests and limits)
- Why did you change the history length from 7d to 1h?