workload-collocation-agent

Prometheus rules based wss calculation (and wca/cadvisor specific rules refactor - own directories and better names)

Open ppalucki opened this issue 5 years ago • 1 comments


Breaking change:

All high-level and intermediate metrics, such as app_RESOURCE or app_req, now have these three labels:

app - the app name
app_namespace - apps are distinguished by namespace, so memcached from one namespace is profiled separately from memcached from namespace="bar"
source - the monitoring agent used: wca or cadvisor
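
To make the effect of the three labels concrete, here is a minimal Python sketch that mimics a PromQL-style `sum by (...)` over a handful of samples. Only the label names (app, app_namespace, source) come from this issue; the sample values, the "foo" namespace and the `aggregate` helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical samples as (labels, value) pairs. Label names follow the issue
# text; the values and the "foo" namespace are invented for illustration.
samples = [
    ({"app": "memcached", "app_namespace": "foo", "source": "wca"}, 2.0),
    ({"app": "memcached", "app_namespace": "foo", "source": "wca"}, 3.0),
    ({"app": "memcached", "app_namespace": "bar", "source": "wca"}, 1.5),
    ({"app": "memcached", "app_namespace": "bar", "source": "cadvisor"}, 1.0),
]


def aggregate(samples, by):
    """Sum values per unique combination of the `by` labels."""
    out = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels[name] for name in by)
        out[key] += value
    return dict(out)


# Grouping by app only merges every memcached into one series.
by_app_only = aggregate(samples, by=("app",))

# Grouping by all three labels keeps namespaces and agents apart.
by_three_labels = aggregate(samples, by=("app", "app_namespace", "source"))

print(by_app_only)
print(by_three_labels)
```

With all three labels in the grouping key, the two memcached deployments stay separate series, which is why consumers that previously selected only on app may need their selectors updated.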

Refactoring:

all rules specific to wca or cadvisor were moved into their own directories: wca/cadvisor
there are the following generic types of rules:
- cadvisor/wca app: generates app_RESOURCE directly from metrics provided by wca/cadvisor: only cpu, mem and mbw_flat (wss moved to a separate file)
- cadvisor/wca node: generates node_capacity
- cadvisor/wca node-pmem: mocks raw low-level data for virtual_pmem_node (to generate node_capacity for "virtual_node_pmem")
- cadvisor/wca other: any other metrics used for debugging or just for visualization in grafana (not needed for scheduling/annotation or score)
- NEW rules: cadvisor/wca app-wss: calculates app_wss per app
- generic score: uses data provided by "app" and "node" (and node-pmem mocks) to calculate the score: app_profile
- generic scheduler: uses data provided by "app" and "node" for scheduling or annotation purposes
- generic apm: recalculates metrics provided by "fluentd" to generate apm_ metrics

TODO:

bug:

[ ] node_mbw_write_weight, and thus pod_mbw_write/task_mbw_write, are improperly calculated:
https://github.com/ppalucki/owca/blob/ppalucki/prometheus-rules-based-wss/examples/kubernetes/monitoring/prometheus/wca/prometheusrules.wca-app.yaml#L70
https://github.com/ppalucki/owca/blob/ppalucki/prometheus-rules-based-wss/examples/kubernetes/monitoring/prometheus/cadvisor/prometheusrules.cadvisor-app.yaml#L106
Grouping is done by (node, memory) for wca but by (memory) for cadvisor. For wca, node38 is not configured for 2lm, so we will not get task_mbw_write. On the other hand, pod_mbw_write will work for cadvisor because there is no node configured as 2lm (otherwise we would get a many-to-many (m2m) matching failure); it will match only for 2lm, to virtual_pmem_node. Both solutions are wrong:
- in WCA, the rule only allows profiling applications running on 2lm nodes (and we should be able to profile applications running on DRAM-only systems),
- in cAdvisor, the rule ignores whether the application was running on a 2lm node and always uses virtual_pmem_node (which makes no sense).
[x] support for namespaces for cadvisor
[x] support for namespaces for wca
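
The node_mbw_write_weight item above comes down to which labels the two sides of a binary operation are matched on. The toy Python model below is not the real rules (those are at the linked lines); it only shows how matching on (node, memory) versus (memory) changes which samples survive. The `match` helper, the node name "node-2lm", the task names and all numeric values are hypothetical, and it assumes for simplicity that every bandwidth sample already carries memory="2lm".

```python
def match(left, right, on):
    """Multiply each left sample by the single right sample agreeing on `on`.

    Left samples without a partner are dropped. More than one partner raises
    an error, loosely mirroring a many-to-many matching failure.
    """
    result = []
    for l_labels, l_value in left:
        key = tuple(l_labels[name] for name in on)
        partners = [
            r_value
            for r_labels, r_value in right
            if tuple(r_labels[name] for name in on) == key
        ]
        if len(partners) > 1:
            raise ValueError(f"many-to-many match for {key}")
        if partners:
            result.append((l_labels, l_value * partners[0]))
    return result


# Hypothetical bandwidth samples: task "a" runs on node38 (no 2lm weight),
# task "b" runs on an invented 2lm node called "node-2lm".
bandwidth = [
    ({"task": "a", "node": "node38", "memory": "2lm"}, 100.0),
    ({"task": "b", "node": "node-2lm", "memory": "2lm"}, 200.0),
]

# Hypothetical weight series: one exists only for the 2lm node.
weights = [({"node": "node-2lm", "memory": "2lm"}, 0.5)]

# wca-like matching on (node, memory): task "a" has no partner and vanishes.
wca_like = match(bandwidth, weights, on=("node", "memory"))

# cadvisor-like matching on (memory): every task gets the same weight,
# regardless of the node it actually ran on.
cadvisor_like = match(bandwidth, weights, on=("memory",))

# A second weight with the same memory label breaks the (memory)-only match.
two_weights = weights + [({"node": "node-2lm-2", "memory": "2lm"}, 0.7)]
try:
    match(bandwidth, two_weights, on=("memory",))
except ValueError as err:
    print("error:", err)

print(wca_like)
print(cadvisor_like)
```

In this model the (node, memory) variant silently loses the task on node38, while the (memory) variant works only as long as exactly one weight series exists, which matches the two failure modes described in the bug item.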

optional:

[ ] try to not use kube-state-exporter for cadvisor for labels (still required for resource requests and limits)

Nov 03 '20 14:11 ppalucki

Nov 17 '20 09:11 felidadae