pcm icon indicating copy to clipboard operation
pcm copied to clipboard

Helm chart for pcm-sensor-server

Open ppalucki opened this issue 10 months ago • 6 comments

Initial version of Helm chart for deploying PCM

Features include:

  • Two methods of collecting metrics (indirect the default, by using Linux interfaces perf/resctrl) or direct by accessing msr registers directly through linux msr module,
  • Support for bare-metal (full set of metrics) and VM cloud instances (limited set of metrics),
  • Optional integration with Prometheus opeartor, node-feature-discovery and NRI balloons policy plugin,
  • Other described here: https://github.com/ppalucki/pcm/blob/ppalucki/helm/deployment/pcm/README.md

TODO:

Must have:

  • [x] consider using CAP_PERFMON (kernel 5.8+) instead of SYS_ADMIN with default perf_event_paranoid=-2 (SYS_ADMIN is just for backward compatibility) more info here
  • [x] direct method (with msr access) bug and privilaged (described in alternative methods) on kernels 5.9+ /sys/modules/msr/parameters/allow_writes is not accessible (msr.cpp doesn't use tryOpen (nor readSys/writeSys) with /pcm prefix ends up with non-critical warning
  • [x] Resolve issue with mounting pcm/resctrl (EDITED: reason some missconfiguration with kind or sys or conflict with base_runtime_spec, docker still doesnt work)
  • [x] Check if energy metrics can be accessible through perf subsystem (so far no available for indirect method but perf list | grep -i energy returns power/energy-pkg/ and power/energy-ram/ possible development to PCM required - are available through perf subsystem perf_even_open syscall and enumerating/discovering /sys/devices/power/events PMU type=9 or through power cap framework similary to node_exporter/rapl collector, reading data directly from sysfs like this cat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj
  • [x] Reorganize README to move optional feature down below (or separate documents)
  • [x] Optional vertical pod autoscaler and proper resource sizing (cpu/mem requests/limits)
  • [x] automation for e2e and linter helm chart,
  • [x] Implement Helm chart test pods
  • [ ] Expose metrics with TLS by default (node-exporter can use 3party proxy example see example
  • [ ] Remove some debug code (debugPcm/debugSleep/prints, extra testing packages from docker image)
  • [ ] clean up unused mounts sysfcMount (we already has sys mounted - just to be sure it is usable)
  • [ ] Fix NOTES (WIP)
  • [ ] Docs: consider documentation values in a form of table (e.g. https://artifacthub.io/packages/helm/bitnami/prometheus#parameters or https://artifacthub.io/packages/helm/bitnami/node-exporter#parameters) or at least something similar to https://artifacthub.io/packages/helm/prometheus-community/prometheus#configuration
  • [ ] "Broken pipe, Socket operation on non-socket" errors when PodMonitor is enabled (ulimit/permissions?!)
  • [ ] Grafana panel automatically deployed (currently only there are instructions, limitation grafana panel isn't universal)
  • [ ] GitHub actions for linter/security scanners (added Mafefile with linter)
  • [ ] Optional refactor extra features (to be discussed during review): node-feature-discovery, NRI interegration only as extra values for generic fields (annotations, nodeSelector/nodeAffinity)
  • [ ] Test liveness/readiness probes work as expected in failure case,

Testing:

  • [ ] NMI watch is properly reconfigued (disabled/enabled) basd on VM/env flags (now it is set to RO and it works?!)
  • [ ] Testing in Cluster Manager Systems like (e.g. Ranger,Gardener) different node types VM(1socket,all sockets), bare-metal
  • [ ] Test in different cloud GCP/Azure/AWS

Follow up tasks (in another PRs ideas):

  • pcm-sensor-server: expose missing metrics available in other tools:
    • energy,
    • PCI-e as from pcm-pcie,
    • numa-split
  • Change metrics names (follow Prometheus best practices) - it will be separate PR
  • init container to check permission for all required components (devices/CPU)
  • Configurable collection period (and aggregators history) to limit cpu/memory usage (TODO: do some memory profiling first)

ppalucki avatar Apr 26 '24 07:04 ppalucki

It would be helpful if you could put custom labels on the podMonitor to let prometheus filter based on that.

jcpunk avatar Apr 26 '24 14:04 jcpunk

the FreeBSD check failure is unrelated

rdementi avatar Apr 29 '24 11:04 rdementi

It would be helpful if you could put custom labels on the podMonitor to let prometheus filter based on that.

@jcpunk Thanks for this comment, that is very helpfull indeed - having that there is no longer need to hack prometheus-operator chart to disable podMonitorSelector - I added comments in README/values file about this (check this changes in commit for details (https://github.com/intel/pcm/pull/727/commits/7f2c7073b1be7133140ff58c749e07233a21a81d#diff-618d3b78482c88190c469bb01731f774bb931bcdc14db7b8980691f5745ba08aR151-R152) or this documentation added in values https://github.com/intel/pcm/blob/7f2c7073b1be7133140ff58c749e07233a21a81d/deployment/pcm/values.yaml#L87-L91

Anyway help for pointing this.

ppalucki avatar May 08 '24 17:05 ppalucki

I only checked the commit that adds the Helm chart and it looks good to me. The only thing I wondered is do you really expect users to modify all the available values in the values.yaml. To me some looked not very necessary but I can't judge since I'm not familiar with PCM. LGTM

That is the trade of between flexibility and complexity that I'm finding hard to balance.

I see two options here:

  1. Limit number of features (supported collection methods direct/indirect, podmonitor, NFD) - worth considering, but is hard to predict which features are valueable for others (even for me modest needs I see cases where I want all of them).

  2. Move logic from values to templates (less "raw" values) - e.g. instead of requiring to explicit enviornment values directly, expose values that represent set of them like "--set directCollection=false or true" will set proper combination of (PCM_NO_MSR, PMC_NO_PERF) - but I alread tried that and it would add a lot of complexity in templates (e.g. PCM_NO_MSR conflicts with PCM_NO_PERF and both are related to PCM_USE_UNCORE_PERF), imagine I could do something like this inside template:

if direct:
  if vm:
    if rdt:
       PCM_NO_MSR=0, PCM_NO_PERF=1, PCM_USE_UNCORE_PERF=0
  else:
       PCM_NO_MSR=1 PCM_NO_PERF=0, PCM_USE_UNCORE_PERF=1
  ....
else:
  if vm:
    PCM_NO_MSR=0, PCM_NO_PERF=1, PCM_USE_UNCORE_PERF=0
  else:
    PCM_NO_MSR=1 PCM_NO_PERF=0, PCM_USE_UNCORE_PERF=1    

and try to handle all possible (both proper and inproper combination) - but now rewrite this using go/helm template language - not very maintanable nor readable - I don't really want chart to be validator of possible PCM envs and I'm terified of hardcoding all proper possible combinations - in other words - if you don't like defaults (or other example value files) - then you're responsible for validating that pcm-sensor-server binary will run - but finding right combination is your job in your case.

(there is also another option - e.g. allow to pass *any enviornment/options to pcm - but that is discouraged security practice)

I the end, I decided to just allow use of this pcm chart to pass those "all PCM suppored" environment variables directly - and tried to cover possible value of combinations as different values files.

One more comment, I agree that value file is alread quite big, but not yet as scary as the I see in prometheus node exporter official chart values.yaml - it is 500 lines vs my so far about 100 (but I miss some feature though RBACs, vertical scaling) which I'm using as example of good practicies :)

ppalucki avatar May 08 '24 18:05 ppalucki

Hello, I've attempted to deploy the new Helm charts introduced in this pull request, but I'm encountering an error during the installation process. Here are the details:

image

I think that I have successfully made all steps before 6) Deploy PCM helm chart in paragraph Validation on local kind cluster Could you add information on how to deal with it when this error happens?

Damenus avatar Jun 26 '24 11:06 Damenus

I was trying to run e2e test make e2e-default, but it doesn't work without extra steps. I got error in pod pcm:

Linux Perf: Error when programming INST_RETIRED, error: Permission denied with config 0x1 config1 0x0 for tid -1 leader -1
try running with environment variable PCM_NO_PERF=1

Please, add information about how to add this env

Also, the file ./_kind_with_registry.sh which is downloading during the test execution from Makefile is broken. The Cluster yaml definition is not valid. Removing a broken line helped.

Damenus avatar Jul 17 '24 12:07 Damenus