kepler
Missing support for linux/arm64/v8
What happened?
First of all: great work. The operator is working great for me 👍. Thx a lot.
When I deploy the operator to my Kubernetes cluster (k3s) I receive the error:
Failed to pull image "quay.io/sustainable_computing_io/kepler:release-0.6.1": no matching manifest for linux/arm64/v8 in the manifest list entries
As far as I can see, there is only an amd64 image in the container registry. Do you people have any plans for multiarch support?
What did you expect to happen?
A working image pull for arm64 systems. In this case: an M2 Mac.
How can we reproduce it (as minimally and precisely as possible)?
Run the Helm chart on an M1 or M2 Mac in Docker Desktop Kubernetes.
Anything else we need to know?
No response
Kepler image tag
Kubernetes version
v1.28.2
Cloud provider or bare metal
OS version
# On Linux:
$ cat /etc/os-release
macos 13.4.1
$ uname -a
Darwin MBP 22.5.0 Darwin Kernel Version 22.5.0: Thu Jun 8 22:22:20 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T6000 arm64
Install tools
Kepler deployment config
For on kubernetes:
$ KEPLER_NAMESPACE=kepler
# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
# paste output here
# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
For standalone:
put your Kepler command argument here
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)
@beneiltis thanks for testing kepler on ARM. We are still working on multi-arch image builds at the moment. The currently supported ARM platform is Ampere, since Ampere CPUs have a hwmon that reports power consumption. If we knew how to get power readings from e.g. Apple silicon, we would love to support it too.
cc @vimalk78
Good to know @rootfs. I am no specialist in this field, but I can spend my nights looking into it. Maybe I'll find something to contribute.
@rootfs To enable arm64 with the latest kepler version, we need to change the cpuid install approach in our build process, as this package only has an x86 version, which fails the arm64 image build. Here are some suggestions:
- remove the cpuid install from the build image.
- remove the file copy of cpuid from the build image to the kepler image.
- install cpuid during the image build for x86 only.

Optionally, for build performance consideration: I am not sure we need all the features of elfutils, or is there a way we can install elfutils from RPM packages, for x86, arm64 and s390?
There seems to be no release of cpuid for arm:
sh-5.1# yum install -y cpuid
Updating Subscription Management repositories.
Unable to read consumer identity
This system is not registered with an entitlement server. You can use subscription-manager to register.
Extra Packages for Enterprise Linux 9 - aarch64 557 kB/s | 20 MB 00:36
Extra Packages for Enterprise Linux 9 openh264 (From Cisco) - aarch64 572 B/s | 2.5 kB 00:04
No match for argument: cpuid
Error: Unable to find a match: cpuid
https://github.com/sustainable-computing-io/kepler/pull/1169 tries to support multiple architectures for the base image.
Great. Thx @SamYuan1990 :-)
What do we now have to do to make it run on Apple silicon? I only found
powermetrics --show-process-energy, which is a good starting point for me.
OK, I checked out the repo, and with the changes I can now build Dockerfile.builder and Dockerfile and run them on my Mac with the correct architecture. Awesome.
When I replace the DaemonSet image with my self-built image I get the following errors:
I0116 07:44:19.822600 1 gpu.go:46] Failed to init nvml, err: failed to init nvml. ERROR_LIBRARY_NOT_FOUND E0116 07:44:19.824146 1 utils.go:140] getCPUArch failure: open /sys/devices/cpu/caps/pmu_name: no such file or directory I0116 07:44:19.826799 1 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127 I0116 07:44:19.839222 1 exporter.go:155] Kepler running on version: 1.20.10 I0116 07:44:19.839256 1 config.go:275] using gCgroup ID in the BPF program: true I0116 07:44:19.839288 1 config.go:277] kernel version: 6.5 I0116 07:44:19.839340 1 exporter.go:167] LibbpfBuilt: true, BccBuilt: false I0116 07:44:19.839343 1 exporter.go:186] EnabledBPFBatchDelete: true I0116 07:44:19.839382 1 rapl_msr_util.go:129] failed to open path /dev/cpu/0/msr: no such file or directory I0116 07:44:19.839442 1 power.go:72] Unable to obtain power, use estimate method I0116 07:44:19.839462 1 redfish.go:169] failed to get redfish credential file path I0116 07:44:19.839485 1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM? I0116 07:44:19.839513 1 power.go:72] using none to obtain power I0116 07:44:19.839524 1 exporter.go:201] Initializing the GPU collector I0116 07:44:25.841021 1 watcher.go:66] Using in cluster k8s config libbpf: map 'cpu_instructions': found type = 2. libbpf: map 'cpu_instructions': found key [6], sz = 4. libbpf: map 'cpu_instructions': found value [12], sz = 8. libbpf: map 'cpu_instructions': found max_entries = 128. libbpf: map 'cache_miss_hc_reader': at sec_idx 13, offset 256. libbpf: map 'cache_miss_hc_reader': found type = 4. libbpf: map 'cache_miss_hc_reader': found key [2], sz = 4. libbpf: map 'cache_miss_hc_reader': found value [6], sz = 4. libbpf: map 'cache_miss_hc_reader': found max_entries = 128. libbpf: map 'cache_miss': at sec_idx 13, offset 288. libbpf: map 'cache_miss': found type = 2. libbpf: map 'cache_miss': found key [6], sz = 4. libbpf: map 'cache_miss': found value [12], sz = 8. 
libbpf: map 'cache_miss': found max_entries = 128. libbpf: map 'cpu_freq_array': at sec_idx 13, offset 320. libbpf: map 'cpu_freq_array': found type = 2. libbpf: map 'cpu_freq_array': found key [6], sz = 4. libbpf: map 'cpu_freq_array': found value [6], sz = 4. libbpf: map 'cpu_freq_array': found max_entries = 128. libbpf: map 'arm64_ke.data' (global data): at sec_idx 11, offset 0, flags 400. libbpf: map 11 is "arm64_ke.data" libbpf: map 'arm64_ke.bss' (global data): at sec_idx 12, offset 0, flags 400. libbpf: map 12 is "arm64_ke.bss" libbpf: sec '.reltracepoint/sched/sched_switch': collecting relocation for section(3) 'tracepoint/sched/sched_switch' libbpf: sec '.reltracepoint/sched/sched_switch': relo #0: insn #2 against 'sample_rate' libbpf: prog 'kepler_trace': found data map 11 (arm64_ke.data, sec 11, off 0) for insn 2 libbpf: sec '.reltracepoint/sched/sched_switch': relo #1: insn #6 against 'counter_sched_switch' libbpf: prog 'kepler_trace': found data map 12 (arm64_ke.bss, sec 12, off 0) for insn 6 libbpf: sec '.reltracepoint/sched/sched_switch': relo #2: insn #32 against 'cpu_cycles_hc_reader' libbpf: prog 'kepler_trace': found map 2 (cpu_cycles_hc_reader, sec 13, off 64) for insn #32 libbpf: sec '.reltracepoint/sched/sched_switch': relo #3: insn #51 against 'cpu_cycles' libbpf: prog 'kepler_trace': found map 3 (cpu_cycles, sec 13, off 96) for insn #51 libbpf: sec '.reltracepoint/sched/sched_switch': relo #4: insn #65 against 'cpu_cycles' libbpf: prog 'kepler_trace': found map 3 (cpu_cycles, sec 13, off 96) for insn #65 libbpf: sec '.reltracepoint/sched/sched_switch': relo #5: insn #70 against 'cpu_ref_cycles_hc_reader' libbpf: prog 'kepler_trace': found map 4 (cpu_ref_cycles_hc_reader, sec 13, off 128) for insn #70 libbpf: sec '.reltracepoint/sched/sched_switch': relo #6: insn #83 against 'cpu_ref_cycles' libbpf: prog 'kepler_trace': found map 5 (cpu_ref_cycles, sec 13, off 160) for insn #83 libbpf: sec '.reltracepoint/sched/sched_switch': relo #7: insn 
#97 against 'cpu_ref_cycles' libbpf: prog 'kepler_trace': found map 5 (cpu_ref_cycles, sec 13, off 160) for insn #97 libbpf: sec '.reltracepoint/sched/sched_switch': relo #8: insn #102 against 'cpu_instructions_hc_reader' libbpf: prog 'kepler_trace': found map 6 (cpu_instructions_hc_reader, sec 13, off 192) for insn #102 libbpf: sec '.reltracepoint/sched/sched_switch': relo #9: insn #117 against 'cpu_instructions' libbpf: prog 'kepler_trace': found map 7 (cpu_instructions, sec 13, off 224) for insn #117 libbpf: sec '.reltracepoint/sched/sched_switch': relo #10: insn #129 against 'cpu_instructions' libbpf: prog 'kepler_trace': found map 7 (cpu_instructions, sec 13, off 224) for insn #129 libbpf: sec '.reltracepoint/sched/sched_switch': relo #11: insn #134 against 'cache_miss_hc_reader' libbpf: prog 'kepler_trace': found map 8 (cache_miss_hc_reader, sec 13, off 256) for insn #134 libbpf: sec '.reltracepoint/sched/sched_switch': relo #12: insn #146 against 'cache_miss' libbpf: prog 'kepler_trace': found map 9 (cache_miss, sec 13, off 288) for insn #146 libbpf: sec '.reltracepoint/sched/sched_switch': relo #13: insn #160 against 'cache_miss' libbpf: prog 'kepler_trace': found map 9 (cache_miss, sec 13, off 288) for insn #160 libbpf: sec '.reltracepoint/sched/sched_switch': relo #14: insn #168 against 'cpu_freq_array' libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 13, off 320) for insn #168 libbpf: sec '.reltracepoint/sched/sched_switch': relo #15: insn #182 against 'cpu_freq_array' libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 13, off 320) for insn #182 libbpf: sec '.reltracepoint/sched/sched_switch': relo #16: insn #194 against 'cpu_freq_array' libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 13, off 320) for insn #194 libbpf: sec '.reltracepoint/sched/sched_switch': relo #17: insn #218 against 'cpu_freq_array' libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 13, off 320) for insn #218 libbpf: sec 
'.reltracepoint/sched/sched_switch': relo #18: insn #227 against 'pid_time' libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 13, off 32) for insn #227 libbpf: sec '.reltracepoint/sched/sched_switch': relo #19: insn #235 against 'pid_time' libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 13, off 32) for insn #235 libbpf: sec '.reltracepoint/sched/sched_switch': relo #20: insn #247 against 'pid_time' libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 13, off 32) for insn #247 libbpf: sec '.reltracepoint/sched/sched_switch': relo #21: insn #253 against 'processes' libbpf: prog 'kepler_trace': found map 0 (processes, sec 13, off 0) for insn #253 libbpf: sec '.reltracepoint/sched/sched_switch': relo #22: insn #273 against 'processes' libbpf: prog 'kepler_trace': found map 0 (processes, sec 13, off 0) for insn #273 libbpf: sec '.reltracepoint/sched/sched_switch': relo #23: insn #300 against 'processes' libbpf: prog 'kepler_trace': found map 0 (processes, sec 13, off 0) for insn #300 libbpf: sec '.reltracepoint/irq/softirq_entry': collecting relocation for section(5) 'tracepoint/irq/softirq_entry' libbpf: sec '.reltracepoint/irq/softirq_entry': relo #0: insn #5 against 'processes' libbpf: prog 'kepler_irq_trace': found map 0 (processes, sec 13, off 0) for insn #5 libbpf: sec '.relkprobe/mark_page_accessed': collecting relocation for section(7) 'kprobe/mark_page_accessed' libbpf: sec '.relkprobe/mark_page_accessed': relo #0: insn #4 against 'processes' libbpf: prog 'kprobe__mark_page_accessed': found map 0 (processes, sec 13, off 0) for insn #4 libbpf: sec '.relkprobe/set_page_dirty': collecting relocation for section(9) 'kprobe/set_page_dirty' libbpf: sec '.relkprobe/set_page_dirty': relo #0: insn #4 against 'processes' libbpf: prog 'kprobe__set_page_dirty': found map 0 (processes, sec 13, off 0) for insn #4 libbpf: map 'processes': created successfully, fd=9 libbpf: map 'pid_time': created successfully, fd=10 libbpf: map 'cpu_cycles_hc_reader': created 
successfully, fd=11 libbpf: map 'cpu_cycles': created successfully, fd=12 libbpf: map 'cpu_ref_cycles_hc_reader': created successfully, fd=13 libbpf: map 'cpu_ref_cycles': created successfully, fd=14 libbpf: map 'cpu_instructions_hc_reader': created successfully, fd=15 libbpf: map 'cpu_instructions': created successfully, fd=16 libbpf: map 'cache_miss_hc_reader': created successfully, fd=17 libbpf: map 'cache_miss': created successfully, fd=18 libbpf: map 'cpu_freq_array': created successfully, fd=19 libbpf: map 'arm64_ke.data': created successfully, fd=20 libbpf: map 'arm64_ke.bss': created successfully, fd=21 libbpf: failed to open '/sys/kernel/tracing/events/sched/sched_switch/id': No such file or directory libbpf: failed to determine tracepoint 'sched/sched_switch' perf event ID: No such file or directory libbpf: prog 'kepler_trace': failed to create tracepoint 'sched/sched_switch' perf event: No such file or directory I0116 07:44:25.953230 1 bpf_perf.go:135] failed to attach bpf with libbpf: failed to attach sched/sched_switch: failed to attach tracepoint sched_switch to program kepler_trace: no such file or directory, fall back to bcc attachment I0116 07:44:25.953312 1 exporter.go:237] failed to start : failed to attach bpf assets: no bcc build tag I0116 07:44:25.953385 1 exporter.go:269] Started Kepler in 6.114231877s
As far as I can see from the GitHub workflows, you are not using ARM runners. If you like, we can contribute our runners to the project. We would be more than happy to help :-)
@beneiltis you are more than welcome to contribute to the project :)
there seems to be no release of cpuid for arm
Please note that cpuid is a tool for detecting x86 CPU features/capabilities; its author is Todd Allen.
You can also see ARM-related functionality on his website.
I believe it is more useful than the current code in Kepler for ARM CPU model identification.
Furthermore, as in my recent feature commit for a CPUID-alternative solution: since cpuid is not available for ARM platforms, we can also use the ARM CPU section in cpus.yaml to maintain the known ARM CPU models as a workaround.
@beneiltis thanks for the input! Which platform did you run kepler and eBPF on?
Great. Thx @SamYuan1990 :-) What do we now have to do to make it run on Apple silicon? I only found powermetrics --show-process-energy, which is a good starting point for me.
Well, to be honest, as we discussed at the kepler community meeting, I just made the arm64 image with the latest code base. As @jiere said, I suppose we need to discuss further: since cpuid is x86-only, maybe we need a build tag for that part of the code so it doesn't break arm64 or s390x (@jiangphcn in the loop here to notice him). As @rootfs said, and if I understand correctly, the arm64 version of the code currently only supports Redfish, and you need to configure it correctly.
@beneiltis @YaSuenag latest kepler supports multiarch (thanks to @SamYuan1990 ), can you give it a try?
I am using Apple M1 Max, macOS 13.4.1.
Yes, the container now starts correctly. Awesome :-)
But I don't get any energy readings. I didn't expect that to work, though, because Kepler does not support Apple silicon, right?
But now when I update my legacy testing cluster (12-year-old hardware) I get a nil pointer dereference. I'll post it here, but I can also create a new ticket for it:
`I0125 15:16:47.276278 1 libbpf_attacher.go:188] Successfully load eBPF module from libbpf object I0125 15:16:47.276372 1 process_energy.go:114] Using the Ratio/DynPower Power Model to estimate Process Platform Power I0125 15:16:47.276383 1 process_energy.go:115] Process feature names: [cpu_instructions] I0125 15:16:47.276458 1 process_energy.go:124] Using the Ratio/DynPower Power Model to estimate Process Component Power I0125 15:16:47.276467 1 process_energy.go:125] Process feature names: [cpu_instructions cpu_instructions cache_miss gpu_sm_util] I0125 15:16:47.276484 1 process_energy.go:114] Using the Ratio/DynPower Power Model to estimate Process Platform Power I0125 15:16:47.276492 1 process_energy.go:115] Process feature names: [cpu_instructions] I0125 15:16:47.276507 1 process_energy.go:124] Using the Ratio/DynPower Power Model to estimate Process Component Power I0125 15:16:47.276533 1 process_energy.go:125] Process feature names: [cpu_instructions cpu_instructions cache_miss gpu_sm_util] I0125 15:16:47.276877 1 node_platform_energy.go:52] Using the LinearRegressor/AbsPower Power Model to estimate Node Platform Power I0125 15:16:47.277032 1 exporter.go:269] Started Kepler in 155.044557ms panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x827598]
goroutine 16 [running]: github.com/sustainable-computing-io/kepler/pkg/collector/stats/types.(*UInt64StatCollection).AddDeltaStat(0x0, {0x1998506, 0x7}, 0x0) /workspace/pkg/collector/stats/types/types.go:108 +0x38 github.com/sustainable-computing-io/kepler/pkg/collector/resourceutilization/bpf.updateSWCounters(0x178f120?, 0xc0003c26c0, 0x1ba21?) /workspace/pkg/collector/resourceutilization/bpf/process_bpf_collector.go:43 +0x117 github.com/sustainable-computing-io/kepler/pkg/collector/resourceutilization/bpf.UpdateProcessBPFMetrics(0xc0001776f0?) /workspace/pkg/collector/resourceutilization/bpf/process_bpf_collector.go:121 +0x69c github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateProcessResourceUtilizationMetrics(0xc000477950?, 0x0?) /workspace/pkg/collector/metric_collector.go:200 +0x54 github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateResourceUtilizationMetrics(0xc000477950) /workspace/pkg/collector/metric_collector.go:159 +0x56 github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).Update(0xb2d05e00?) /workspace/pkg/collector/metric_collector.go:110 +0x48 github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start.func1() /workspace/pkg/manager/manager.go:73 +0x7b created by github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start /workspace/pkg/manager/manager.go:65 +0x6a Stream closed EOF for mogenius/kepler-s5s6z (kepler-exporter)`
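For what it's worth, a panic at `AddDeltaStat(0x0, …)` is the classic Go nil-receiver pattern: the method runs on a collection that was never initialized and then dereferences it. A minimal sketch of the failure mode and a defensive guard (all names are hypothetical stand-ins, not kepler's actual types):

```go
package main

import "fmt"

// statCollection is a hypothetical stand-in for a per-counter stat map. A
// method called on a nil *statCollection must not touch c.stats unguarded,
// or it panics just like the trace above.
type statCollection struct {
	stats map[string]uint64
}

func (c *statCollection) AddDeltaStat(key string, delta uint64) {
	if c == nil || c.stats == nil {
		return // guard: counter was never initialized (e.g. unsupported hardware)
	}
	c.stats[key] += delta
}

func main() {
	var uninit *statCollection
	uninit.AddDeltaStat("cpu_instructions", 42) // safe no-op instead of SIGSEGV

	ok := &statCollection{stats: map[string]uint64{}}
	ok.AddDeltaStat("cpu_instructions", 42)
	fmt.Println(ok.stats["cpu_instructions"]) // 42
}
```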
@beneiltis kepler doesn't have an Apple M1 energy sensor reader yet; it is something we haven't started.
btw, I have to disable the arm64 image build because libbpf has an architecture dependency. Will let you know when this is fixed.
I do not currently have an ARM server at my disposal, so I cannot evaluate the image now, sorry. (I believe I will be able to in a few months...)
fix is in #1255
OK, arm64 is now working 👍 Now I guess we/I need to come up with something to support Apple silicon (M1-M3). To be honest, I guess this is some kind of edge case (who would run a real cluster on their MacBook?), but it would be awesome to demonstrate Kepler's abilities in a local setup. The compute/power ratio of these systems is really incredible.
Sorry for the late reply.
I tried the latest Kepler ( quay.io/sustainable_computing_io/kepler:release-0.7.8 ) on Fedora 39 on Altra Q80-30 (HPE RL300) with Kubernetes v1.29. It looks good, but I saw some strange logs:
I0401 06:54:57.584191 1 apm_xgene_sysfs.go:61] Found power input file: /sys/class/hwmon/hwmon0/power1_input
I0401 06:54:57.584199 1 power.go:67] use Ampere Xgene sysfs to obtain power
<snip>
I0401 06:54:57.597915 1 apm_xgene_sysfs.go:61] Found power input file: /sys/class/hwmon/hwmon0/power1_input
I0401 06:54:57.598157 1 apm_xgene_sysfs.go:61] Found power input file: /sys/class/hwmon/hwmon0/power1_input
I0401 06:54:57.598406 1 apm_xgene_sysfs.go:61] Found power input file: /sys/class/hwmon/hwmon0/power1_input
I0401 06:54:57.598671 1 apm_xgene_sysfs.go:61] Found power input file: /sys/class/hwmon/hwmon0/power1_input
<snip>
libbpf: prog 'kprobe__finish_task_switch': failed to create kprobe 'finish_task_switch+0x0' perf event: No such file or directory
I0401 06:54:57.730645 1 libbpf_attacher.go:128] failed to attach kprobe/finish_task_switch: failed to attach finish_task_switch k(ret)probe to program kprobe__finish_task_switch: no such file or directory. Try finish_task_switch.isra.0 -> (1)
<snip>
I0401 06:54:57.767582 1 libbpf_attacher.go:195] Successfully load eBPF module from libbpf object
I0401 06:54:57.767626 1 process_energy.go:114] Using the Ratio/DynPower Power Model to estimate Process Platform Power -> (2)
<snip>
I0401 06:54:57.768211 1 exporter.go:270] Started Kepler in 184.47496ms
I0401 06:55:00.826484 1 apm_xgene_sysfs.go:61] Found power input file: /sys/class/hwmon/hwmon0/power1_input
I0401 06:55:00.827290 1 apm_xgene_sysfs.go:61] Found power input file: /sys/class/hwmon/hwmon0/power1_input
I0401 06:55:03.793142 1 apm_xgene_sysfs.go:61] Found power input file: /sys/class/hwmon/hwmon0/power1_input
I0401 06:55:03.793628 1 apm_xgene_sysfs.go:61] Found power input file: /sys/class/hwmon/hwmon0/power1_input
I0401 06:55:06.787461 1 apm_xgene_sysfs.go:61] Found power input file: /sys/class/hwmon/hwmon0/power1_input
I0401 06:55:06.787998 1 apm_xgene_sysfs.go:61] Found power input file: /sys/class/hwmon/hwmon0/power1_input
I0401 06:55:09.788545 1 apm_xgene_sysfs.go:61] Found power input file: /sys/class/hwmon/hwmon0/power1_input
I0401 06:55:09.789155 1 apm_xgene_sysfs.go:61] Found power input file: /sys/class/hwmon/hwmon0/power1_input
<snip>
-> (3)
- It looks like it fails to probe finish_task_switch. Should we fix this? My kernel (6.7.10-200.fc39.aarch64) has finish_task_switch.isra.0 in /proc/kallsyms.
- Is the Power Model used by default? I'd like to use real measurement data only. Should I tweak something in values.yaml? I installed Kepler via Helm with the defaults (no values.yaml).
- I saw a lot of log entries about apm_xgene_sysfs.go:61. They seem to occur twice every 3 seconds. Is it a bug?
@YaSuenag thanks for the update.
For 1), kepler loads the eBPF program and first tries to attach finish_task_switch; if that fails, it then attaches finish_task_switch.isra.0. The error message `failed to attach kprobe/finish_task_switch: failed to attach finish_task_switch k(ret)probe to program kprobe__kprobe__finish_task_switch: no such file or directory. Try finish_task_switch.isra.0 -> (1)` is benign.
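That try-then-fallback flow can be sketched roughly like this (the `attach` helper below is a simulated stand-in, not the real libbpf call):

```go
package main

import (
	"errors"
	"fmt"
)

// attach stands in for the real libbpf kprobe attach call; it succeeds only
// for symbols present in the (simulated) kernel symbol table.
func attach(sym string, kallsyms map[string]bool) error {
	if !kallsyms[sym] {
		return fmt.Errorf("failed to attach %s: no such file or directory", sym)
	}
	return nil
}

// attachWithFallback mirrors the behavior described above: try the plain
// symbol first, and on failure retry the compiler-suffixed .isra.0 variant.
func attachWithFallback(kallsyms map[string]bool) (string, error) {
	if attach("finish_task_switch", kallsyms) == nil {
		return "finish_task_switch", nil
	}
	if attach("finish_task_switch.isra.0", kallsyms) == nil {
		return "finish_task_switch.isra.0", nil
	}
	return "", errors.New("both attach attempts failed")
}

func main() {
	// Simulate a kernel (like 6.7.10-200.fc39.aarch64) that only exports the
	// .isra.0 variant, so the first attempt logs a benign error.
	kallsyms := map[string]bool{"finish_task_switch.isra.0": true}
	sym, err := attachWithFallback(kallsyms)
	fmt.Println(sym, err) // finish_task_switch.isra.0 <nil>
}
```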
For 2), the power model is not used in this case, since kepler runs in a bare-metal environment.
For 3), yes, it is a bug. Can you create a PR and change the verbosity level of this line from 1 to e.g. 5? That will make the logs go away. Thanks.
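The effect of bumping the log level can be sketched with a klog-style verbosity gate (a simplified stand-in, not the klog API itself): a message is emitted only when its level is at or below the configured threshold, so moving the repeated message from V(1) to V(5) hides it at normal verbosity.

```go
package main

import "fmt"

// verbosityLogger mimics the klog V(level) gate in spirit: a message is
// printed only if its level is at or below the configured threshold.
type verbosityLogger struct{ threshold int }

func (l verbosityLogger) V(level int) bool { return level <= l.threshold }

func main() {
	l := verbosityLogger{threshold: 1} // a typical default verbosity
	for _, m := range []struct {
		level int
		msg   string
	}{
		{1, "use Ampere Xgene sysfs to obtain power"},
		{5, "Found power input file: /sys/class/hwmon/hwmon0/power1_input"},
	} {
		if l.V(m.level) {
			fmt.Printf("V(%d): %s\n", m.level, m.msg) // only the V(1) line prints
		}
	}
}
```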
Thanks @rootfs! I opened PR #1322. It works fine in my environment.