
ARM support for AWS Graviton v2, v3

Open jjo opened this issue 8 months ago • 0 comments

What would you like to be added?

ARM support for AWS Graviton v2, v3

Why is this needed?

Expanding Kepler support to more ARM architectures would be very beneficial; otherwise our energy observability features are limited to x64 only. This matters most for architectures that are commonly used at some cloud providers: on AWS these are GravitonV2 for the x6g families (c6g, m6g, etc.) and GravitonV3 for the x7g families (c7g, m7g, etc.).

A previous issue is worth noting: https://github.com/sustainable-computing-io/kepler/issues/482#issuecomment-1569995309.

I tried deploying kepler-0.7.10 on an m6g instance with the following details:

[root@ip-10-60-5-69 ~]# lscpu 
Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  1
Core(s) per socket:  32
Socket(s):           1
NUMA node(s):        1
Vendor ID:           ARM
Model:               1
Model name:          Neoverse-N1
Stepping:            r3p1
BogoMIPS:            243.75
L1d cache:           64K
L1i cache:           64K
L2 cache:            1024K
L3 cache:            32768K
NUMA node0 CPU(s):   0-31
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
[root@ip-10-60-5-69 ~]# uname -r
5.10.215-203.850.amzn2.aarch64

It crashed with the following log tail (full log at https://0x0.st/XTMr.txt):

[...]
libbpf: failed to open '/sys/kernel/debug/tracing/events/writeback/writeback_dirty_folio/id': No such file or directory
libbpf: failed to determine tracepoint 'writeback/writeback_dirty_folio' perf event ID: No such file or directory
libbpf: prog 'kepler_write_page_trace': failed to create tracepoint 'writeback/writeback_dirty_folio' perf event: No such file or directory
W0618 20:16:47.934582       1 exporter.go:215] failed to attach tp/writeback/writeback_dirty_folio: failed to attach tracepoint writeback_dirty_folio to program kepler_write_page_trace: no such file or directory. Kepler will not collect page cache write events. This will affect the DRAM power model estimation on VMs.
libbpf: prog 'kepler_read_page_trace': failed to attach: ERROR: strerror_r(-524)=22
W0618 20:16:47.934648       1 exporter.go:227] failed to attach fentry/mark_page_accessed: failed to attach program: errno 524. Kepler will not collect page cache read events. This will affect the DRAM power model estimation on VMs.
I0618 20:16:48.149175       1 exporter.go:270] Successfully load eBPF module from libbpf object
I0618 20:16:48.149206       1 exporter.go:116] Initializing the GPU collector
I0618 20:16:48.149517       1 watcher.go:67] Using in cluster k8s config
I0618 20:16:48.650362       1 watcher.go:138] k8s APIserver watcher was started
I0618 20:16:48.650507       1 prometheus_collector.go:92] Registered Container Prometheus metrics
I0618 20:16:48.650548       1 prometheus_collector.go:97] Registered VM Prometheus metrics
I0618 20:16:48.650569       1 prometheus_collector.go:101] Registered Node Prometheus metrics
panic: runtime error: invalid memory address or nil pointer dereference
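
To help triage the tracepoint warnings above, here is a minimal diagnostic sketch (not part of Kepler) that pulls the tracepoint names libbpf could not resolve out of a saved log and checks whether the running kernel exposes an `id` file for each. The function names `missing_tracepoints` and `tracepoint_available` are hypothetical helpers made up for this example, and the events directory is assumed to be the debugfs path that appears in the log; some systems mount tracefs elsewhere.

```python
import re
from pathlib import Path

# Assumption: events directory taken from the libbpf messages in the log.
# Pass a different root if tracefs is mounted somewhere else on the node.
TRACEFS_EVENTS = Path("/sys/kernel/debug/tracing/events")

# Matches libbpf lines such as:
#   libbpf: failed to determine tracepoint 'writeback/writeback_dirty_folio' perf event ID: ...
_TRACEPOINT_RE = re.compile(r"failed to determine tracepoint '([^/']+)/([^']+)'")


def missing_tracepoints(log_text):
    """Return unique (category, name) pairs that libbpf failed to resolve."""
    seen = []
    for match in _TRACEPOINT_RE.finditer(log_text):
        pair = (match.group(1), match.group(2))
        if pair not in seen:
            seen.append(pair)
    return seen


def tracepoint_available(category, name, events_root=TRACEFS_EVENTS):
    """Return True if <events_root>/<category>/<name>/id exists as a file."""
    return (Path(events_root) / category / name / "id").is_file()
```

Running `missing_tracepoints` on the saved log text and then `tracepoint_available` on each pair (as root on the affected node, since debugfs is usually root-only) should show whether the kernel simply lacks the tracepoint Kepler tries to attach to. This only covers the tracepoint warnings; it does not explain the `errno 524` fentry failure or the final nil pointer panic.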

/cc @nikimanoledaki

jjo, Jun 18 '24 20:06