
Incorrect measurements on ARM (Ampere Altra Max)

Open simonarys opened this issue 2 years ago • 10 comments

What happened?

We are interested in running Kepler on an ARM Ampere Altra Max machine (bare metal). We managed to successfully build the Kepler image from the Dockerfiles available in the build/ folder on the main GitHub branch (hash: 88c82f384f10ba4deb39675b2c88450bc28ee7b8). We then ran the image on a Kubernetes cluster, both on an x86 machine and on the ARM machine. On the ARM machine, however, we observed an anomaly in the Grafana dashboard: it shows unexpectedly low energy consumption metrics, while the "system" namespace shows unrealistically high power consumption (more than 1 million W). Moreover, the DRAM energy measurements are always 0. See the pictures below.

Picture of very low energy consumption metrics with DRAM at 0 (namespace kepler)
Picture of very high energy consumption with DRAM at 0 (namespace system)

We would appreciate any insights or guidance on potential ARM-specific optimizations or configurations that might be necessary to ensure accurate energy consumption measurements.

To aid in troubleshooting, we attached logs and configuration details. Please let us know if further information is needed.

What did you expect to happen?

We expected results similar to those obtained when running Kepler on an x86 Intel machine, since we followed the same steps on both architectures to build and deploy Kepler. On the x86 Intel machine we obtained plausible results, not far from our PDU's power outlet metrics.

How can we reproduce it (as minimally and precisely as possible)?

We had to change a few lines in the Dockerfiles to target the ARM architecture instead of x86, because only Dockerfile.bcc.base has an ARM version available in the GitHub repo.

We built the following images using the Dockerfiles from the /build folder in this order:

  1. bcc.base
  2. bcc.builder
  3. kernel-source-images
  4. bcc.kepler
  5. manifest

For bcc.base, we built the Dockerfile with the arm64 extension (Dockerfile.bcc.base.arm64) that is already in the GitHub repository.

For bcc.builder, we replaced the FROM line to use the bcc.base image we just built, and replaced amd64 with arm64 on line 10:

RUN curl -LO https://go.dev/dl/go1.18.10.linux-arm64.tar.gz; mkdir -p /usr/local; tar -C /usr/local -xvzf go1.18.10.linux-arm64.tar.gz; rm -f go1.18.10.linux-arm64.tar.gz

For kernel-source-images, we replaced the whole file with the following and did not use the build-kernel-source-images.sh script:

FROM registry.access.redhat.com/ubi8/ubi

ARG ARCH=aarch64

RUN yum install -y http://mirror.centos.org/centos/8-stream/BaseOS/aarch64/os/Packages/centos-gpg-keys-8-6.el8.noarch.rpm && \
    yum install -y http://mirror.centos.org/centos/8-stream/BaseOS/aarch64/os/Packages/centos-stream-repos-8-6.el8.noarch.rpm

RUN yum install -y kernel-devel

For bcc.kepler, we changed the FROM statements on lines 1 and 25 to use our previously built images (builder, then base) and moved the file to the root of the repository before building it with Docker. A rough sketch of the overall build sequence is shown below.
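For reference, the overall build sequence looked roughly like this; the image names, tags and Dockerfile names below are illustrative placeholders, not the exact ones we used:

# Build the ARM images in order; names, tags and file names are placeholders.
docker build -f build/Dockerfile.bcc.base.arm64 -t ourrepo/kepler-bcc-base:arm64 .
docker build -f build/Dockerfile.bcc.builder -t ourrepo/kepler-bcc-builder:arm64 .      # FROM edited to point at the base image above
docker build -f build/Dockerfile.kernel-source -t ourrepo/kepler-kernel-source:arm64 .  # file replaced as shown above
docker build -f Dockerfile.bcc.kepler -t ourrepo/kepler:latest-bcc-arm64 .              # moved to the repo root, FROMs edited
docker push ourrepo/kepler:latest-bcc-arm64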

For the manifest, we first built it using:

make build-manifest OPTS="CI_DEPLOY PROMETHEUS_DEPLOY"

Then, we replaced the image source at line 152 in _output/generated_manifest/deployment.yaml with our Kepler image built in the previous step and uploaded to DockerHub.

Lastly, we deployed the manifest to our empty Kubernetes (Kind) cluster.
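For reference, the image replacement and deployment can be done roughly like this; the DockerHub image name is an illustrative placeholder:

# Point the generated manifest at our own image (name is a placeholder), then deploy it.
vi _output/generated_manifest/deployment.yaml        # line 152: image: docker.io/ourrepo/kepler:latest-bcc-arm64
kubectl apply -f _output/generated_manifest/deployment.yaml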

Anything else we need to know?

We are using a Kind cluster.

Kepler pod logs :

I1116 09:52:00.944197    4588 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I1116 09:52:01.039707    4588 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I1116 09:52:01.055616    4588 exporter.go:157] Kepler running on version: bb2b1bb-dirty
I1116 09:52:01.055722    4588 config.go:274] using gCgroup ID in the BPF program: true
I1116 09:52:01.055743    4588 config.go:276] kernel version: 5.15
I1116 09:52:01.055779    4588 exporter.go:169] LibbpfBuilt: false, BccBuilt: true
I1116 09:52:01.055917    4588 config.go:207] kernel source dir is set to /usr/share/kepler/kernel_sources
I1116 09:52:01.055961    4588 exporter.go:188] EnabledBPFBatchDelete: true
I1116 09:52:01.055985    4588 rapl_msr_util.go:129] failed to open path /dev/cpu/0/msr: no such file or directory
I1116 09:52:01.056179    4588 apm_xgene_sysfs.go:57] Found power input file: /sys/class/hwmon/hwmon3/power1_input
I1116 09:52:01.056277    4588 power.go:66] use Ampere Xgene sysfs to obtain power
I1116 09:52:01.056308    4588 redfish.go:173] failed to initialize node credential: no supported node credential implementation
I1116 09:52:01.064097    4588 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I1116 09:52:01.172250    4588 exporter.go:203] Initializing the GPU collector
I1116 09:52:07.175452    4588 watcher.go:66] Using in cluster k8s config
I1116 09:52:07.276265    4588 watcher.go:134] k8s APIserver watcher was started
cannot attach kprobe, probe entry may not exist
I1116 09:52:08.550216    4588 bcc_attacher.go:94] attaching kprobe to finish_task_switch failed, trying finish_task_switch.isra.0 instead
W1116 09:52:08.567229    4588 bcc_attacher.go:113] failed to load kprobe__set_page_dirty: Module: unable to find kprobe__set_page_dirty
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
W1116 09:52:08.758847    4588 bcc_attacher.go:119] failed to attach kprobe/set_page_dirty or mark_buffer_dirty: failed to attach BPF kprobe: bad file descriptor. Kepler will not collect page cache write events. This will affect the DRAM power model estimation on VMs.
W1116 09:52:08.758962    4588 bcc_attacher.go:125] failed to load kprobe__mark_page_accessed: Module: unable to find kprobe__mark_page_accessed
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
W1116 09:52:08.858818    4588 bcc_attacher.go:129] failed to attach kprobe/mark_page_accessed: failed to attach BPF kprobe: bad file descriptor. Kepler will not collect page cache read events. This will affect the DRAM power model estimation on VMs.
perf_event_open: No such file or directory
W1116 09:52:08.919712    4588 bcc_attacher.go:142] could not attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory. Are you using a VM?
I1116 09:52:08.946937    4588 bcc_attacher.go:150] Successfully load eBPF module from using bcc
I1116 09:52:08.946964    4588 bcc_attacher.go:208] Successfully load eBPF module from bcc with option: [-DMAP_SIZE=10240 -DNUM_CPUS=128 -DSAMPLE_RATE=0 -DSET_GROUP_ID]
I1116 09:52:08.947046    4588 container_energy.go:114] Using the Ratio/DynPower Power Model to estimate Container Platform Power
I1116 09:52:08.947058    4588 container_energy.go:115] Container feature names: [bpf_cpu_time_us]
I1116 09:52:08.947078    4588 container_energy.go:124] Using the Ratio/DynPower Power Model to estimate Container Component Power
I1116 09:52:08.947089    4588 container_energy.go:125] Container feature names: [bpf_cpu_time_us bpf_cpu_time_us bpf_cpu_time_us   gpu_sm_util]
I1116 09:52:08.947111    4588 process_power.go:113] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I1116 09:52:08.947121    4588 process_power.go:114] Container feature names: [bpf_cpu_time_us]
I1116 09:52:08.947136    4588 process_power.go:123] Using the Ratio/DynPower Power Model to estimate Process Component Power
I1116 09:52:08.947147    4588 process_power.go:124] Container feature names: [bpf_cpu_time_us bpf_cpu_time_us bpf_cpu_time_us   gpu_sm_util]
I1116 09:52:08.947426    4588 node_platform_energy.go:53] Using the LinearRegressor/AbsPower Power Model to estimate Node Platform Power
I1116 09:52:08.947695    4588 apm_xgene_sysfs.go:57] Found power input file: /sys/class/hwmon/hwmon3/power1_input
I1116 09:52:08.947932    4588 apm_xgene_sysfs.go:57] Found power input file: /sys/class/hwmon/hwmon3/power1_input
I1116 09:52:08.948172    4588 apm_xgene_sysfs.go:57] Found power input file: /sys/class/hwmon/hwmon3/power1_input
I1116 09:52:08.948325    4588 exporter.go:267] Started Kepler in 7.89294259s
I1116 09:52:16.433631    4588 apm_xgene_sysfs.go:57] Found power input file: /sys/class/hwmon/hwmon3/power1_input
I1116 09:52:16.434094    4588 apm_xgene_sysfs.go:57] Found power input file: /sys/class/hwmon/hwmon3/power1_input
I1116 09:52:23.974378    4588 apm_xgene_sysfs.go:57] Found power input file: /sys/class/hwmon/hwmon3/power1_input

Kepler image tag

latest-bcc built on ARM by ourselves

Kubernetes version

$ kubectl version
Client Version: v1.28.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3

Cloud provider or bare metal

Bare metal: Ampere Altra Max

Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 128
  On-line CPU(s) list:  0-127
Vendor ID:              ARM
  Model name:           Neoverse-N1
    Model:              1
    Thread(s) per core: 1
    Core(s) per socket: 128
    Socket(s):          1
    Stepping:           r3p1
    Frequency boost:    disabled
    CPU max MHz:        3000.0000
    CPU min MHz:        1000.0000
    BogoMIPS:           50.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdr
                        dm lrcpc dcpop asimddp ssbs
Caches (sum of all):    
  L1d:                  8 MiB (128 instances)
  L1i:                  8 MiB (128 instances)
  L2:                   128 MiB (128 instances)

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux calcul9 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:23:16 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

Install tools

Kepler deployment config

Container runtime (CRI) and version (if applicable)

containerd://1.7.1

Related plugins (CNI, CSI, ...) and versions (if applicable)

simonarys avatar Nov 21 '23 17:11 simonarys

We met a similar issue on a new Intel platform; when we switched to the libbpf-based Kepler image, the issue went away. Please give that a try, since the latest Kepler image is now built with libbpf by default.

jiere avatar Dec 05 '23 14:12 jiere

@simonarys please check if the libbpf image fixes this issue. For DRAM power, the current hwmon used by Kepler doesn't support DRAM power reporting (https://docs.kernel.org/hwmon/xgene-hwmon.html). We would need to support a much newer hwmon (https://docs.kernel.org/hwmon/smpro-hwmon.html) to get DRAM power, but I don't have an Ampere setup right now.
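You can check which hwmon driver the node exposes and which power inputs it provides with something like this (the hwmon index is illustrative; on your node the power input showed up under hwmon3):

# List each hwmon device, its driver name and any power*_input files it exposes.
for d in /sys/class/hwmon/hwmon*; do
  echo "$d: $(cat $d/name)"
  ls "$d"/power*_input 2>/dev/null
done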

rootfs avatar Dec 05 '23 16:12 rootfs

@simonarys btw, if you build the libbpf image for arm64, the latest Kepler build and base images from @vimalk78 are based on ubi9 and support multiarch. That will make building the ARM image much easier.
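For example, something along these lines should work with buildx (the Dockerfile path and image tag are illustrative):

# Cross-build the libbpf-based image for arm64 on top of the multiarch ubi9 base image.
docker buildx build --platform linux/arm64 -f build/Dockerfile -t ourrepo/kepler:latest-libbpf-arm64 --push .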

rootfs avatar Dec 05 '23 16:12 rootfs

@rootfs Thank you for your response. Unfortunately we weren't able to build Kepler using the base image from @vimalk78, either on x86 or on ARM.

We built Dockerfile.base successfully on x86; for ARM we simply had to replace the line:

RUN yum install -y cpuid

with this line, found in your Dockerfile.bcc.base.arm64:

RUN yum install -y python3 python3-pip && yum clean all -y && \
    pip3 install  --no-cache-dir archspec

We made this change because cpuid is not available on ARM.
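As far as we understand, archspec plays the role of cpuid here by detecting the CPU microarchitecture from Python; a quick check would look roughly like this (the expected output is our assumption, not verified):

# archspec ships a small CLI that prints the detected microarchitecture.
pip3 install --no-cache-dir archspec
archspec cpu    # we would expect something like: neoverse_n1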

Next, we built Dockerfile.libbpf.builder, which installs make, git, gcc, rpm-build, systemd and Go.

Finally, we tried to build the Dockerfile in the build/ folder. However, it fails during this command:

RUN make build SOURCE_GIT_TAG=$SOURCE_GIT_TAG BIN_TIMESTAMP=$BIN_TIMESTAMP

With the following error message:

[Makefile:191: _build_local] Error 2

We also tried building it from your image: quay.io/sustainable_computing_io/kepler_builder:ubi-9-libbpf-1.2.0 but we got the exact same error. Do note that Go wasn’t installed on this image and we had to install it.

We also found that it builds successfully when using another of your images, quay.io/sustainable_computing_io/kepler_builder:ubi-9-libbpf-1.2.0-go1.18. Consequently, do you know what steps we should take to go from the base image to this builder image, so that we can build Kepler locally from scratch?

simonarys avatar Dec 06 '23 11:12 simonarys

We also tried building it from your image: quay.io/sustainable_computing_io/kepler_builder:ubi-9-libbpf-1.2.0 but we got the exact same error. Do note that Go wasn’t installed on this image and we had to install it.

$ podman run -it --rm  quay.io/sustainable_computing_io/kepler_builder:ubi-9-libbpf-1.2.0 sh
sh-5.1# go version
go version go1.20.10 linux/amd64

I can see golang in the builder image.

vimalk78 avatar Dec 06 '23 11:12 vimalk78

I have been able to build an aarch64 image for Kepler, but without CPUID, though I have not tested it.

vimalk78 avatar Dec 06 '23 12:12 vimalk78

Indeed, you're right. Go is installed and the error is the following:

go: cannot find GOROOT directory: /usr/local/go

Re-installing Go into the /usr/local/go folder therefore fixed that error; sorry for the confusion.

Since Go is already installed, we instead changed the GOROOT path from /usr/local/go to /lib/golang on line 11:

ENV GOPATH=/opt/app-root GO111MODULE=off GOROOT=/lib/golang

The path is now found when building the Dockerfile. However, we are facing a new issue:

41.67 github.com/sustainable-computing-io/kepler/pkg/manager
42.22 command-line-arguments
44.56 # command-line-arguments
44.56 /lib/golang/pkg/tool/linux_amd64/link: running clang failed: exit status 1
44.56 clang-16: error: no such file or directory: '/usr/lib/x86_64-linux-gnu/libbpf.a'
44.56 clang-16: error: no such file or directory: '/usr/lib/x86_64-linux-gnu/libbpf.a'
44.56 clang-16: error: no such file or directory: '/usr/lib/x86_64-linux-gnu/libbpf.a'
44.56 clang-16: error: no such file or directory: '/usr/lib/x86_64-linux-gnu/libbpf.a'
44.56 clang-16: error: no such file or directory: '/usr/lib/x86_64-linux-gnu/libbpf.a'
44.56 clang-16: error: no such file or directory: '/usr/lib/x86_64-linux-gnu/libbpf.a'
44.56 
44.77 make: *** [Makefile:191: _build_local] Error 1
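In case it helps the diagnosis, a rough way to check whether a static libbpf is present in the builder image and where it lives (the paths and package names are just what we would look for, not confirmed):

# Look for a static libbpf archive and list any installed libbpf packages.
find /usr -name 'libbpf.a' 2>/dev/null
rpm -qa | grep -i libbpf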

simonarys avatar Dec 06 '23 13:12 simonarys

GOROOT is already defined in the image

sh-5.1# go env | grep ROOT
GOROOT="/usr/lib/golang"

vimalk78 avatar Dec 06 '23 15:12 vimalk78

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Feb 04 '24 16:02 stale[bot]

@vimalk78, has this issue been fixed?

SamYuan1990 avatar Feb 12 '24 11:02 SamYuan1990