kepler icon indicating copy to clipboard operation
kepler copied to clipboard

chore(pkg/bpf): Replace libbpfgo with cilium/ebpf

Open dave-tucker opened this issue 1 year ago • 16 comments

cilium/ebpf is a pure Go eBPF package and is used in a number of popular cloud-native projects. The benefits to Kepler are:

  1. Bytecode is built using bpf2go and the C and Go structs are kept in sync automatically
  2. There is no requirement for Cgo anymore and therefore no requirement to have libbpf or libelf installed to compile and/or to be dynamically linked at runtime
  3. Simplified packaging as the correct bytecode is contained within the kepler binary

Overall I'm happy with this change, but there is only one thing that bugs me.

We have to check in the bytecode object files (e.g kepler.bpfel.o) or the Go tooling (go lint/go vet) complains about the missing files. I couldn't reliably get go generate ./... to work to compile these files in CI. This is something which should be relatively easy to fix in the Makefile/CI environment before we cut a release.

Depends-On: #1435 Fixes: #1424

dave-tucker avatar May 15 '24 15:05 dave-tucker

@rootfs @dave-tucker which release is this targeted for?

vimalk78 avatar May 16 '24 06:05 vimalk78

I don't have a target in mind. But I doubt it will be ready for review until Friday at the earliest.

dave-tucker avatar May 16 '24 08:05 dave-tucker

I thought I was close, but I think I found an issue in #1443 that causes failures when perf events can't be opened for some reason. I've cherry-picked that here to see if it makes CI green - it applies to the libbpfgo code to so it's another PR.

dave-tucker avatar May 16 '24 14:05 dave-tucker

🤖 SeineSailor

Here's a concise summary of the pull request changes:

Key Modifications:

  1. Dependency updates: The Kepler project has replaced libbpfgo with cilium/ebpf, eliminating Cgo, libbpf, and libelf dependencies.
  2. Bytecode object files: The project now uses bytecode object files, which need to be checked in.
  3. Internal changes: Some internal functions and structures have been removed or modified, which might affect dependent code.

Impact on Codebase:

  • The external interface and behavior of the code remain largely unchanged.
  • The removal of Cgo, libbpf, and libelf dependencies simplifies the project's dependencies.
  • The introduction of bytecode object files may require adjustments to the build process or testing.

Observations and Suggestions:

  • It's essential to review the internal changes and ensure that dependent code is updated accordingly to avoid potential breakages.
  • Consider adding documentation or comments to explain the reasoning behind the dependency updates and internal changes.
  • Verify that the updated .pre-commit-config.yaml file correctly excludes the bpf/include/ directory to prevent unnecessary checks.

Overall, this pull request refactors the Kepler project's dependencies and internal implementation, which may require careful review and testing to ensure a smooth transition.

github-actions[bot] avatar May 20 '24 14:05 github-actions[bot]

@dave-tucker we may want to look at this crash before merging. output of make build


❯ sudo ./_output/bin/kepler

I0522 09:33:18.758775 2437499 gpu.go:38] Trying to initialize GPU collector using dcgm
W0522 09:33:18.758911 2437499 gpu_dcgm.go:104] There is no DCGM daemon running in the host: libdcgm.so not Found
W0522 09:33:18.758976 2437499 gpu_dcgm.go:108] Could not start DCGM. Error: libdcgm.so not Found
I0522 09:33:18.758997 2437499 gpu.go:45] Error initializing dcgm: not able to connect to DCGM: libdcgm.so not Found
I0522 09:33:18.759008 2437499 gpu.go:38] Trying to initialize GPU collector using nvidia-nvml
I0522 09:33:18.759111 2437499 gpu.go:45] Error initializing nvidia-nvml: failed to init nvml. ERROR_LIBRARY_NOT_FOUND
I0522 09:33:18.759121 2437499 gpu.go:38] Trying to initialize GPU collector using dummy
I0522 09:33:18.759134 2437499 gpu.go:42] Using dummy to obtain gpu power
I0522 09:33:18.760886 2437499 exporter.go:85] Kepler running on version: v0.7.10-5-gf1097c0e2c498-dirty
I0522 09:33:18.760897 2437499 config.go:283] using gCgroup ID in the BPF program: true
I0522 09:33:18.760963 2437499 config.go:285] kernel version: 6.7
I0522 09:33:18.761045 2437499 config.go:310] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0522 09:33:18.761127 2437499 redfish.go:169] failed to get redfish credential file path
I0522 09:33:18.761632 2437499 acpi.go:71] Could not find any ACPI power meter path. Is it a VM?
I0522 09:33:18.763071 2437499 exporter.go:94] Number of CPUs: 8
I0522 09:33:19.257525 2437499 watcher.go:67] Using in cluster k8s config
I0522 09:33:19.257541 2437499 watcher.go:74] failed to get config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
I0522 09:33:19.257553 2437499 watcher.go:126] k8s APIserver watcher was not enabled
I0522 09:33:19.257642 2437499 prometheus_collector.go:96] Registered Container Prometheus metrics
I0522 09:33:19.257688 2437499 prometheus_collector.go:101] Registered VM Prometheus metrics
I0522 09:33:19.257724 2437499 prometheus_collector.go:105] Registered Node Prometheus metrics
I0522 09:33:19.257826 2437499 node_platform_energy.go:54] Failed to create Regressor/AbsPower Power Model to estimate Node Platform Power: open /var/lib/kepler/data/acpi_AbsPowerModel.json: no such fi
le or directory
I0522 09:33:19.257965 2437499 exporter.go:175] starting to listen on 0.0.0.0:8888
I0522 09:33:19.257980 2437499 exporter.go:181] Started Kepler in 497.120003ms
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x854f72]

goroutine 53 [running]:
github.com/sustainable-computing-io/kepler/pkg/collector/stats/types.(*UInt64StatCollection).AddDeltaStat(0x0, {0x18b8388, 0x7}, 0x0)
        kepler/pkg/collector/stats/types/types.go:108 +0x32
github.com/sustainable-computing-io/kepler/pkg/collector/resourceutilization/bpf.updateSWCounters(0x16a0640?, 0xc000190150, 0x1?)
        kepler/pkg/collector/resourceutilization/bpf/process_bpf_collector.go:40 +0xc5
github.com/sustainable-computing-io/kepler/pkg/collector/resourceutilization/bpf.UpdateProcessBPFMetrics({0x1b40fe8, 0xc000692f00}, 0xc0000736d0?)
        kepler/pkg/collector/resourceutilization/bpf/process_bpf_collector.go:129 +0xa39
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateProcessResourceUtilizationMetrics(0xc002bfe4c0, 0x0?)
        kepler/pkg/collector/metric_collector.go:191 +0x5d
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateResourceUtilizationMetrics(0xc002bfe4c0)
        kepler/pkg/collector/metric_collector.go:155 +0x54
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).Update(0xb2d05e00?)
        kepler/pkg/collector/metric_collector.go:106 +0x53
github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start.func1()
        kepler/pkg/manager/manager.go:74 +0x7b
created by github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start in goroutine 1
        kepler/pkg/manager/manager.go:66 +0x65

sthaha avatar May 21 '24 23:05 sthaha

I get the same using the compose file

❯ cd manifests/compose/dev
❯ docker compose up --build kepler-dev

kepler-dev-1  | I0521 23:32:42.275436 2436610 process_bpf_collector.go:88] process worker (pid=2202720, cgroup=1466240) has 0 task clock time 2344773 CPU cycles, 883246 instructions, 232766 cache miss
es, 68 page cache hits
kepler-dev-1  | panic: runtime error: invalid memory address or nil pointer dereference
kepler-dev-1  | [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x8788f2]
kepler-dev-1  |
kepler-dev-1  | goroutine 30 [running]:
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector/stats/types.(*UInt64StatCollection).AddDeltaStat(0x0, {0x18ee6d1, 0x7}, 0x0)
kepler-dev-1  |         /workspace/pkg/collector/stats/types/types.go:108 +0x32
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector/resourceutilization/bpf.updateSWCounters(0x16d05e0?, 0xc0003e01c0, 0x219c60?)
kepler-dev-1  |         /workspace/pkg/collector/resourceutilization/bpf/process_bpf_collector.go:40 +0xc5
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector/resourceutilization/bpf.UpdateProcessBPFMetrics({0x1b7aea8, 0xc0005977c0}, 0xc00048aed0?)
kepler-dev-1  |         /workspace/pkg/collector/resourceutilization/bpf/process_bpf_collector.go:129 +0xa39
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateProcessResourceUtilizationMetrics(0xc002bcc500, 0x0?)
kepler-dev-1  |         /workspace/pkg/collector/metric_collector.go:191 +0x5d
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateResourceUtilizationMetrics(0xc002bcc500)
kepler-dev-1  |         /workspace/pkg/collector/metric_collector.go:155 +0x54
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).Update(0xb2d05e00?)
kepler-dev-1  |         /workspace/pkg/collector/metric_collector.go:106 +0x53
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start.func1()
kepler-dev-1  |         /workspace/pkg/manager/manager.go:74 +0x7b
kepler-dev-1  | created by github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start in goroutine 1
kepler-dev-1  |         /workspace/pkg/manager/manager.go:66 +0x65
kepler-dev-1 exited with code 2

sthaha avatar May 21 '24 23:05 sthaha

@sthaha I just force pushed to fix a small bug in the cleanup logic (forgetting to clear the enabledHardwareCounters set). I know that segfault happens when there is a disagreement between what pkg/bpf declares as supported and what pkg/collector/stats uses to populate the map on the Stats struct, but that should have been taken care of in #1443.

I've tried locally on Fedora 40 with both running the binary and compose and can't reproduce. I've also tried on RHEL 9.4 and RHEL 9.2 and still can't reproduce. I'm also not seeing anything in the logs stating that any of the eBPF probes couldn't be loaded 🤔

TL;DR I'm a bit stuck on fixing this as I can't reproduce the issue.

dave-tucker avatar May 22 '24 14:05 dave-tucker

I tried with the latest, commit 6d8539b and runs fine.

sthaha avatar May 23 '24 00:05 sthaha

This is ready to go. I'm holding off doing a rebase though until #1481 has merged so marking this as a draft until then.

dave-tucker avatar Jun 04 '24 18:06 dave-tucker

Snyk checks are failing, but not due to changes in this PR: It seems the are complaining about license issues in yaml.v1

That comes in via:

$ go mod graph | grep gopkg.in/yaml.v1
howett.net/[email protected] gopkg.in/[email protected]

Which is used by:

$ go mod why howett.net/plist       
# howett.net/plist
github.com/sustainable-computing-io/kepler/cmd/validator
github.com/jaypipes/ghw
github.com/jaypipes/ghw/pkg/block
howett.net/plist

The other issues are either in:

  • Packages that aren't in our dependency tree: i.e gogs.io/gogs
  • Dependencies of our dependencies - most of which pre-date this PR

dave-tucker avatar Jun 10 '24 14:06 dave-tucker

Snyk checks are failing, but not due to changes in this PR: It seems the are complaining about license issues in yaml.v1

I can't see the details of the failure ... but, I wonder why the other PRs pass CI while this fails 🤔

sthaha avatar Jun 12 '24 07:06 sthaha

Snyk checks are failing, but not due to changes in this PR: It seems the are complaining about license issues in yaml.v1

I can't see the details of the failure ... but, I wonder why the other PRs pass CI while this fails 🤔

I think anything that touches go.mod causes it to rescan dependencies and then fail. See https://github.com/sustainable-computing-io/kepler/pull/1516

dave-tucker avatar Jun 12 '24 09:06 dave-tucker

@dave-tucker can you sync this PR with main?

marceloamaral avatar Jun 14 '24 08:06 marceloamaral

@dave-tucker can you sync this PR with main?

Done!

dave-tucker avatar Jun 14 '24 08:06 dave-tucker

LGTM

sthaha avatar Jun 18 '24 00:06 sthaha

Per 07/02 meeting, this PR has many usability improvements. @SamYuan1990 will check with s390 team for last check.

rootfs avatar Jul 02 '24 13:07 rootfs

@SamYuan1990 will check with s390 team for last check.

Hi @SamYuan1990, any update? We'd like to get this merged before we cut the 0.7.11 release later this week.

dave-tucker avatar Jul 08 '24 13:07 dave-tucker

I tested this one last time on my machine using the dev compose and pretty happy with the result.

image

sthaha avatar Jul 08 '24 23:07 sthaha

@SamYuan1990 I am going to merge this in the interest of making the release tomorrow and to avoid rebases. I have tested this on on x86_64 worked well. Please let us know if you find any issue and we can revert the commit

sthaha avatar Jul 09 '24 23:07 sthaha