kepler icon indicating copy to clipboard operation
kepler copied to clipboard

chore(bpfassets): Refactor to reduce the API exposed by bpfassets

Open dave-tucker opened this issue 9 months ago • 1 comments

Currently pkg/bpfassets contains many global variables that are accessed from many other packages. This makes it hard for both the bpfAssets package and other packages that depend on it to be effectively unit tested. To solve this, I've shrunk the API of be based on:

type Attacher interface {
	HardwareCountersEnabled() bool
	Detach()
	CollectProcesses() ([]ProcessBPFMetrics, error)
	CollectCPUFreq() (map[int32]uint64, error)
	GetEnabledBPFHWCounters() []string
	GetEnabledBPFSWCounters() []string
}

Where NewAttacher() will provide a new eBPF attacher and NewMockAttacher() is used for unit testing (dependency injection).

In un-picking access to globals it transpired that a HwCountersEnabled() was being accessed from many places in the codebase. It defaults to true. Once the eBPF probes have been loaded we mark it to false if we're unable to open perf events.

In order to keep this as consistent as possible, I've changed the program start ordering to load the probes first so the HwCountersEnabled bool can be propagated where it needs to go - and we can be sure it won't change after the fact.

Similarly, GetEnabledBPFHWCounters() and GetEnabledBPFSWCounter() were also changing after the probes were loaded, which is now also solved by loading eBPF earlier in the program startup.

The older BCC test has now been fixed to work with libbpf - and will hopefully run in CI.

The last commit in this series fixes test run control to allow you to: make VERBOSE=1 test to run unit tests, bpf test + benchmarks, or to selectively run one of the 3 component parts. e2e tests are excluded from that make target given they will fail without and e2e environment set up.

Requires: #1390 Updates: #1412

dave-tucker avatar May 08 '24 14:05 dave-tucker

The test failure is due to the code that looks up the libbpf artifact path. This has been an issue in my dev environment since day 1 😱 I'll open a separate PR to fix that.

dave-tucker avatar May 09 '24 13:05 dave-tucker

@dave-tucker this is great, can we have this soon so we can include it in the next release on Thu? Thanks

rootfs avatar May 13 '24 22:05 rootfs

Thanks for the review @vimalk78. I've addressed your comments. The only outstanding thing left to clarify is what the "correct" behaviour should be in the case that either of the probes loaded for tracking PageCacheHit cannot be loaded.

dave-tucker avatar May 15 '24 10:05 dave-tucker

in the case that either of the probes loaded for tracking PageCacheHit cannot be loaded.

since we return error in case kepler_sched_switch_trace fails to load, how about returning error in case kepler_write_page_trace or kepler_read_page_trace ?

vimalk78 avatar May 15 '24 11:05 vimalk78

@dave-tucker Thanks for the PR. It looks good to me. I ran the version against the upstream/main using the dev docker-compose today and got "interesting" results. The variance between 2 versions of kepler running (kepler:latest and this PR) is quite a bit.

We may want to investigate further why this is the case without blocking this PR. It may very well be pointing to some other underlying issue in kepler / bpf or even the (docker compose) setup.

image image

I have created a PR for the "enhanced" dashboard - https://github.com/sustainable-computing-io/kepler/pull/1429

sthaha avatar May 15 '24 12:05 sthaha