kepler
kepler copied to clipboard
chore(bpfassets): Refactor to reduce the API exposed by bpfassets
Currently pkg/bpfassets
contains many global variables that are accessed from many other packages.
This makes it hard for both the bpfAssets
package and other packages that depend on it to be effectively unit tested.
To solve this, I've shrunk the API of be based on:
type Attacher interface {
HardwareCountersEnabled() bool
Detach()
CollectProcesses() ([]ProcessBPFMetrics, error)
CollectCPUFreq() (map[int32]uint64, error)
GetEnabledBPFHWCounters() []string
GetEnabledBPFSWCounters() []string
}
Where NewAttacher()
will provide a new eBPF attacher and NewMockAttacher()
is used for unit testing (dependency injection).
In un-picking access to globals it transpired that a HwCountersEnabled()
was being accessed from many places in the codebase. It defaults to true
. Once the eBPF probes have been loaded we mark it to false
if we're unable to open perf events.
In order to keep this as consistent as possible, I've changed the program start ordering to load the probes first so the HwCountersEnabled
bool can be propagated where it needs to go - and we can be sure it won't change after the fact.
Similarly, GetEnabledBPFHWCounters()
and GetEnabledBPFSWCounter()
were also changing after the probes were loaded, which is now also solved by loading eBPF earlier in the program startup.
The older BCC test has now been fixed to work with libbpf - and will hopefully run in CI.
The last commit in this series fixes test run control to allow you to:
make VERBOSE=1 test
to run unit tests, bpf test + benchmarks, or to selectively run one of the 3 component parts.
e2e tests are excluded from that make target given they will fail without and e2e environment set up.
Requires: #1390 Updates: #1412
The test failure is due to the code that looks up the libbpf artifact path. This has been an issue in my dev environment since day 1 😱 I'll open a separate PR to fix that.
@dave-tucker this is great, can we have this soon so we can include it in the next release on Thu? Thanks
Thanks for the review @vimalk78. I've addressed your comments. The only outstanding thing left to clarify is what the "correct" behaviour should be in the case that either of the probes loaded for tracking PageCacheHit
cannot be loaded.
in the case that either of the probes loaded for tracking
PageCacheHit
cannot be loaded.
since we return error in case kepler_sched_switch_trace
fails to load, how about returning error in case kepler_write_page_trace
or kepler_read_page_trace
?
@dave-tucker Thanks for the PR. It looks good to me. I ran the version against the upstream/main
using the dev
docker-compose today and got "interesting" results. The variance between 2 versions of kepler running (kepler:latest and this PR) is quite a bit.
We may want to investigate further why this is the case without blocking this PR. It may very well be pointing to some other underlying issue in kepler / bpf or even the (docker compose) setup.
I have created a PR for the "enhanced" dashboard - https://github.com/sustainable-computing-io/kepler/pull/1429