ebpf `haveXYZ` feature probes are inconclusive

A Cilium contributor (@jrajahalme) tried running Cilium's BPF unit tests on an aarch64 kernel and received the following error:

--- FAIL: TestBPF (0.00s)
    bpf_test.go:149: new coll: map test_cilium_ipcache: map create: prealloc maps not supported (requires >= v4.6)
...
$ uname -a
Linux dev 5.4.0-124-generic #140-Ubuntu SMP Thu Aug 4 02:27:01 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

This failure is not expected, but most haveXYZ() probes like haveNoPreallocMaps() simply do if err != nil { return internal.ErrNotSupported }, which masks any unexpected conditions like ENOMEM or EPERM.

While this may partly be by design, I suggest we make these probes as conclusive as possible (e.g. only EINVAL -> ErrNotSupported) and return any other unexpected errors verbatim to ease debugging.

Dec 16 '22 10:12 ti-mo

Agreed. At the same time we could think about moving the checks into the syscall error path. Basically if err != nil && mapFlagPrealloc && !havePrealloc { return havePrealloc() }

Dec 17 '22 10:12 lmb

I'd like to help with this too :)

May 02 '23 10:05 kwakubiney

Cool! There are two tasks here:

Audit callers of internal.NewFeatureTest. Find ones where we do if err != nil { return ErrNotSupported }. Run the questionable feature tests on your machine. What error do they return? Then, trace the code path taken in the kernel. Is there a way to ensure we always hit a certain error? It's probably most useful to look at the commit which introduced a kernel feature. The goal is to replace the err != nil condition with something like errors.Is(err, EBADF) or so. Note that errors.Is(err, EINVAL) is basically the same as checking err != nil, since EINVAL is such a common error that it's hard to infer anything.
Audit callers of haveXZY() in the code base. We sometimes have code paths like if haveXYZ() != nil { return err }. This pattern is problematic, since a bug in the feature detection (for example, a new kernel changes behaviour) turns into a hard user error. Instead, it's better to attempt the syscall. If it fails, run a bunch of feature tests to determine the most likely culprit. We also have cases where we do something like if haveXYZ() != nil { modify behaviour to avoid error }. These don't need changing.

I think that (2) is the easier, more rewarding task.

May 05 '23 08:05 lmb

Alright looking at this now, I want to clarify a few things:

Looks like the ultimate goal is to return proper errors to the client? For example, the user could be hitting an EPERM and does not necessarily mean the client's kernel isn't supported. In that case, what benefit does looking for specific errors in the linux codepath give us?

Because per the example above, we are going to do something like: if errors.Is(err, EBADF){ return err } for every possible client error seen after tracing in the kernel? Failing to see the difference between that & returning any error that isn't EINVAL,

Looking through the source, for example, I see a feature test which performs a sys.BPF system call, and moving through the linux source, I see possible client errors like EPERM, EFAULT which could occur during that syscall. These look like resource errors and even if handled by if errors.Is(err, xxx), eventually we basically are planning to just return them, aren't we? @lmb

May 24 '23 22:05 kwakubiney

Looks like the ultimate goal is to return proper errors to the client?

The goal of (1) is to make feature detection more reliable. The outcome of a feature test can be either conclusive (supported / not supported) or inconclusive. By doing err != nil { return ErrNotSupported } we say that there is never an inconclusive result. Most of the time this is incorrect: our test could trigger errors for a variety of reasons. If we can replace err != nil with error.Is(...) { return ErrNotSupported } else if err != nil { return err } we can discern between conclusive and inconclusive errors.

May 30 '23 12:05 lmb

Looks like the ultimate goal is to return proper errors to the client?

The goal of (1) is to make feature detection more reliable. The outcome of a feature test can be either conclusive (supported / not supported) or inconclusive. By doing err != nil { return ErrNotSupported } we say that there is never an inconclusive result. Most of the time this is incorrect: our test could trigger errors for a variety of reasons. If we can replace err != nil with error.Is(...) { return ErrNotSupported } else if err != nil { return err } we can discern between conclusive and inconclusive errors.

Thanks for the explanation, that's how I handled it actually, currently looking at (2)

May 30 '23 14:05 kwakubiney