`haveXYZ` feature probes are inconclusive
A Cilium contributor (@jrajahalme) tried running Cilium's BPF unit tests on an aarch64 kernel and received the following error:
--- FAIL: TestBPF (0.00s)
bpf_test.go:149: new coll: map test_cilium_ipcache: map create: prealloc maps not supported (requires >= v4.6)
...
$ uname -a
Linux dev 5.4.0-124-generic #140-Ubuntu SMP Thu Aug 4 02:27:01 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux
This failure is not expected, but most haveXYZ() probes like haveNoPreallocMaps() simply do if err != nil { return internal.ErrNotSupported }, which masks any unexpected conditions like ENOMEM or EPERM.
While this may partly be by design, I suggest we make these probes as conclusive as possible (e.g. only EINVAL -> ErrNotSupported) and return any other unexpected errors verbatim to ease debugging.
Agreed. At the same time we could think about moving the checks into the syscall error path. Basically if err != nil && mapFlagPrealloc && !havePrealloc { return havePrealloc() }
I'd like to help with this too :)
Cool! There are two tasks here:
- Audit callers of
internal.NewFeatureTest. Find ones where we doif err != nil { return ErrNotSupported }. Run the questionable feature tests on your machine. What error do they return? Then, trace the code path taken in the kernel. Is there a way to ensure we always hit a certain error? It's probably most useful to look at the commit which introduced a kernel feature. The goal is to replace theerr != nilcondition with something likeerrors.Is(err, EBADF)or so. Note thaterrors.Is(err, EINVAL)is basically the same as checkingerr != nil, since EINVAL is such a common error that it's hard to infer anything. - Audit callers of
haveXZY()in the code base. We sometimes have code paths likeif haveXYZ() != nil { return err }. This pattern is problematic, since a bug in the feature detection (for example, a new kernel changes behaviour) turns into a hard user error. Instead, it's better to attempt the syscall. If it fails, run a bunch of feature tests to determine the most likely culprit. We also have cases where we do something likeif haveXYZ() != nil { modify behaviour to avoid error }. These don't need changing.
I think that (2) is the easier, more rewarding task.
Alright looking at this now, I want to clarify a few things:
- Looks like the ultimate goal is to return proper errors to the client? For example, the user could be hitting an
EPERMand does not necessarily mean the client's kernel isn't supported. In that case, what benefit does looking for specific errors in the linux codepath give us?
Because per the example above, we are going to do something like:
if errors.Is(err, EBADF){ return err } for every possible client error seen after tracing in the kernel? Failing to see the difference between that & returning any error that isn't EINVAL,
Looking through the source, for example, I see a feature test which performs a sys.BPF system call, and moving through the linux source, I see possible client errors like EPERM, EFAULT which could occur during that syscall. These look like resource errors and even if handled by if errors.Is(err, xxx), eventually we basically are planning to just return them, aren't we? @lmb
Looks like the ultimate goal is to return proper errors to the client?
The goal of (1) is to make feature detection more reliable. The outcome of a feature test can be either conclusive (supported / not supported) or inconclusive. By doing err != nil { return ErrNotSupported } we say that there is never an inconclusive result. Most of the time this is incorrect: our test could trigger errors for a variety of reasons. If we can replace err != nil with error.Is(...) { return ErrNotSupported } else if err != nil { return err } we can discern between conclusive and inconclusive errors.
Looks like the ultimate goal is to return proper errors to the client?
The goal of (1) is to make feature detection more reliable. The outcome of a feature test can be either conclusive (supported / not supported) or inconclusive. By doing
err != nil { return ErrNotSupported }we say that there is never an inconclusive result. Most of the time this is incorrect: our test could trigger errors for a variety of reasons. If we can replaceerr != nilwitherror.Is(...) { return ErrNotSupported } else if err != nil { return err }we can discern between conclusive and inconclusive errors.
Thanks for the explanation, that's how I handled it actually, currently looking at (2)