aya
aya copied to clipboard
aya spends an unexpected amount of time doing fs operations in userspace `PerCpuArray::try_from`
I was perf debugging my program and came across the following:
On the left, you can see that the majority of PerCpuArray::set
's time is spent in nr_cpus
, which I believe is due to this call in PerCpuArray::try_from
:
https://github.com/aya-rs/aya/blob/f75d968657d8cf66afe744591c350cb7f52ef83e/aya/src/maps/mod.rs#L491
which then eventually reads from the FS:
https://github.com/aya-rs/aya/blob/c5a10f8fbe1c2b27651f9b11d077399612b8318f/aya/src/util.rs#L43
Since, AFAIK, this is the only way to update a PerCpuArray
(i.e., there's no way to skip the bounds check and therefore nr_cpu
s call), doing a FS operation every single time you update your array seems like a fairly significant performance issue.
Nice catch!
I just looked at the relevant kernel code, and here's where the input percpu values are copied https://github.com/torvalds/linux/blob/a52171ae7b803f4587b8172d1768313b4d093d0a/kernel/bpf/syscall.c#L1141.
The kernel will return -EFAULT
if the size of the values doesn't match the expected size. I think we should:
- deprecate
PerCpuValues::try_from
- add
PerCpuValues::new(Vec<T>)
(not::from(Vec<T>)
because I think we might still want to add checks for alignment etc in the future, andfrom
shouldn't fail) - make sure
-EFAULT
is propagated correctly and you get aMapError::SyscallError
if you provide values of the wrong size - Update the documentation to explain what happens if you provide values of the wrong size
PerCpuValues::alloc_kernel_mem
has the same issue. I'm just going to refactor the whole thing I think :) - unless you want to take this on, which also works of course!
The try_from
→new
refactor seems like something I might be able to do, but the alloc_kernel_mem
refactor seems a bit more daunting. :sweat_smile:
I gave it a quick shot, but it seems like figuring out what to do with the read half (PerCpuArray::get
→ bpf_map_lookup_elem_per_cpu
→ PerCpuValues::alloc_kernel_mem
) is the hard bit.
Should we make the caller of PerCpuArray::get
specify the number of CPUs, or is it possible to infer it somehow from BPF_MAP_LOOKUP_ELEM
?
(And if one tries to go the former route, then things like PerCpuArray::iter
would also need to be updated to specify the number of CPUs, which seems like an uninuitive interface. And I think implementing IterableMap
is not possible?)
Ok I had another look, and from https://www.kernel.org/doc/Documentation/core-api/cpu_hotplug.rst it looks like the value returned from possible_cpus()
and therefore nr_cpus()
doesn't change once booted.
So now I think we should leave everything as is, and just cache the value returned by possible_cpus()
using lazy-static. How does that sound?
Then it also looks like PerfCpuValues
could be refactored to be a proper collection - instead of being a wrapper around Vec<T>
- so that it could store items directly at 8-byte aligned addresses and skip a copy on get/insert. That could be done separately tho.
That does indeed sound much better, thanks for looking into whether it's safely cachable or not.
I tried out the lazy-static idea here, https://github.com/djahandarie/aya/commit/b171057ec050db03e1e57b5c4cd6e4bcb098c89e. It requires a little bit of ugliness due to io::Error
not being Clone-able, but it does at least have the advantage that it maintains the current API. Feel free to work off of that or provide an idea to refactor the ugliness away, would be happy to try.
I don't think I can do the second idea without more guidance, but it sounds great.
I tried out the lazy-static idea here, djahandarie@b171057. It requires a little bit of ugliness due to
io::Error
not being Clone-able, but it does at least have the advantage that it maintains the current API. Feel free to work off of that or provide an idea to refactor the ugliness away, would be happy to try.
I swear this io::Error
thing comes up once a week across all my rust projects 😂 I'm fine with it, these are utility functions anyway I don't see anyone ever implementing fine grained error handling for them.
The patch looks good, but I think we should cache both possible_cpus()
and nr_cpus()
?
I don't think I can do the second idea without more guidance, but it sounds great.
I think the goal would be to make it possible to implement your use case - retrieve a large number of items - allocating only one instance of PerCpuValues
and repeatedly fill it with get()
calls. Right now each get()
allocates + copies which isn't ideal.
How exactly it should work I'm not sure, I guess we'll have to experiment a bit and see what works. I'm happy to do this if it feels like too much work!