aya aya spends an unexpected amount of time doing fs operations in userspace `PerCpuArray::try

I was perf debugging my program and came across the following:

aya-flamegraph

On the left, you can see that the majority of PerCpuArray::set's time is spent in nr_cpus, which I believe is due to this call in PerCpuArray::try_from: https://github.com/aya-rs/aya/blob/f75d968657d8cf66afe744591c350cb7f52ef83e/aya/src/maps/mod.rs#L491

which then eventually reads from the FS:

https://github.com/aya-rs/aya/blob/c5a10f8fbe1c2b27651f9b11d077399612b8318f/aya/src/util.rs#L43

Since, AFAIK, this is the only way to update a PerCpuArray (i.e., there's no way to skip the bounds check and therefore nr_cpus call), doing a FS operation every single time you update your array seems like a fairly significant performance issue.

Apr 01 '22 22:04 djahandarie

Nice catch!

I just looked at the relevant kernel code, and here's where the input percpu values are copied https://github.com/torvalds/linux/blob/a52171ae7b803f4587b8172d1768313b4d093d0a/kernel/bpf/syscall.c#L1141.

The kernel will return -EFAULT if the size of the values doesn't match the expected size. I think we should:

deprecate PerCpuValues::try_from
add PerCpuValues::new(Vec<T>) (not ::from(Vec<T>) because I think we might still want to add checks for alignment etc in the future, and from shouldn't fail)
make sure -EFAULT is propagated correctly and you get a MapError::SyscallError if you provide values of the wrong size
Update the documentation to explain what happens if you provide values of the wrong size

Apr 02 '22 00:04 alessandrod

PerCpuValues::alloc_kernel_mem has the same issue. I'm just going to refactor the whole thing I think :) - unless you want to take this on, which also works of course!

Apr 02 '22 03:04 alessandrod

The try_from→new refactor seems like something I might be able to do, but the alloc_kernel_mem refactor seems a bit more daunting. :sweat_smile:

Apr 03 '22 04:04 djahandarie

I gave it a quick shot, but it seems like figuring out what to do with the read half (PerCpuArray::get → bpf_map_lookup_elem_per_cpu → PerCpuValues::alloc_kernel_mem) is the hard bit.

Should we make the caller of PerCpuArray::get specify the number of CPUs, or is it possible to infer it somehow from BPF_MAP_LOOKUP_ELEM?

(And if one tries to go the former route, then things like PerCpuArray::iter would also need to be updated to specify the number of CPUs, which seems like an uninuitive interface. And I think implementing IterableMap is not possible?)

Apr 04 '22 18:04 djahandarie

Ok I had another look, and from https://www.kernel.org/doc/Documentation/core-api/cpu_hotplug.rst it looks like the value returned from possible_cpus() and therefore nr_cpus() doesn't change once booted.

So now I think we should leave everything as is, and just cache the value returned by possible_cpus() using lazy-static. How does that sound?

Then it also looks like PerfCpuValues could be refactored to be a proper collection - instead of being a wrapper around Vec<T> - so that it could store items directly at 8-byte aligned addresses and skip a copy on get/insert. That could be done separately tho.

Apr 05 '22 00:04 alessandrod

That does indeed sound much better, thanks for looking into whether it's safely cachable or not.

I tried out the lazy-static idea here, https://github.com/djahandarie/aya/commit/b171057ec050db03e1e57b5c4cd6e4bcb098c89e. It requires a little bit of ugliness due to io::Error not being Clone-able, but it does at least have the advantage that it maintains the current API. Feel free to work off of that or provide an idea to refactor the ugliness away, would be happy to try.

I don't think I can do the second idea without more guidance, but it sounds great.

Apr 05 '22 04:04 djahandarie

I tried out the lazy-static idea here, djahandarie@b171057. It requires a little bit of ugliness due to io::Error not being Clone-able, but it does at least have the advantage that it maintains the current API. Feel free to work off of that or provide an idea to refactor the ugliness away, would be happy to try.

I swear this io::Error thing comes up once a week across all my rust projects 😂 I'm fine with it, these are utility functions anyway I don't see anyone ever implementing fine grained error handling for them.

The patch looks good, but I think we should cache both possible_cpus() and nr_cpus()?

I don't think I can do the second idea without more guidance, but it sounds great.

I think the goal would be to make it possible to implement your use case - retrieve a large number of items - allocating only one instance of PerCpuValues and repeatedly fill it with get() calls. Right now each get() allocates + copies which isn't ideal.

How exactly it should work I'm not sure, I guess we'll have to experiment a bit and see what works. I'm happy to do this if it feels like too much work!

Apr 05 '22 06:04 alessandrod

aya
aya copied to clipboard

aya spends an unexpected amount of time doing fs operations in userspace `PerCpuArray::try_from`

aya aya copied to clipboard

aya spends an unexpected amount of time doing fs operations in userspace `PerCpuArray::try_from`

aya
aya copied to clipboard