reduce the number of open() syscalls failing with ENOENT on nonexistent caches in sysfs

Open bgoglin opened this issue 4 years ago • 3 comments

We currently try to open /sys/devices/system/cpu/cpuX/cache/indexY/shared_cpu_map for every PU and Y between 0 and 9. That's usually 6 useless syscalls per PU since most CPUs have 4 caches per PU. That's almost 1ms per PU.
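For illustration, the probing pattern described above can be sketched roughly as follows. This is not the actual hwloc code: `count_missing_cache_opens()` and its parameters are hypothetical names, and the CPU directory is passed in so the sketch can be pointed at a real `/sys/devices/system/cpu/cpuX` or at a test tree.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not hwloc's real code): try to open
 * <cpu_dir>/cache/indexY/shared_cpu_map for every Y in [0, max_index).
 * Returns the number of open() calls that failed with ENOENT and stores
 * the number of successful opens in *found. */
static int count_missing_cache_opens(const char *cpu_dir, int max_index, int *found)
{
    int missing = 0;

    *found = 0;
    for (int y = 0; y < max_index; y++) {
        char path[512];
        int fd;

        snprintf(path, sizeof(path), "%s/cache/index%d/shared_cpu_map", cpu_dir, y);
        fd = open(path, O_RDONLY);
        if (fd < 0) {
            if (errno == ENOENT)
                missing++;
            /* No early break: a missing index does not prove that
             * higher indexes are missing too. */
            continue;
        }
        (*found)++;
        close(fd);
    }
    return missing;
}
```

With `max_index` set to 10 and 4 caches present, this would report 6 wasted opens per PU, matching the figure above.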

Linux numbers caches from 0 to N-1 internally, but some of them might get skipped when added to sysfs for some reason (see cache_add_dev() in drivers/base/cacheinfo.c). That means we have no easy way to break the loop when index4 is missing as usual.

Doing stat on the parent directory might be a good way to find out the total number of indexY subdirectories. That would mean one syscall to avoid 6 syscalls. However btrfs (for fsroot regression tests) has some issues with nlink being wrong (see comments in topology-linux.c).
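A minimal sketch of the stat() idea, assuming the usual Unix convention that a directory has 2 links plus one per subdirectory. The helper name `subdir_count_from_nlink()` is hypothetical, and the "fewer than 2 links means unreliable" check is an assumption about how filesystems such as btrfs misreport nlink, not something taken from topology-linux.c.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: guess how many subdirectories `dir` contains
 * from a single stat() call. Returns -1 when the directory cannot be
 * stat'ed or when st_nlink looks unusable, in which case the caller
 * would have to fall back to probing each indexY. */
static long subdir_count_from_nlink(const char *dir)
{
    struct stat st;

    if (stat(dir, &st) < 0 || !S_ISDIR(st.st_mode))
        return -1;
    /* Assumption: a conforming filesystem reports 2 + number of
     * subdirectories. Anything below 2 cannot follow that rule. */
    if (st.st_nlink < 2)
        return -1;
    return (long)st.st_nlink - 2;
}
```

The result would only be an upper bound for the probing loop if every subdirectory of cache/ is an indexY directory, which this sketch does not verify.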

Reducing to 5 instead of 9 is likely a good start for now. Most current CPUs have 4 caches in sysfs. There are some L4 out there but I have never seen those in sysfs since they are rather outside of the CPUs. Itanium had 5 caches (L2i and L2d) but it's dead. So 5 works fine and gives us one free slot in case newer CPUs bring an additional level.

bgoglin avatar Nov 18 '20 14:11 bgoglin

Perhaps using opendir() to get the actual list could be more efficient even if being an n+1th call? Even with a large directory that ends up with only one getdents64() system call.
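As a rough sketch of this suggestion (hypothetical helper name, not the code that ended up in hwloc), the cache directory can be listed once and the existing indexY numbers collected, so that only those are opened afterwards:

```c
#include <dirent.h>
#include <stdio.h>

/* Hypothetical helper: list `cache_dir` once and store the Y of every
 * entry named exactly "indexY" into indexes[] (at most `max` entries).
 * Returns the number of indexes found, or -1 if the directory cannot
 * be opened. readdir() order is unspecified, so the result is unsorted. */
static int list_cache_indexes(const char *cache_dir, int *indexes, int max)
{
    DIR *dir = opendir(cache_dir);
    struct dirent *ent;
    int n = 0;

    if (!dir)
        return -1;
    while (n < max && (ent = readdir(dir)) != NULL) {
        int y;
        char extra;

        /* Exactly one conversion means "index<number>" with nothing after it. */
        if (sscanf(ent->d_name, "index%d%c", &y, &extra) == 1 && y >= 0)
            indexes[n++] = y;
    }
    closedir(dir);
    return n;
}
```

Whether this is cheaper than a few failing opens depends on how many syscalls opendir(), readdir() and closedir() translate to in the C library, which the sketch does not measure.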

sthibaul avatar Nov 18 '20 14:11 sthibaul

The easiest solution would be to reduce the number of iterations, and using the opendir() function for efficient directory listing is a promising approach. It would lead to a reduction in unnecessary syscalls and enhance the performance of Open MPI's cache information retrieval process on Linux.

xWuWux avatar Oct 19 '23 21:10 xWuWux

I did a quick test. We actually get more syscalls using opendir. Instead of having one useless openat() for each of the 6 non-existing caches (those failing openat calls are likely very cheap), opendir+readdirs+closedir uses 7 syscalls (openat + newfstatat + 2 fcntl + 2 getdents + close). That's for each core.

If you want to play with it, the code is in PR #629. There will be a tarball at https://ci.inria.fr/hwloc/job/basic/job/PR-629/ soon.

bgoglin avatar Oct 20 '23 11:10 bgoglin