[Bug] - kernel backport request for 6.12: rqspinlocks for BPF subsystem
Describe the bug
It would be great if you could consider backporting the merged resilient queued spin lock patch series into your v6.12 stable kernel:
https://lore.kernel.org/all/[email protected]/
This fixes an issue in the BPF subsystem where hash table map insertions from a BPF program fail when concurrent updates to the same map occur from user space via the bpf(2) syscall.
The core of the issue is this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=20b6cc34ea74b6a84599c1f8a70f3315b56a1883
That fix was added to avoid a deadlock, in particular with tracing: if a hashtab is accessed in both non-NMI and NMI context, the system may deadlock on bucket->lock.
Therefore, it was fixed with per-CPU "map_locked" counters: "map_locked" rejects concurrent access to the same bucket from the same CPU. To reduce memory overhead, "map_locked" was not added per bucket; instead, 8 per-CPU counters were added to each hashtab, and buckets are assigned to these counters based on the lower bits of their hash. This commit was first added in the v5.11 kernel (v5.10 stable does not have it).
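For context, the scheme looks roughly like this in kernel/bpf/hashtab.c on affected kernels (an abbreviated sketch rather than a verbatim excerpt; unrelated fields omitted):

```c
/* Abbreviated sketch of the map_locked scheme in kernel/bpf/hashtab.c
 * (roughly v5.11 .. v6.14); unrelated fields omitted, not verbatim. */
#define HASHTAB_MAP_LOCK_COUNT 8
#define HASHTAB_MAP_LOCK_MASK (HASHTAB_MAP_LOCK_COUNT - 1)

struct bpf_htab {
	struct bpf_map map;
	struct bucket *buckets;
	/* ... */
	/* 8 per-CPU counters per hashtab; each bucket is assigned to the
	 * counter selected by the lower bits of its hash (see the snippet
	 * further below). */
	int __percpu *map_locked[HASHTAB_MAP_LOCK_COUNT];
	/* ... */
};
```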
Before the fix, both the user space (syscall) path and the BPF kernel side used a regular spinlock on bucket->lock, which does not abort: both sides simply spin until they get ownership of the hashtable bucket.
After the fix, when concurrent accesses hash to the same "map_locked" counter, the lock cannot be taken and the code bails out with -EBUSY, since whoever won the race has already bumped that counter:
```c
hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
migrate_disable();
if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
	__this_cpu_dec(*(htab->map_locked[hash]));
	migrate_enable();
	return -EBUSY;
}
```
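To illustrate how this surfaces, here is a minimal, hypothetical BPF-side sketch (libbpf-style C; the map name, tracepoint, and key scheme are made up for illustration). On affected kernels, the bpf_map_update_elem() helper below can fail with -EBUSY if it races with a user-space bpf(2) update of the same map on the same CPU:

```c
// Hypothetical sketch: a tracing program updating a hash map that user
// space also updates via bpf(2). Map name, tracepoint, and key scheme
// are illustrative only.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u32);
	__type(value, __u64);
} demo_hash SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_write")
int count_writes(void *ctx)
{
	__u32 key = bpf_get_smp_processor_id();
	__u64 one = 1;
	long err;

	/* On v5.11..v6.14 kernels this helper can return -EBUSY (-16) if a
	 * concurrent bpf(2) update on the same CPU has already bumped the
	 * map_locked slot that this bucket hashes to. */
	err = bpf_map_update_elem(&demo_hash, &key, &one, BPF_ANY);
	if (err)
		bpf_printk("hash map update failed: %ld", err);

	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```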
This code got completely reworked from the v6.15 kernel onwards through the conversion to rqspinlock: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/kernel/bpf/hashtab.c?h=linux-6.15.y&id=4fa8d68aa53e6d76f66f3ed21e06c52cf8912074 (full series: https://lore.kernel.org/all/[email protected]/). Resilient Queued Spin Locks handle the contention case internally, and thus the whole "map_locked" counter scheme was removed.
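For reference, after that conversion the bucket locking takes roughly the following shape (a simplified sketch based on the linked commit, not a verbatim copy; the rqspinlock_t type and the raw_res_spin_lock_irqsave()/raw_res_spin_unlock_irqrestore() helpers come from the series above):

```c
/* Simplified sketch of the post-conversion bucket locking in
 * kernel/bpf/hashtab.c (not verbatim). struct bucket embeds an
 * rqspinlock_t, and lock acquisition propagates the error that
 * rqspinlock reports on deadlock/timeout instead of consulting the
 * per-CPU map_locked counters. */
#include <asm-generic/rqspinlock.h>	/* header added by the rqspinlock series */

struct bucket {
	struct hlist_nulls_head head;
	rqspinlock_t raw_lock;
};

static inline int htab_lock_bucket(struct bucket *b, unsigned long *pflags)
{
	unsigned long flags;
	int ret;

	/* Fails with an error instead of spinning forever or silently
	 * rejecting the update the way map_locked did. */
	ret = raw_res_spin_lock_irqsave(&b->raw_lock, flags);
	if (ret)
		return ret;

	*pflags = flags;
	return 0;
}

static inline void htab_unlock_bucket(struct bucket *b, unsigned long flags)
{
	raw_res_spin_unlock_irqrestore(&b->raw_lock, flags);
}
```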
Related Cilium issue users are facing: https://github.com/cilium/cilium/issues/35010
It would be awesome if you could consider backporting the rqspinlock series for BPF so this can be fixed.
Cc @awsthk
Thanks for opening this issue, we will assess and see what we can best do here.