bpftrace icon indicating copy to clipboard operation
bpftrace copied to clipboard

RFC: Add remediation for BPF ABBA deadlock bug caused by hash map access

Open jordalgo opened this issue 1 year ago • 6 comments

There was an issue discovered recently via a bpftrace script whereby a kfunc was attached on the same path used to access BPF_MAP_TYPE_HASH which created an ABBA deadlock in the kernel and a subsequent crash e.g.

kfunc:queued_spin_lock_slowpath /@pagefault_start[tid]/
{
	@spinlock_start[tid] = nsecs;
}

@spinlock_start was being accessed by another bpf prog on a different CPU.

Example stack trace:

#0  bpf_trampoline_6442467238_1+0x74/0x1000
#1  queued_spin_lock_slowpath+0x5/0x200
#2  _raw_spin_lock_irqsave+0x4e/0x60
#3  htab_map_update_elem+0xc7/0x4b0
#4  bpf_prog_717c51531607df92_kretfunc_vmlinux_down_read+0x180/0x234
#5  bpf_trampoline_6442467155_1+0xa1/0x1000
#6  down_read+0x5/0x10
#7  do_user_addr_fault+0x294/0x740
#8  exc_page_fault+0x5e/0x110
#9  asm_exc_page_fault+0x1e/0x30

This issue is being addressed in a bpf kernel patch however because this is going into a later kernel release we should consider adding a temp fix for this in bpftrace.

One possible option, suggested by Alexei, is for a bpftrace script to create a per-cpu variable that is tested when a functional block is called. If it is already set, we exit early (and increment a missed counter?), if it is not yet set, set it, run the functional block, and clear it at exit time. We could do this conditionally if we detect that the prog is accessing a non per-cpu map type.

Another option would be to block use of certain kfuncs/kprobe if bpftrace can detect if these progs are accessing non per-cpu maps in a script with other progs accessing this map (which may end up being more code).

This issue is meant to be a place to discuss possible solutions and the priority of this fix.

jordalgo avatar Apr 29 '24 00:04 jordalgo

The things I'm not keen on in the proposed bpftrace mitigations are:

  1. We'll have no way to determine if the running kernel has the fixes applied or is vulnerable to this deadlock
  2. Therefore the bpftrace mitigation will have to be applied to all scripts going forwards
  • The per-cpu variable method will result in a performance overhead for all scripts
  • Blocking certain probes will mean a loss in functionality

ajor avatar May 01 '24 18:05 ajor

Yeah, agreed. But if there was a way to feature probe, then at least we could delete the mitigation later.

danobi avatar May 01 '24 21:05 danobi

It is worth keeping in mind that the bpf lock change being made upstream does not prevent ABBA deadlocks, but only backs out of them after the code fails to acquire a lock.

It remains to be seen whether the approach of placing the entire bpftrace script in the same recursion prevention domain will be faster or slower than relying on the deadlock detection code in the new bpf lock code.

rikvanriel avatar May 31 '24 14:05 rikvanriel

One possible option, suggested by Alexei, is for a bpftrace script to create a per-cpu variable that is tested when a functional block is called. If it is already set, we exit early (and increment a missed counter?), if it is not yet set, set it, run the functional block, and clear it at exit time. We could do this conditionally if we detect that the prog is accessing a non per-cpu map type.

Would this workaround prevent tracing of recursive functions except at the topmost level?

janet-campbell avatar Jun 01 '24 03:06 janet-campbell

Would this workaround prevent tracing of recursive functions except at the topmost level?

No, it's only limited to a small subset of functions. Here is the PR: https://github.com/bpftrace/bpftrace/pull/3206

jordalgo avatar Jun 01 '24 18:06 jordalgo

It is worth keeping in mind that the bpf lock change being made upstream does not prevent ABBA deadlocks, but only backs out of them after the code fails to acquire a lock.

Weren't the deadlocks caused by attaching a probe to a locking function (queued_spin_lock_slowpath) which was itself called from a BPF program? If BPF maps are changed to use their own non-traceable locking primitives, then the kernel's regular locking functions won't get called by BPF programs and there should be no risk of deadlocks when tracing them, right?

ajor avatar Jun 06 '24 14:06 ajor

Fixed here: https://github.com/bpftrace/bpftrace/pull/3206

jordalgo avatar Jul 30 '24 01:07 jordalgo