Make per-cpu scratch buffers preemption-safe
#412 turns out to be buggy because per-cpu maps on kprobe aren't safe: https://lore.kernel.org/bpf/CAMy7=ZWPc279vnKK6L1fssp5h7cb6cqS9_EuMNbfVBg_ixmTrQ@mail.gmail.com/T/
@dylandreimerink suggested the following:
An example of where this could go wrong for kprobes would be if you probe a syscall in an environment where eBPF can be preempted:
1. Process 1 calls the open syscall
2. Kprobe on open syscall runs
3. Preemption happens
4. Process 2 is scheduled on the same CPU
5. Process 2 calls the open syscall
6. Kprobe on open syscall runs
Now we have two instances of your eBPF program running concurrently on the same CPU, and thus writing to the same per-CPU element of the same key.

The suggested fix (according to the mailing list thread) is to do something like:
```c
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(key_size, sizeof(u32));
	__uint(value_size, SCRATCH_SIZE);
	__uint(max_entries, 3); // not 1 like we usually do
} buf SEC(".maps");

volatile u32 call_counter = 0;

SEC("kprobe/whatever")
int my_program(struct pt_regs *ctx)
{
	u32 index = __sync_fetch_and_add(&call_counter, 1);
	void *scratch_buf = bpf_map_lookup_elem(&buf, &index);
	if (!scratch_buf)
		goto exit; // no more scratch buffers left

	// Do work with the scratch buffer, and goto exit on error.

exit:
	__sync_fetch_and_sub(&call_counter, 1);
	return 0;
}
```
He then pointed out that the global atomic counter is itself buggy:
> It has to be an atomic per core for that to work
> per-cpu maps on kprobe aren't safe
Just adding some detail since I don't see it mentioned.
This isn't just limited to kprobes; it affects all BPF programs where multiple "instances" may run concurrently on the same CPU. As the mailing list mentions, since effectively v5.11 an instance of an eBPF program is pinned to the same CPU instead of having preemption disabled, so multiple instances can be interleaved.
Exactly when that happens is tricky, since it depends on the preemption model the kernel is compiled with, the program type, and where exactly the program is attached.
If the kernel is compiled with only voluntary kernel preemption, then an eBPF program is essentially only interrupted by a CPU interrupt. So it is only an issue if your program happens to attach to code that can be called from an interrupt context. On kernels with full kernel preemption, the kernel can preempt itself in the middle of an eBPF program.
It's not an issue for XDP or TC programs, for example, since the network stack has one worker per CPU to process packets. For kprobes/LSM hooks/fprobes attached to network logic that executes in such a context, it does not make that much of a difference. But in most other places, where the context is a syscall or some other kernel worker/thread that may have multiple instances per CPU, it's a danger.
The above suggestion is based on discussion in the exact same mailing list thread.
…interrupted the first one. But even then, I will need a way to know whether my program is currently interrupting the run of another program - is there a way to do that?
Maybe a per-CPU atomic counter to see if the BPF prog has been re-entered on the same CPU.
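A minimal sketch of that idea (the map name, program name, and attach point below are illustrative, not from the thread): a one-element per-CPU array used as an atomic counter, incremented on entry and decremented on exit, where a non-zero old value on entry means the program has re-entered on this CPU.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, u32);
} reentry_ctr SEC(".maps"); // assumed name, not from the thread

SEC("kprobe/__x64_sys_openat") // hypothetical attach point
int detect_reentry(struct pt_regs *ctx)
{
	u32 zero = 0;
	u32 *ctr = bpf_map_lookup_elem(&reentry_ctr, &zero);
	if (!ctr)
		return 0;

	// Since ~v5.11 an instance stays on the CPU it started on, so a
	// non-zero old value means another instance on this CPU was
	// interrupted while still inside this program.
	if (__sync_fetch_and_add(ctr, 1) > 0)
		bpf_printk("re-entered on this CPU");

	/* ... actual probe logic ... */

	__sync_fetch_and_sub(ctr, 1);
	return 0;
}
```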
Since we always add new events to Tracee, this will also not be scalable. Yet, if there is no other solution, I believe we will go in that direction.
[1] https://github.com/aquasecurity/tracee/blob/main/pkg/ebpf/c/tracee.bpf.c
You're talking about `BPF_PERCPU_ARRAY(scratch_map, scratch_t, 1);`?
Instead of a scratch_map per program, use an atomic per-CPU counter for recursion. You'll have 3 levels in the worst case. So it will be: `BPF_PERCPU_ARRAY(scratch_map, scratch_t, 3);` On prog entry increment the recursion counter, on exit decrement. And use that particular scratch_t in the prog.
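In plain libbpf map-definition syntax, that suggestion could look roughly like the sketch below; the scratch size, attach point, and map names are assumptions, and `BPF_PERCPU_ARRAY` above is presumably a shorthand macro for the same kind of map definition.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

#define MAX_NESTING 3 // "3 levels in the worst case"

typedef struct {
	u8 data[4096]; // illustrative scratch size
} scratch_t;

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, MAX_NESTING);
	__type(key, u32);
	__type(value, scratch_t);
} scratch_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, u32);
} recursion_ctr SEC(".maps"); // assumed name

SEC("kprobe/__x64_sys_openat") // hypothetical attach point
int my_program(struct pt_regs *ctx)
{
	u32 zero = 0;
	u32 *level = bpf_map_lookup_elem(&recursion_ctr, &zero);
	if (!level)
		return 0;

	// On prog entry: increment. The old value selects this
	// instance's private scratch slot on this CPU.
	u32 idx = __sync_fetch_and_add(level, 1);
	if (idx >= MAX_NESTING)
		goto exit; // nested deeper than expected, bail out

	scratch_t *scratch = bpf_map_lookup_elem(&scratch_map, &idx);
	if (!scratch)
		goto exit;

	/* ... do work with this level's scratch buffer ... */

exit:
	// On prog exit: decrement.
	__sync_fetch_and_sub(level, 1);
	return 0;
}
```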
> It has to be an atomic per core for that to work
Yeah. Global variables are shared between all program instances, so you would need an array large enough to track concurrent instances per CPU. It could simply be a `[256]int` or even a `RESIZABLE_ARRAY` if you want to get fancy. Then use the CPU ID to index into it.
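A rough sketch of that variant (the 256-slot cap, names, and attach point are assumptions): a plain global array with one nesting counter per possible CPU, indexed via bpf_get_smp_processor_id().

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

#define MAX_CPUS 256 // assumed upper bound on possible CPUs

// Global variables live in one map shared by all instances, so give
// every CPU its own counter slot rather than sharing a single one.
u32 nesting[MAX_CPUS];

SEC("kprobe/__x64_sys_openat") // hypothetical attach point
int my_program(struct pt_regs *ctx)
{
	u32 cpu = bpf_get_smp_processor_id();
	if (cpu >= MAX_CPUS) // bounds check, also keeps the verifier happy
		return 0;

	// The instance stays on this CPU for its whole run (post-v5.11),
	// so the same slot is decremented on exit. The old value returned
	// by the add would pick this instance's scratch slot, as in the
	// per-CPU-map variant above.
	__sync_fetch_and_add(&nesting[cpu], 1);
	/* ... do work ... */
	__sync_fetch_and_sub(&nesting[cpu], 1);
	return 0;
}
```

The trade-off versus the per-CPU map is that the array has to be sized for the largest machine you expect, but it avoids an extra map lookup on entry.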