bcc

EOPNOTSUPP error at bpf_perf_event_output function

Open geonheec opened this issue 5 years ago • 5 comments

I'm trying to write a C-based eBPF program.

I attached a kprobe to the bio_endio function, then tried to pass my structure to user space. (I followed the samples/bpf/trace_output example.)

But bpf_perf_event_output returns -EOPNOTSUPP.

bpf_perf_event_output(ctx, &result_map, 0, &result, sizeof(result));

The result structure consists of five u64 fields, five u32 fields, and one char[16].

struct bpf_map_def SEC("maps") result_map = {
	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(u32),
	.max_entries = 2,
};

If you know the solution, please help me.

geonheec avatar Apr 03 '20 05:04 geonheec

The probable cause is this check in the Linux kernel code:

static __always_inline u64
__bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
			u64 flags, struct perf_sample_data *sd)
{
	.........
	
	if (unlikely(event->oncpu != cpu))
		return -EOPNOTSUPP;

	return perf_event_output(event, sd, regs);
}

As per the code, this case (event->oncpu != cpu) is marked as unlikely.

Can you please post a snippet to reproduce this? Also, please post your kernel version.


Alternatively, as per kernel commit a43eec304:

User space needs to perf_event_open() it (either for one or all cpus) and store FD into perf_event_array (similar to bpf_perf_event_read() helper) before eBPF program can send data into it.

Do you call perf_event_open in user space? A missing call is the most likely cause.

How is it done?

In hello_perf_output.py, perf_submit ultimately calls bpf_perf_event_output. The perf buffer is opened and polled, i.e.:

b["events"].open_perf_buffer(print_event)
........
b.perf_buffer_poll()

devidasjadhav avatar Apr 03 '20 10:04 devidasjadhav

Thanks for the answer.

can you please post any snippet to reproduce this?

Can you refer to trace_output_kern.c and trace_output_user.c in linux/samples/bpf at kernel v5.2? (I'm using kernel v5.2.)

My code is almost the same as the example. Only the structure transferred to user space and the traced function differ. (The trace_output example traces sys_write; in my case, I trace bio_endio.)

This trace_output example has the same problem as my program, so you may be able to reproduce it with the example itself. From what I checked, bpf_perf_event_output in the trace_output example also returns EOPNOTSUPP (or sometimes ENOENT or ENOSPC; 0 is never returned).

geonheec avatar Apr 03 '20 11:04 geonheec

@geonheec I tried trace_output and got the following output:

$ sudo ./samples/bpf/trace_output 
recv 343066 events per sec
100018+0 records in
100017+0 records out
51208704 bytes (51 MB, 49 MiB) copied, 0.289763 s, 177 MB/s

A small clarification: in the sample code, the return value of bpf_perf_event_output is not checked.

Can you attach a diff or your C code, as I am unable to reproduce the problem? I just want to check whether it is related to a specific kernel version. I am on 5.5.13.

devidasjadhav avatar Apr 03 '20 14:04 devidasjadhav

trace_output_kern.c

(-) SEC("kprobe/sys_write")
(+) SEC("kprobe/bio_endio")

(+) int res;
(-) bpf_perf_event_output(ctx, &my_map, 0, &data, sizeof(data));
(+) res = bpf_perf_event_output(ctx, &my_map, 0, &data, sizeof(data));
(+) char msg[] = "res: %d\n";
(+) bpf_trace_printk(msg, sizeof(msg), res);

trace_output_user.c is not changed.

I ran that program and checked the res value with the following method.

sudo su
cd /sys/kernel/debug/tracing
cat trace_pipe

You should see "res: -95" in the terminal with the above method.

+) I checked the original trace_output example again, and it seems to work well. But when I change "sys_write" to "bio_endio", the return value becomes EOPNOTSUPP. Does bpf_perf_event_output not work in interrupt context...?

geonheec avatar Apr 03 '20 15:04 geonheec

I think the problem is this: when libbpf sets up perf event buffers for BPF perf event maps, it assumes each index in the map is for a specific CPU, i.e. index 0 is for CPU 0, index 1 for CPU 1, etc. This effectively means you can only use bpf_perf_event_output with BPF_F_CURRENT_CPU.
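To make that concrete, here is a small user-space model of the checks the helper performs before writing. It is only a sketch written from my reading of the kernel source, not the kernel code itself: the model_* names are made up for illustration, the two constants mirror BPF_F_INDEX_MASK and BPF_F_CURRENT_CPU, and the event-type check the kernel also does is left out. The point is that the low 32 bits of flags select the map index, so flags = 0 always selects index 0, whose event is bound to CPU 0.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for this sketch; the values mirror BPF_F_INDEX_MASK
 * and BPF_F_CURRENT_CPU from the kernel UAPI header. */
#define MODEL_INDEX_MASK  0xffffffffULL
#define MODEL_CURRENT_CPU MODEL_INDEX_MASK

/* One slot of the perf event array, as user space filled it. */
struct model_slot {
	int present; /* non-zero if a perf event FD was stored at this index */
	int oncpu;   /* CPU the stored perf event was opened on */
};

/* Simplified model of the index selection and checks done before output.
 * Returns 0 when the write would proceed, or a negative errno value. */
static int model_perf_event_output(const struct model_slot *slots,
				   uint64_t max_entries, int running_cpu,
				   uint64_t flags)
{
	uint64_t index = flags & MODEL_INDEX_MASK;

	assert(slots != NULL);
	assert(running_cpu >= 0);

	/* BPF_F_CURRENT_CPU means "use the slot of the CPU I am running on". */
	if (index == MODEL_CURRENT_CPU)
		index = (uint64_t)running_cpu;
	if (index >= max_entries)
		return -E2BIG;
	/* Nothing stored at this index by user space. */
	if (!slots[index].present)
		return -ENOENT;
	/* The check quoted earlier in this thread: event->oncpu != cpu. */
	if (slots[index].oncpu != running_cpu)
		return -EOPNOTSUPP;
	return 0;
}
```

With slots for CPU 0 and CPU 1 each bound to their own CPU, a call from CPU 1 with flags 0 gives -EOPNOTSUPP in this model, while the same call with MODEL_CURRENT_CPU gives 0. For the program in this issue, that would correspond to passing BPF_F_CURRENT_CPU instead of 0 as the third argument of bpf_perf_event_output. Note also that with max_entries = 2, a CPU number of 2 or higher would fall outside the map in this model.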

yshui avatar Jun 22 '24 17:06 yshui