ebpf icon indicating copy to clipboard operation
ebpf copied to clipboard

link: add maxactive for kretprobe

Open alahaiyo opened this issue 2 years ago • 13 comments

In some case we want to missed as little as possible and it is helpful to have this configuration option.

alahaiyo avatar Aug 03 '22 04:08 alahaiyo

Thank you for your PR!

Can you explain your use case for maxactive and why implementing your own sampling in BPF isn't sufficient?

As it stands, the PR is too special cased for me:

  • doesn't work with pmu probes
  • only supported for retprobes (?)

Kprobes have a large API, and supporting all of it is not possible. So we need a very compelling reason to add something like your proposed change.

If the traced function can sleep, then when the number of processes exceeds the number of CPUs, the kretprobe will not be triggered. Because the kernel defaults maxactive to NR_CPU. When we trace some kernel network functions (such as inet_accept), when the network connection pressure is very high, there will be a lot of kretprobe not triggered.

When writing kprobe_event through tracefs, you can set maxactive to a maximum of 2048, which is enough. The implementation of pmu probe in the kernel will call create_local_trace_kprobe, in which maxactive is hardcoded to 0.

alahaiyo avatar Aug 03 '22 13:08 alahaiyo

Thanks for the background, that's interesting.

So what is the behaviour of a pmu retprobe in the same scenario? It drops retprobe calls, without being able to work around the limitation? Asked another way, what does it mean that maxactive is 0 in the pmu case?

What problem does maxactive guard against in the first place? Maybe we could just hardcode it to 2048 in the tracefs case?

Cc @mmat11

lmb avatar Aug 04 '22 10:08 lmb

It drops retprobe calls, without being able to work around the limitation? Asked another way, what does it mean that maxactive is 0 in the pmu case?

I believe this is the case

What problem does maxactive guard against in the first place?

I found an explanation here https://github.com/iovisor/bcc/issues/1072#issue-217189257

Maybe we could just hardcode it to 2048 in the tracefs case?

From this commit https://github.com/torvalds/linux/commit/696ced4fb1d76802f864d8848aa4716633f83c17 it seems the maximum is 4096. I guess there's some overhead (see https://elixir.bootlin.com/linux/v5.19/source/kernel/kprobes.c#L2202) in having a big number, else it could have been hardcoded in the kernel itself (?).

I wonder why this parameter isn't configurable via perf_event_open. IMO we can support the feature but we should raise an error when it's not possible to do so (kernel < 4.12 or maxactive > 0 on non-return probes) and clearly document that setting this option will force the probe to be installed via tracefs

mmat11 avatar Aug 04 '22 13:08 mmat11

hi, @mmat11

I wonder why this parameter isn't configurable via perf_event_open.

Tow related commits(not merged):

Enable specifying maxactive for fd based kretprobe. This will be useful
for tracing tools like bcc and bpftrace. [1] discussed the need of this
in bpftrace. Use highest highest 12 bit (bit 52-63) to allow maximal
maxactive of 4095.

[1] https://github.com/iovisor/bpftrace/issues/835
From: Song Liu <[email protected]>

Enable specifying maxactive for fd based kretprobe. This will be useful
for tracing tools like bcc and bpftrace (see for example discussion [1]).
Use highest 4 bit (bit 59-63) to allow specifying maxactive by log2.

The original patch [2] seems to be fallen through the cracks and wasn't
applied. I've merely rebased the work done by Song Liu, verififed it
still works, and modified to allow specifying maxactive by log2 per
suggestion from the discussion thread.

Note that changes in rethook implementation may render maxactive
obsolete.

[1]: https://github.com/iovisor/bpftrace/issues/835
[2]: https://lore.kernel.org/all/[email protected]/

It seems to be discussing a more elegant way to implement.

IMO we can support the feature but we should raise an error when it's not possible to do so (kernel < 4.12 or maxactive > 0 on non-return probes) and clearly document that setting this option will force the probe to be installed via tracefs

Agree, I also expect this feature.

nashuiliang avatar Aug 05 '22 08:08 nashuiliang

Thank you for your reply!

It drops retprobe calls, without being able to work around the limitation? Asked another way, what does it mean that maxactive is 0 in the pmu case?

I believe this is the case

Agree

What problem does maxactive guard against in the first place?

I found an explanation here iovisor/bcc#1072 (comment)

Maybe we could just hardcode it to 2048 in the tracefs case?

From this commit torvalds/linux@696ced4 it seems the maximum is 4096.

Oh my mistake, the maximum is 4096.

I guess there's some overhead (see https://elixir.bootlin.com/linux/v5.19/source/kernel/kprobes.c#L2202) in having a big number, else it could have been hardcoded in the kernel itself (?).

I wonder why this parameter isn't configurable via perf_event_open.

As @nashuiliang said, the feature is under discussion now.

IMO we can support the feature but we should raise an error when it's not possible to do so (kernel < 4.12 or maxactive > 0 on non-return probes) and clearly document that setting this option will force the probe to be installed via tracefs

Yes, I agree. I will add this and commit again.

alahaiyo avatar Aug 08 '22 08:08 alahaiyo

@alahaiyo I'm not sure yet what the best approach is here, so there is a risk that your work might be wasted. I'll read through all the info here and try to reach an opinion today or tomorrow.

lmb avatar Aug 08 '22 09:08 lmb

@alahaiyo I'm not sure yet what the best approach is here, so there is a risk that your work might be wasted. I'll read through all the info here and try to reach an opinion today or tomorrow.

I'm really looking forward to your opinion.

alahaiyo avatar Aug 08 '22 11:08 alahaiyo

From the kernel side we have two separate APIs to create a kretprobe:

  1. via tracefs: supports setting maxactive and missed events are reported via /sys/kernel/debug/tracing/kprobe_profile.

  2. via perf_event_open: supports neither maxactive nor missed events. Peter Zijlstra wants to avoid adding maxactive to p_e_o if possible. A patch to expose missed events has gotten stuck, but the maintainer seems open to it.

On the user space side, maxactive is exposed in bcc and gobpf. bpftrace hardcodes maxactive to 0 and would like to print the number of missed events instead.

Let's consider the implications of the maxactive API as proposed in the PR:

  • On pre 4.12 kernels the user will receive an error from the kernel, since the feature is not supported. So users have to avoid setting maxactive on old kernels. This requires checking kernel version or similar, or feature probing from the library to avoid setting maxactive.
  • On >= 4.12 kernels, we will force use of tracefs if maxactive!=0 even if pmu is available. tracefs behaves poorly when the tracer crashes.
  • If maxactive becomes available on pmu after all, we have to do feature detection to fall back to tracefs for kernels that have pmu but don't have the maxactive API.
  • If maxactive never becomes available on pmu we are forever locked into tracefs for kretprobes, even if regular pmu kprobes don't require maxactive anymore.
  • It's completely unclear what users should set maxactive to, since it needs to account for all callers of a function on the whole system. So being able to set maxactive still doesn't guarantee that all retprobes fire.

All in all, maxactive seems like a bad API. I think bpftrace has it right with exposing the number of missed events instead. If we manage to resurrect the fd-based missed events patch we could provide a somewhat unified API across tracefs and pmu.

@alahaiyo in your PR description you say that you want to "miss as little as possible". Would it be acceptable to you if you instead had a way to see how many misses you had, and scaling your result accordingly?

If not, what value would you set maxactive to? How do you figure out whether the kretprobe needs maxactive in the first place? How do you plan on dealing with kernels < 4.12?

If just misses is not enough, we could add a higher level option like FunctionMaySleep bool or similar instead of Maxactive int. Behind the scenes the lib would try to figure out the best strategy, which for now would be setting maxactive=4096 and forcing tracefs. This is better than maxactive since it avoids exposing the low level implementation detail and gives us flexibility in the future. The downside is that it requires a lot more feature probing and magic in the library itself.

lmb avatar Aug 09 '22 08:08 lmb

@alahaiyo do you notice the missed events on specific machines, or is your goal to reduce misses outright? If there is a problem with a specific machine, is CONFIG_PREEMPT enabled? What is the value of /sys/devices/system/cpu/possible?

lmb avatar Aug 10 '22 14:08 lmb

From https://github.com/iovisor/bcc/pull/2224#issue-412958330:

For kernels without maxactive supports in debugfs, the error resolution is not straightforward. The kernel will still install a probe, but under a different name.

The issue is that < 4.12 kernels will treat the r50:... written to tracefs as the name of the probe. This will need to be handled if we end up adding maxactive after all.

lmb avatar Aug 11 '22 09:08 lmb

@alahaiyo do you notice the missed events on specific machines, or is your goal to reduce misses outright? If there is a problem with a specific machine, is CONFIG_PREEMPT enabled? What is the value of /sys/devices/system/cpu/possible?

It's not on specific machines. CONFIG_PREEMPT is not set and the value of /sys/devices/system/cpu/possible is 0-95.

alahaiyo avatar Aug 16 '22 02:08 alahaiyo

From iovisor/bcc#2224 (comment):

For kernels without maxactive supports in debugfs, the error resolution is not straightforward. The kernel will still install a probe, but under a different name.

The issue is that < 4.12 kernels will treat the r50:... written to tracefs as the name of the probe. This will need to be handled if we end up adding maxactive after all.

Yes, that is a good reference.

alahaiyo avatar Aug 16 '22 03:08 alahaiyo

On pre 4.12 kernels the user will receive an error from the kernel, since the feature is not supported. So users have to avoid setting maxactive on old kernels. This requires checking kernel version or similar, or feature probing from the library to avoid setting maxactive.

We can handle with that.

On >= 4.12 kernels, we will force use of tracefs if maxactive!=0 even if pmu is available. tracefs behaves poorly when the tracer crashes.

In the interface of link.Kprobe, users can ignore maxactive, so the default is 0 and pmu will be used. If the user does not want to missed events, then they can set maxactive by themselves, especially when the misses is unbearable.

If maxactive becomes available on pmu after all, we have to do feature detection to fall back to tracefs for kernels that have pmu but don't have the maxactive API. If maxactive never becomes available on pmu we are forever locked into tracefs for kretprobes, even if regular pmu kprobes don't require maxactive anymore. It's completely unclear what users should set maxactive to, since it needs to account for all callers of a function on the whole system. So being able to set maxactive still doesn't guarantee that all retprobes fire.

Even setting maxactive to 4096 does not guarantee that events will not be missed, and if so we have no other way. But I think the user needs to be given the ability to adjust the possibility of event missed.

If not, what value would you set maxactive to? How do you figure out whether the kretprobe needs maxactive in the first place? How do you plan on dealing with kernels < 4.12?

In fact, I don't know how to set maxactive. In our case it was only found that there were kretprobe events missing and this was not acceptable. The solution to this problem is to increase maxactive to a large enough value.

we could add a higher level option like FunctionMaySleep bool or similar instead of Maxactive int

But I still think exposing maxactive is better than making the user think about whether the function might sleep. Maxactive is considered as an attribute of kretprobe, which can make it easier for users to find out whether the function of trace will have the problem of loss of kretprobe events.

alahaiyo avatar Aug 16 '22 12:08 alahaiyo

I'm glad that this PR can continue to push forward. I have made changes to the code based on the comments. @lmb

alahaiyo avatar Nov 02 '22 10:11 alahaiyo

@alahaiyo I think the issue I pointed out in https://github.com/cilium/ebpf/pull/755#issuecomment-1211758219 is not addressed yet?

lmb avatar Dec 09 '22 14:12 lmb

I found that it already has a query for the event ID after the event is created. So I just need to determine if I need to recreate it. My new commit is pushed. @lmb

alahaiyo avatar Dec 13 '22 02:12 alahaiyo

I like how you integrated the check for unsupported maxactive! In the interests of getting this merged soon I'll push a couple of commits that contain changes I would've asked you to make. There is only one functional change: I'm dropping the fallback to disable maxactive.

Please take a look at my commits and let me know if you are OK with the modifications. Please also update the tests as mentioned.

I total OK with that. And I push my commits which deal with tests and a little bugfix.

alahaiyo avatar Dec 15 '22 12:12 alahaiyo

deal with tests and a little bugfix.

A good catch. I've opted to return os.ErrExist from create... instead, that way we don't have to change the tests. I also included the symbol offset in the even name when removing, since that may be != 0.

lmb avatar Dec 15 '22 13:12 lmb