libs FR: TOCTOU mitigation without userspace enter events

Motivation

TOCTOU mitigation should be handled on the kernel side to enable disabling of userspace enter events as proposed here.

Feature

As described in the proposal:

The mitigation implemented in this PR (https://github.com/falcosecurity/libs/pull/235) uses enter events. How can we achieve the same result without sending these events to userspace?

For the connect syscall, we can use the tuple parameter in the exit event to avoid issues with TOCTOU. So, the mitigation here is to populate the userspace with information directly from the kernel.

For what concerns the other 4 syscalls, the idea is to hook into the specific tracepoints (e.g., tracepoint/sys_enter_open), collect the necessary parameters, and save them in a map indexed by thread ID. So we use a bpf hash map with thread-id as a key and syscall arguments as a value. This is very similar to what we do today when RAW_TRACEPOINTS are not enabled: https://github.com/falcosecurity/libs/blob/0.19.0/driver/bpf/maps.h#L103 In the exit tracepoint, we can retrieve this information and send only one event to userspace. This drastically reduces the instrumentation time with respect to using the generic sys_enter tracepoint like today. Moreover, we won't send the enter event to userspace but we will merge the information directly in the kernel.

We have some options here, and I'm hoping to discuss the specific implementation design here, before embarking on journey to make it happen:

use of hash maps per driver (bpf hash maps where relevant, LRU maps in kernel driver, gvisor ??)
is the mentioned RAW_TRACEPOINT code a good example what is expected
is some refactoring needed in case we extend the implementation

Alternatives

Alternative or additional work: We could have a configuration flag or a compile time flag to allow opting out of TOCTOU mitigation to allow user choose between performance gain and the mitigation.

Additional context

Linked to configuration flag feature request, if we choose to implement runtime configuration flag for disabling enter events. Link to issue.

May 19 '25 06:05 tuminoid

Ping @terror96, @leogr, @ekoops

May 19 '25 06:05 tuminoid

These are details, but currently stash_map is defined only when BPF_SUPPORTS_RAW_TRACEPOINTS is not defined. Should we define a new stash_map or just use the already defined hash map also for storing args in creat, open, openat and openat2 entry handlers? However, stash_map is only defined for "legacy" bpf driver and we need to use something similar to other drivers as well in which case it could make sense to define a new, common structure name for all drivers just for storing entry parameters in these special case. Even that the implementation/definition & usage of the structure is most likely a bit different e.g. kmod vs "modern" bpf.

edit: from memory consumption point of view, it would be better to utilize existing structures if possible such as stash_map (bpf), auxmap (modern bpf)... asfaik, for kmod we would need to define a new temp structure.

May 27 '25 08:05 terror96

We could have a configuration flag or a compile time flag to allow opting out of TOCTOU mitigation to allow user choose between performance gain and the mitigation.

I would allow users to opt out only if performance concerns justify this need. So, I'd say to postpone the decision once we have some working implementation.

May 27 '25 09:05 leogr

/milestone 0.22

Jul 04 '25 07:07 leogr

@leogr: The provided milestone is not valid for this repository. Milestones in this repository: [0.22.0, TBD, next-driver]

Use /milestone clear to clear the milestone.

In response to this:

/milestone 0.22

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Jul 04 '25 07:07 poiana

/milestone 0.22.0

Jul 04 '25 07:07 leogr

/assign

Jul 04 '25 07:07 ekoops

/assign

Jul 08 '25 06:07 terror96

As we are approaching the end of https://github.com/falcosecurity/libs/issues/2427, I would like to revamp this discussion with some thoughts.

As briefly discussed in the mentioned issue and on this issue, the way to keep TOCTTOU mitigation while removing the sys_enter raw tracepoint programs is adding ad-hoc sys_enter tracepoint programs collecting the needed syscall parameters.

Actually, as stated in the proposal, long-term mitigation would involve attaching to internal kernel hooks rather than using the syscall enter hooks, but currently, we only support tracepoints as hook points in our drivers.

Parameters collected on these ad-hoc tracepoints need to be sent to userspace. We have two possible ways of doing so:

using some kind of hash map storing the association between the current thread TID and the parameters - in this scenario, the sys_enter tracepoint program would save the parameters it wants to share on this map, and the sys_exit raw_tracepoint program would retrieve them and use to build the whole exit event which will send to userspace
keep pushing enter events from the sys_enter tracepoint programs

The first solution would simplify userspace handling logic, but I have some concern on the performance penalties introduced by the operation on the shared hash map and the additional copies to bring the parameters to userspace through the ring buffer on the exit programs side. The second solution would be easier to implement, but would not benefit of any userspace handling logic simplification.

One final thought is related to the connect events couple, for which we currently have TOCTTOU mitigation. The current connect exit fillers/programs leverage the user-provided sockaddr to build part of the tuple parameter if these information cannot be gathered from kernel data structures (because they are not present). If we decide to not leverage the user-provided sockaddr anymore, we can avoid introducing the corresponding sys_enter tracepoint program to mitigate the TOCTTOU vulnerability, as we would have all the required trustable information on the exit event. As we still need to agree on the fact that not using anymore the user-provided sockaddr is the right choice, in my last PR I didn't move forward with this, and I temporarily kept the enter event handling logic.

What are your thoughts on this? @terror96 @falcosecurity/libs-maintainers

Jul 08 '25 08:07 ekoops

As we are approaching the end of #2427, I would like to revamp this discussion with some thoughts.

As briefly discussed in the mentioned issue and on this issue, the way to keep TOCTTOU mitigation while removing the sys_enter raw tracepoint programs is adding ad-hoc sys_enter tracepoint programs collecting the needed syscall parameters.

Actually, as stated in the proposal, long-term mitigation would involve attaching to internal kernel hooks rather than using the syscall enter hooks, but currently, we only support tracepoints as hook points in our drivers.

Parameters collected on these ad-hoc tracepoints need to be sent to userspace. We have two possible ways of doing so:
1. using some kind of hash map storing the association between the current thread TID and the parameters - in this scenario, the sys_enter tracepoint program would save the parameters it wants to share on this map, and the sys_exit raw_tracepoint program would retrieve them and use to build the whole exit event which will send to userspace

2. keep pushing enter events from the sys_enter tracepoint programs
The first solution would simplify userspace handling logic, but I have some concern on the performance penalties introduced by the operation on the shared hash map and the additional copies to bring the parameters to userspace through the ring buffer on the exit programs side. The second solution would be easier to implement, but would not benefit of any userspace handling logic simplification.

One final thought is related to the connect events couple, for which we currently have TOCTTOU mitigation. The current connect exit fillers/programs leverage the user-provided sockaddr to build part of the tuple parameter if these information cannot be gathered from kernel data structures (because they are not present). If we decide to not leverage the user-provided sockaddr anymore, we can avoid introducing the corresponding sys_enter tracepoint program to mitigate the TOCTTOU vulnerability, as we would have all the required trustable information on the exit event. As we still need to agree on the fact that not using anymore the user-provided sockaddr is the right choice, in my last PR I didn't move forward with this, and I temporarily kept the enter event handling logic.

What are your thoughts on this? @terror96 @falcosecurity/libs-maintainers

I think which ever way we choose to go, it would most likely be beneficial to also have an option for the user to disable mitigation in which case the potential performance penalty is avoided as well as entry events could completely be avoided.

Jul 08 '25 09:07 terror96

I think which ever way we choose to go, it would most likely be beneficial to also have an option for the user to disable mitigation in which case the potential performance penalty is avoided as well as entry events could completely be avoided.

Totally agree, the mitigation should be on by default and the user could choose to disable it.

Jul 08 '25 09:07 ekoops

Only random thought, but do we need to send the original value to the user level what if we just store the original value in a temp buffer upon enter, run memcmp in the kernel upon exit and send the cmp result instead?

Jul 09 '25 10:07 terror96

Only random thought, but do we need to send the original value to the user level what if we just store the original value in a temp buffer upon enter, run memcmp in the kernel upon exit and send the cmp result instead?

The current behaviour aims to neutralize a possible attack by using the enter event parameter, but does not detect that the attack happened. On the contrary, your proposed approach will detect that the attack happened, but will not allow sinsp to use the right parameter values to update its internal state. However, if we accept the overhead of copying the data on a temporary buffer, I guess we can also accept copying the data from the temporary buffer to the auxiliary map: this last statement is motivated by the fact that I'm realizing that copying a parameter from kernel memory to the auxiliary map (e.g.: copying the open syscall's name parameter with bpf_probe_read_{kernel,user}_str) should be as expensive as copying it from the temporary buffer to the auxiliary map. Notice that "copying a parameter from kernel memory to the auxiliary map" is a cost that we have already accepted.

Jul 09 '25 10:07 ekoops

Only random thought, but do we need to send the original value to the user level what if we just store the original value in a temp buffer upon enter, run memcmp in the kernel upon exit and send the cmp result instead?

The current behaviour aims to neutralize a possible attack by using the enter event parameter, but does not detect that the attack happened. On the contrary, your proposed approach will detect that the attack happened, but will not allow sinsp to use the right parameter values to update its internal state.

True, better to keep the original behavior in favor of neutralization.

Jul 10 '25 06:07 terror96

I tried to scratch something based on the first approach mentioned in my comment:

using some kind of hash map storing the association between the current thread TID and the parameters - in this scenario, the sys_enter tracepoint program would save the parameters it wants to share on this map, and the sys_exit raw_tracepoint program would retrieve them and use to build the whole exit event which will send to userspace

Let's call the aforementioned hash map shared_data_map. Moreover, I'm gonna call the auxiliary map auxmap. For simplicity, I'm only going to discuss the new findings in the enter programs, and take the open system call as an example, since the exit program should only copy the data from shared_data_map into auxmap (the open's filename parameter), as we expect.

First of all, let's analyze if the hash map (BPF_MAP_TYPE_HASH) is the right choice. For sure, we cannot choose a per-cpu map, as enter and exit programs can run on different CPUs. But we could use an array... Why we should not use an array? Because of its memory footprint: using an array would require allocating an entry, of considerable size (4096 seems to be a good maximum filepath size), for each possible pid (at least 32678); this means 133 MB! Notice also that hash maps are pre-allocated by default, but we can disable the pre-allocation by using the flag BPF_F_NO_PREALLOC.

The following is a possible implementation of the enter program, together with the map definitions:

typedef char buffer[4096];

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 32678);
	__type(key, pid_t);
	__type(value, buffer);
    __uint(map_flags, BPF_F_NO_PREALLOC);
} shared_data_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__type(key, __u32);
	__type(value, buffer);
	__uint(max_entries, 1);
} scratch_buf_map SEC(".maps");

struct sys_enter_open_context {
	__u64 pad;
	__u32 syscall_nr;
	__u64 filename;
	__u64 flags;
	__u64 mode;
};

SEC("tp/syscalls/sys_enter_open")
int sys_enter_open_prog(struct sys_enter_open_context *ctx) {
	const __u32 zero = 0;
	buffer *scratch_buf = bpf_map_lookup_elem(&scratch_buf_map, &zero);
	if (!scratch_buf) {
		bpf_printk("enter: failed to get scratch buffer\n");
		return -1;
	}

	const char *filename = (const char *)ctx->filename;
	const long written_bytes = bpf_probe_read_user_str(scratch_buf, sizeof(buffer), filename);
	if (written_bytes < 0) {
		bpf_printk("enter: failed to write filename into scratch buffer\n");
		return -1;
	}

	const __u32 pid = (__u32) bpf_get_current_pid_tgid();
	int res = bpf_map_update_elem(&shared_data_map, &pid, scratch_buf, BPF_ANY);
	if (res < 0) {
		bpf_printk("enter: failed to write data into shared_data_map\n", res);
		return -1;
	}
	return 0;
}

To the extent of my knowledge, there is no eBPF hash maps helper allowing to request the creation of a map entry without providing any data for the entry's value. This means that we must stick with using bpf_map_update_elem(), and providing a pointer to a valid buffer... However, the buffer size is 4096 in this case: we cannot create a buffer directly on the stack and provide its pointer to the helper (because the BPF stack is limited to 512 bytes), so we must use a separate per-cpu scratch buffer map (scratch_buf_map).

As you can see, this resulted in a copy of the filename into the retrieved scratch space, and a copy of the scratch space into the map: these are two big copies! Moreover, we disabled the pre-allocation, so the kernel sometimes would need to allocate some considerable amount of space for us. Switching back to the array doesn't seem to be a good idea, as this would bring back the big pre-allocation problem, even if it would avoid us making the first additional copy into the scratch space.

Having said that, I would suggest proceeding with original plan of pushing data directly from the enter programs, even if this would not allow to simplify more the userspace code :pensive:

Jul 30 '25 10:07 ekoops

I think your findings are valid, and I'm not aware of any magic around e.g. the stack limit, because that is set by the kernel not eBPF. I would also agree with no-go for preallocating anything so large in the kernel space.

Maybe the only way around this is to make it possible to disable TOCTTOU support (if the user chooses to), which would allow libs to omit also enter events for those particular syscalls required to support TOCTTOU. From implementation point of view, it would be required to make sure that the implementation does not break if the information for the enter events is not available.

void sinsp_parser::parse_open_openat_creat_exit(sinsp_evt &evt) const seems e.g. already now check that if lastevent_retrieved is set (i.e. enter event was retrieved successfully), and thus it does not break if no such event is present. However, I'm not sure where else dependency on the enter event information for parsing these particular TOCTTOU related syscall events is possibly hidden in the code.

Jul 30 '25 12:07 terror96

Yes, that code path is already handling the possible absence of the enter event. I still need to manage some corner case related to https://github.com/falcosecurity/libs/pull/2068, but after this (since nobody complained) I can work on the tracepoint implementation. Of course this would require us to provide the flag you are mentioning.

Jul 31 '25 07:07 ekoops

I want to recap here my new findings related to implementing the original solution using tracepoints in the modern probe:

tracepoints programs, attached to syscalls/sys_enter_*, are not triggered for ia-32 system calls, leaving the solution relying only on tracepoints incomplete
tried to use fentry programs, attached to __ia32_compat_<syscall_name>/__ia32_<syscall_name> symbols, to specifically target ia-32 system call invocation, but solution is limited by the fact that, for some specific system calls, the kernel verifier prevents us to do anything with the pt_regs argument (https://stackoverflow.com/questions/72824924/invalid-bpf-context-access-when-trying-to-read-regs-parameter), The bug was fixed in recent kernel versions, but is still present on some old kernel flavors we are supporting

For system calls affected by the pt_regs problem, I'm gonna use a different approach: try to create a single fentry program hooking on an internal (stable) kernel function. This will avoid defining different programs for different flavors of the same system call, as the single fentry program will cover both 64 bit and ia-32 system call invocations.

Even if the fentry approach introduces a little delay (negligible in my opinion, as you can see here), I'm implementing solutions that avoid tail-calling, so the impact should be compansated.

Aug 18 '25 08:08 ekoops

After some attempts, I realized that using a single fentry program is not the right way to move forward, as it can still be vulnerable to fentry-related bugs in old kernel versions. I created https://github.com/falcosecurity/libs/pull/2581, which solves the issue using two distinct kprobe programs for ia-32 emulated system calls. See the PR description for more details.

Aug 19 '25 10:08 ekoops