
Optimize BPF maps with BPF_F_NO_PREALLOC to reduce memory usage

Open kyledong-suse opened this issue 2 months ago • 22 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Is your feature request related to a problem?

As mentioned as part of #4191, while testing Tetragon’s policy filter implementation (pkg/policyfilter/map.go), I noticed that the inner per-policy maps (policy_%d_map) currently use a fixed size (32768) and the default preallocated hash map mode:

// addPolicyMap adds and initializes a new policy map
func (m PfMap) newPolicyMap(polID PolicyID, cgIDs []CgroupID) (polMap, error) {
	name := fmt.Sprintf("policy_%d_map", polID)
	innerSpec := &ebpf.MapSpec{
		Name:       name,
		Type:       ebpf.Hash,
		KeySize:    uint32(unsafe.Sizeof(CgroupID(0))),
		ValueSize:  uint32(1),
		MaxEntries: uint32(polMapSize),   // currently const = 32768
	}
        ...
}

This causes significant memory preallocation even when the number of tracked cgroups per policy is small.

I tested using the following tracing policy:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "policy-1"
spec:
  podSelector:
    matchLabels:
      app: "ubuntu"
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotEqual"
        values:
        - "/usr/bin/sleep"
        - "/usr/bin/cat"
        - "/usr/bin/my-server-1"
      matchActions:
      - action: Override
        argError: -1
  options:
  - name: disable-kprobe-multi
    value: "1"

After applying the policy, the resulting map allocation was:

278096: hash  name policy_1_map  flags 0x0
	key 8B  value 1B  max_entries 32768  memlock 2622752B
	pids tetragon(1208075)

This corresponds to ~2.6 MB of memory preallocated for a single inner map, which is excessive given that only a handful of cgroups are typically tracked per policy.

Describe the feature you would like

We want to allocate memory only for the entries that are actually used, instead of preallocating the full capacity, to reduce the footprint.

Describe your proposed solution

I wonder if we can enable BPF_F_NO_PREALLOC for these inner maps—similar to how Tetragon already handles certain maps in https://github.com/cilium/tetragon/blob/main/pkg/sensors/tracing/selectors.go.

This change ensures that each inner policy map (policy_%d_map) uses lazy allocation instead of preallocating all hash buckets upfront. It will not affect outer map creation or map-of-maps semantics.
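For illustration, a minimal sketch of the change, mirroring the innerSpec from the description above; only the Flags field is new, and it assumes the flag constant comes from golang.org/x/sys/unix, as the linked selectors.go code already uses it:

innerSpec := &ebpf.MapSpec{
	Name:       name,
	Type:       ebpf.Hash,
	KeySize:    uint32(unsafe.Sizeof(CgroupID(0))),
	ValueSize:  uint32(1),
	MaxEntries: uint32(polMapSize),
	// Allocate hash elements lazily on insert instead of
	// preallocating all max_entries up front.
	Flags: uint32(unix.BPF_F_NO_PREALLOC),
}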

With this optimization, memory usage improves significantly. For example, with the same configuration and tracing policy:

279555: hash  name policy_1_map  flags 0x1
	key 8B  value 1B  max_entries 32768  memlock 525312B
	pids tetragon(1350594)

This represents only ~0.5 MB of memory, compared to ~2.6 MB without the flag—an approximate 80% reduction in memory preallocation per policy map in this example.

Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

kyledong-suse avatar Oct 25 '25 02:10 kyledong-suse

cc @kkourt

olsajiri avatar Oct 30 '25 08:10 olsajiri

Just a recap of what I wrote on BPF_F_NO_PREALLOC in the other issue:

This is a tool that can be leveraged for reducing memory use of maps but:

  1. It doesn't apply to all maps
  2. We could argue that it mostly delays memory consumption instead of reducing it

I would use it as one of the last steps when trying to tune the memory use of those, tbh. If you think you will gain memory long term by having NO_PREALLOC, your map is certainly just too large and could be resized.

I think if we want to tackle the BPF map memory use problem again we should:

  1. Verify if it's an actual issue.
  2. Check which maps are the biggest consumers.
  3. (hopefully we already fixed this) If the map is not used because the feature is not enabled, resize it to 1 to minimize impact.
  4. Resize the maps to a better size, add a flag for resizing, or group that size along with other maps.
  5. Then consider using NO_PREALLOC to optimize startup memory / situations in which we don't want to scale.

Originally posted by @mtardy in #4191

So to recap, I'm not against BPF_F_NO_PREALLOC, but I'd also consider reducing the size of the map if that makes sense in this specific case. A lot of our maps are of size 2^15 for no particular reason.

mtardy avatar Oct 30 '25 09:10 mtardy

Hi @mtardy, thank you so much for the thoughtful guidance on BPF_F_NO_PREALLOC and map sizing. Yes, I had read your comment in https://github.com/cilium/tetragon/issues/4191#issuecomment-3411447022 very carefully.

Actually, we had evaluated reducing the map size, but the challenge for this specific map (policy_id → cgroup_id inner maps) is that its contents are dynamic: cgroup IDs are added/removed at runtime as pods/containers come and go, so a single fixed size chosen at initialization can be either wasteful or prone to overflow later.

We think using the BPF_F_NO_PREALLOC flag on these inner maps can avoid paying the full capacity up front and handle runtime growth without guesswork.

If this approach sounds reasonable, we're happy to send a PR. Thanks.

kyledong-suse avatar Oct 30 '25 23:10 kyledong-suse

Hi @mtardy and @kkourt just checking in to see if there’s been any update or thoughts on this one.

I’ve been thinking about a potential workaround: we could consider using some headroom when initializing the inner per-policy maps (policy_%d_map).

Right now, the max entries are set to 32768. Even if we know the size of the []CgroupID list and try to double or triple it during initialization, it can still grow dynamically at runtime — so resizing safely seems tricky.
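To make the headroom idea concrete, a rough sketch of a sizing helper could look like the following (purely hypothetical names and numbers, and it still would not cover growth beyond the chosen cap):

// policyMapEntries picks max_entries for a new inner policy map with some
// headroom over the initial cgroup list. The floor, cap and factor are
// made-up values for illustration only.
func policyMapEntries(initialCgroups int) uint32 {
	const (
		minEntries     = 128   // floor so small policies still have room to grow
		maxEntries     = 32768 // keep the current value as an upper bound
		headroomFactor = 3     // e.g. triple the initial list
	)
	n := initialCgroups * headroomFactor
	if n < minEntries {
		n = minEntries
	}
	if n > maxEntries {
		n = maxEntries
	}
	return uint32(n)
}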

Do you think adding headroom is a reasonable direction, or is there a more robust way to handle dynamic growth of these inner maps?

Thanks again for any input. I really appreciate your time!

kyledong-suse avatar Nov 06 '25 22:11 kyledong-suse

Hello,

Actually, we had evaluated reducing the map size, but the challenge for this specific map (policy_id → cgroup_id inner maps) is that its contents are dynamic: cgroup IDs are added/removed at runtime as pods/containers come and go, so a single fixed size chosen at initialization can be either wasteful or prone to overflow later.

That's a very common tradeoff when using BPF maps (because, generally speaking, it's challenging to dynamically size them).

One thing we would need to figure out is whether it is allowed to have inner maps having different sizes. From what I recall, this is not allowed in older kernel versions (e.g., see https://lore.kernel.org/bpf/[email protected]/)

Regarding BPF_F_NO_PREALLOC, this would only work if, for the whole duration of the Tetragon agent's lifetime, there are policies which do not match workloads on the node. Because if, at some point, a workload that matches this policy was executed on the node, then the inner map would be populated.

We could provide a knob to statically set the inner map entries to something other than 32768 (which I do agree is a large number, but this is by design). If we do, though, we should have a clear way to communicate to users any failure to add entries to this map, because it would mean that policies are not applied to certain workloads.

By adding headroom, I'm guessing you mean to have each inner map sized based on a function f, e.g., f(x) = 3x, where x is the current number of cgroup IDs? If we want to do this per policy, then it would only work for kernels that support differently sized inner maps, which is a subset of the kernels we have to support.

Maybe it makes sense to start with a way to configure the max entries for the inner maps? At least this would give users a way to address balanced workloads, where the number of workloads matching a policy is roughly known, and it allows us to figure out the error path when the maps overflow, which we are going to need anyway if we start playing with these sizes.
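For example, the overflow error path could look roughly like the sketch below. This is a hypothetical wrapper around the existing map update: cilium/ebpf should surface a full hash map as an error wrapping E2BIG, and the returned error is what we would log and count in a metric:

// addPolicyCgroup inserts a cgroup ID into a policy's inner map and makes
// map-full failures visible instead of silently skipping the workload.
func addPolicyCgroup(m *ebpf.Map, polID PolicyID, cgID CgroupID) error {
	err := m.Update(cgID, uint8(1), ebpf.UpdateAny)
	if err == nil {
		return nil
	}
	if errors.Is(err, unix.E2BIG) {
		// The inner map is full: the policy will NOT be applied to this
		// workload. This is the condition to log and expose as a metric.
		return fmt.Errorf("policy %d map is full, cgroup %d not tracked: %w", polID, cgID, err)
	}
	return fmt.Errorf("policy %d map update failed: %w", polID, err)
}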

Thoughts?

kkourt avatar Nov 07 '25 08:11 kkourt

Hi @kkourt, Thanks for the detailed response!

That makes sense regarding the dynamic nature of the policy_id → cgroup_id inner maps, especially given the churn of cgroups as pods come and go. I also agree that dynamic sizing would be quite complex to manage safely, and kernel support for variable-sized inner maps is still a limiting factor on older versions.

For BPF_F_NO_PREALLOC, I just wanted to clarify one point, and please feel free to correct me if I'm wrong. From my understanding, even when this flag is used, the map won't consume the full capacity upfront; it only allocates memory for keys that are actually inserted. So if a policy eventually matches workloads, the map will gradually grow with actual usage rather than pre-allocate the full number of entries.

I also noticed that Tetragon already uses BPF_F_NO_PREALLOC in some maps, for example, https://github.com/cilium/tetragon/blob/main/pkg/sensors/tracing/selectors.go#L224. So the pattern is not entirely new to the codebase.

If my understanding of the usage of BPF_F_NO_PREALLOC is correct, I think we can take advantage of this flag for this map. Meanwhile, if you'd like, I can also implement a configurable static knob for this map as well.

WDYT?

kyledong-suse avatar Nov 07 '25 15:11 kyledong-suse

For BPF_F_NO_PREALLOC, I just wanted to clarify one point, and please feel free to correct me if I'm wrong. From my understanding, even when this flag is used, the map won't consume the full capacity upfront; it only allocates memory for keys that are actually inserted. So if a policy eventually matches workloads, the map will gradually grow with actual usage rather than pre-allocate the full number of entries.

Thanks for the clarification. From quickly looking at the kernel code (https://elixir.bootlin.com/linux/v6.17.7/source/kernel/bpf/hashtab.c), my impression is that you are indeed correct. It would be great if we could verify that this is indeed the case (e.g., with an experiment that quantifies the memory savings). Based on the above, using BPF_F_NO_PREALLOC makes sense to me. It would also be nice to hear from @mtardy, who has done a lot of work in this area.
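As a sketch of such an experiment (a hypothetical standalone program, not Tetragon code, and it needs root/CAP_BPF), one could create the same map spec with and without the flag and compare the kernel's memlock accounting exposed in the map fd's fdinfo:

package main

import (
	"fmt"
	"os"
	"regexp"

	"github.com/cilium/ebpf"
	"golang.org/x/sys/unix"
)

// memlock extracts the "memlock:" value the kernel reports for a BPF map fd.
func memlock(m *ebpf.Map) string {
	data, err := os.ReadFile(fmt.Sprintf("/proc/self/fdinfo/%d", m.FD()))
	if err != nil {
		return "unknown"
	}
	match := regexp.MustCompile(`memlock:\s*(\d+)`).FindSubmatch(data)
	if match == nil {
		return "unknown"
	}
	return string(match[1])
}

func main() {
	for _, flags := range []uint32{0, uint32(unix.BPF_F_NO_PREALLOC)} {
		// Same shape as policy_%d_map: 8-byte key, 1-byte value, 32768 entries.
		m, err := ebpf.NewMap(&ebpf.MapSpec{
			Type:       ebpf.Hash,
			KeySize:    8,
			ValueSize:  1,
			MaxEntries: 32768,
			Flags:      flags,
		})
		if err != nil {
			panic(err)
		}
		fmt.Printf("flags=%#x memlock=%s bytes\n", flags, memlock(m))
		m.Close()
	}
}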

I would also expect that BPF_F_NO_PREALLOC consumes more CPU, but it's not clear if that's an issue. If it is, we might consider adding a flag to enable/disable it.

kkourt avatar Nov 07 '25 15:11 kkourt

@kkourt, Thank you very much for confirming! That aligns with what I was thinking. I actually ran a small POC using BPF_F_NO_PREALLOC, as mentioned in the issue description. The results show about 0.5 MB of memory usage with the flag, compared to around 2.6 MB without it - roughly an 80% reduction in preallocated memory per policy map in the example.

I understand your concern regarding potential CPU overhead. If we observe significant impact, making this flag configurable via a startup option or a policy-level flag, as you suggested, sounds like a good approach.

Let’s wait for @mtardy’s input. If you both agree that using BPF_F_NO_PREALLOC for this map makes sense, I’d be happy to move forward with implementing it. We can make the configurability (enable/disable) part a follow-up step afterward.

WDYT?

kyledong-suse avatar Nov 07 '25 15:11 kyledong-suse

Yeah I think it's fine, I have nothing against BPF_F_NO_PREALLOC, it can definitely be helpful!

We could provide a knob to statically set the inner map entries to something else than 32768 (which I do agree is a large number, but this is by design), but if we do, we should have a clear way to communicate to users failure to add entries to this map because this would mean that policies are not applied on certain workloads.

It seems to me that setting large numbers to make these issues disappear is like kicking the can down the road. Ideally these maps should be sized for reasonable memory use while not immediately breaking, with a mechanism of alerts (logs, metrics) from the beginning for when they fail. It's really hard to size them since cluster size varies greatly, but maybe we should base our estimations on cluster size stats: we could agree on a "typical cluster size" and base most of our sizing on it.

mtardy avatar Nov 10 '25 09:11 mtardy

It seems to me that setting large numbers to make these issues disappear is like kicking the can down the road. Ideally these maps should be sized for reasonable memory use while not immediately breaking, with a mechanism of alerts (logs, metrics) from the beginning for when they fail. It's really hard to size them since cluster size varies greatly, but maybe we should base our estimations on cluster size stats: we could agree on a "typical cluster size" and base most of our sizing on it.

I am not sure if I understand your point. Are you discussing what the default value should be or whether there should be a switch to configure a different value than the default?

kkourt avatar Nov 10 '25 10:11 kkourt

I am not sure if I understand your point. Are you discussing what the default value should be or whether there should be a switch to configure a different value than the default?

The default value. This is maybe out of scope for this, but my first point in https://github.com/cilium/tetragon/issues/4249#issuecomment-3466765439 was just that BPF_F_NO_PREALLOC is an optimization, not really a memory-saving measure. And generally it would be great to have guidance on a "typical cluster size" so that we size maps appropriately instead of just writing 32768 everywhere without thinking much, as we pretty much do now.

mtardy avatar Nov 10 '25 10:11 mtardy

Thanks @kkourt and @mtardy! This has been a great discussion, and I completely agree with the points you raised.

Using BPF_F_NO_PREALLOC seems like a good first step toward more efficient memory usage while keeping things safe. I totally agree that having clear guidance for “typical cluster sizing” and better observability (logs/metrics when maps reach capacity) would be ideal longer-term improvements.

For now, I’d like to proceed with adding BPF_F_NO_PREALLOC as an enhancement for this specific map as a starting point. Once we have that in place, we can continue the discussion around sizing defaults and possible user-configurable knobs with user's feedback.

kyledong-suse avatar Nov 10 '25 14:11 kyledong-suse

I believe we should consider adding a flag anyway. It will allow us to easily compare the two approaches, and act as a backup in case there are unwanted side-effects from this change. I would also expect that adding the flag is not too complicated.

kkourt avatar Nov 17 '25 13:11 kkourt

I agree. I’ll add a user-configurable knob enable-policy-filter-no-prealloc. With this flag, users can choose whether to enable or disable BPF_F_NO_PREALLOC for the inner policy maps, depending on their needs.

kyledong-suse avatar Nov 17 '25 15:11 kyledong-suse

Hi @kkourt and @mtardy, I took another look at our use cases and ran a quick POC (with the same tracing policy that I mentioned in the issue description) to measure the actual memory impact. It turns out the preallocated memory for the override_tasks map is still quite large per policy:

97703: hash  name override_tasks  flags 0x0
	key 8B  value 4B  max_entries 32768  memlock 2622752B
	btf_id 101182
	pids tetragon(157312)

With BPF_F_NO_PREALLOC enabled, the memory usage drops significantly for override_tasks map from ~2.6 MB to ~0.5 MB.

98032: hash  name override_tasks  flags 0x1
	key 8B  value 4B  max_entries 32768  memlock 525312B
	btf_id 101521
	pids tetragon(212884)

Given this footprint, I’m wondering if it would make sense to apply the same idea here and introduce a user-configurable knob to enable or disable BPF_F_NO_PREALLOC for override_tasks map as well. This would give users more flexibility to balance memory usage versus performance depending on their environment.

Does this sound reasonable?

kyledong-suse avatar Nov 17 '25 21:11 kyledong-suse

It will allow us to easily compare the two approaches, and act as a backup in case there are unwanted side-effects from this change.

Yeah maybe we can have a flag in common for all those changes so that we don't add N flags for each of these maps.

Does this sound reasonable?

Yep sure!

By the way, regarding sizing: I know this is a different piece of work than what's discussed in this issue, but in case you're interested, I asked a few questions here and there and got the answer that most public reports (which focus on OSS stuff, not enterprise, which can greatly vary of course) consider ranges of 100-500 nodes and small nodes, like 4-8 cores and 16GB of RAM. We could start with that in mind, start sizing maps accordingly, and create a new pkg to provide values for a set of defaults: small/medium/large.

The general idea would be to have a flag like --map-size=small, --map-size=medium (we can think about better naming for sure 😄) that could default to medium, and then we could provide a bunch of funcs to output sane max numbers for each category:

  • size for typical number of procs on the machine
  • size for typical number of workloads in the cluster
  • size for typical number of policies
  • etc.

I think this approach would avoid the current situation where we have a lot of maps that are oversized and just use so much memory for nothing. But this is a bit more long term than the BPF_F_NO_PREALLOC feature here, which is a good quick solution :)!
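To make the idea concrete, a rough sketch of what such a package could look like (entirely hypothetical names and numbers, just for illustration):

// Package mapsizes sketches the idea above: pick a cluster-size profile once
// and derive map sizes from it instead of hardcoding 32768 everywhere.
package mapsizes

type Profile string

const (
	Small  Profile = "small"  // e.g. <100 nodes
	Medium Profile = "medium" // e.g. 100-500 small nodes (4-8 cores, 16GB RAM)
	Large  Profile = "large"
)

// Placeholder per-profile estimates; the real numbers would come from
// cluster-size stats as discussed above.
var (
	procsPerNode     = map[Profile]uint32{Small: 2048, Medium: 8192, Large: 32768}
	workloadsPerNode = map[Profile]uint32{Small: 64, Medium: 256, Large: 1024}
	policiesPerNode  = map[Profile]uint32{Small: 16, Medium: 64, Large: 256}
)

func TypicalProcs(p Profile) uint32     { return procsPerNode[p] }
func TypicalWorkloads(p Profile) uint32 { return workloadsPerNode[p] }
func TypicalPolicies(p Profile) uint32  { return policiesPerNode[p] }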

mtardy avatar Nov 20 '25 11:11 mtardy

@mtardy Thanks for the feedback!

On the flag approach:

Yeah maybe we can have a flag in common for all those changes so that we don't add N flags for each of these maps.

I think that's a very good suggestion. I'll refactor the flag to --bpf-map-no-prealloc and apply it to both the policy_%d_map and the override_tasks map. This keeps it general and reusable for future maps that need this flag.

On the sizing work:

The sizing approach you mentioned (--map-size=small/medium/large) sounds like a really great idea, and it makes sense as a longer-term improvement. It would help avoid oversized maps and reduce memory usage. I'm happy to help with that when we get to it, but I'll keep it separate from this issue.

kyledong-suse avatar Nov 20 '25 20:11 kyledong-suse

@mtardy Thanks for the feedback!

On the flag approach:

Yeah maybe we can have a flag in common for all those changes so that we don't add N flags for each of these maps.

My preference would be to have a way to tweak this option for every map for which we support it. I find having these knobs very useful in practice, even if they are rarely used.

Can we have both? That is, a flag for everything and then the ability to set individual flags?

kkourt avatar Nov 21 '25 15:11 kkourt

Also, it does not have to be one flag per map. We can have something like:

--bpf-maps-prealloc=true --bpf-map-prealloc-disable=map1,map2

Where we enable prealloc on all maps, and disable on specific maps (or vice-versa depending on what is the default we want).

kkourt avatar Nov 21 '25 15:11 kkourt

Also, it does not have to be one flag per map. We can have something like: --bpf-maps-prealloc=true --bpf-map-prealloc-disable=map1,map2 Where we enable prealloc on all maps, and disable on specific maps (or vice-versa depending on what is the default we want).

@kkourt I think that's a really great idea!

I’m proposing the following two flags by default:

  • --bpf-maps-prealloc=false
  • --bpf-map-no-prealloc=map1,map2

The global flag takes precedence. So the logic would be:

  • If --bpf-maps-prealloc=false, then all maps get BPF_F_NO_PREALLOC set regardless of what's in --bpf-map-no-prealloc.
  • If --bpf-maps-prealloc=true, then only maps listed in --bpf-map-no-prealloc get BPF_F_NO_PREALLOC set.

We could tweak the naming a bit if needed. WDYT?

kyledong-suse avatar Nov 22 '25 01:11 kyledong-suse

Above makes sense to me. A few notes:

  • Presumably, we will support setting this flag only for a subset of maps (at least in the first PR). It would be nice if the list of maps for which this is supported is shown in --help for both --bpf-maps-prealloc and --bpf-map-no-prealloc.

  • If --bpf-maps-prealloc=false and --bpf-map-no-prealloc is set, we can emit a warning to capture user errors.

We can start with a small list of maps that support this in the first PR. If this ends up being extended in the future, it might be useful to come up with naming where we can define exceptions regardless of what the default is.

Two ideas are:

First:

--bpf-maps-prealloc=true|false
--bpf-maps-prealloc-exceptions=map1,map2 # exceptions to default

Second:

--bpf-maps-prealloc=true|false
--bpf-maps-disable-prealloc=map1,map2 # if default is true
--bpf-maps-enable-prealloc=map1,map2 # if default is false, and we can add this in the future as needed.

Thoughts? (I don't have a strong opinion on the names themselves, but I think we should try and figure out something that allows us to express exceptions for both defaults in the future if we need to.)

kkourt avatar Nov 24 '25 10:11 kkourt

@kkourt, Thanks for the detailed feedback — all points make sense.

I like your option #2, which is clearer and no more complicated than option #1. It has two explicit lists, and only one is used at a time based on the default, so the logic will be: if the default is true, check the disable list; if the default is false, check the enable list. For this issue, I'll implement --bpf-maps-prealloc=true|false and --bpf-maps-disable-prealloc=map1,map2 (used when the default is true).

If the default changes to false in the future, we can add --bpf-maps-enable-prealloc=map1,map2 to enable preallocation for specific maps as needed.
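A minimal sketch of that resolution logic, with hypothetical helper and parameter names mirroring the two flags:

// disablePrealloc decides whether BPF_F_NO_PREALLOC should be set for a map.
// preallocDefault mirrors --bpf-maps-prealloc and disableList mirrors
// --bpf-maps-disable-prealloc.
func disablePrealloc(mapName string, preallocDefault bool, disableList []string) bool {
	if !preallocDefault {
		// The default is "no preallocation" for every supported map.
		return true
	}
	for _, name := range disableList {
		if name == mapName {
			// Explicit per-map exception to the preallocating default.
			return true
		}
	}
	return false
}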

kyledong-suse avatar Nov 25 '25 01:11 kyledong-suse