
Scaling k8s workload-aware tracing policies


Tetragon CFP available here: https://github.com/cilium/design-cfps/pull/80

Hi all! We would like to use Tetragon to implement per-workload runtime security policies across a Kubernetes cluster. The goal is to establish a "fingerprint" of allowed behavior for every Kubernetes workload (Deployment, StatefulSet, DaemonSet), starting with the strict enforcement of which processes each workload is permitted to spawn.

Let's say in our cluster we have two deployments, my-deployment-1 and my-deployment-2, and we want to enforce the following policies:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "policy-1"
spec:
  podSelector:
    matchLabels:
      app: "my-deployment-1"
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotEqual"
        values:
        - "/usr/bin/sleep"
        - "/usr/bin/cat"
        - "/usr/bin/my-server-1"
      matchActions:
      - action: Override
        argError: -1
  options:
  - name: disable-kprobe-multi
    value: "1"
---
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "policy-2"
spec:
  podSelector:
    matchLabels:
      app: "my-deployment-2"
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotEqual"
        values:
        - "/usr/bin/ls"
        - "/usr/bin/my-server-2"
      matchActions:
      - action: Override
        argError: -1
  options:
  - name: disable-kprobe-multi
    value: "1"

Let's see what Tetragon injects into the kernel today.

eBPF prog point of view

The two policies above will result in the following eBPF programs being attached to the security_bprm_creds_for_exec function:

  • generic_kprobe_event (from policy-1) -> other generic kprobe called in tail call
  • generic_kprobe_event (from policy-2) -> other generic kprobe called in tail call
  • generic_fmodret_override (from policy-1)
  • generic_fmodret_override (from policy-2)

Of course, the number of progs grows linearly with the number of policies (and thus with the number of k8s workloads in our use case). As the number of policies grows, we hit the following limits:

  1. The first issue we face is the number of programs we can attach to the same hook. In particular, we have a limit of 38 progs if we use BPF_MODIFY_RETURN. This type of program relies on the eBPF trampoline and is subject to the BPF_MAX_TRAMP_LINKS limit (38 on x86, see the snippet after this list). https://elixir.bootlin.com/linux/v6.14.11/source/include/linux/bpf.h#L1138.
  2. Let's say we overcome this issue by using kprobes + SIGKILL; we then hit a second limit of 128 policies. This limit is hardcoded in the Tetragon code https://github.com/cilium/tetragon/blob/47538a07a4e6c51a9cc569f78c42a2cf767c5405/bpf/process/policy_filter.h#L23, probably to bound memory usage. We can probably overcome this limit as well by making it configurable.
  3. Now, the third issue, which I think we cannot overcome today, is performance overhead. The list of attached programs grows linearly with the number of policies we create. If we have 500 workloads in the cluster, we will have 500 programs attached to the same function. This could lead to a noticeable system slowdown when a new process is created. The slowdown could be even more relevant if we extend this behavior to other kernel subsystems (e.g., file system/network operations).
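For reference, the 38-program limit mentioned in point 1 comes from a kernel constant; the linked line looks roughly like this (the exact form and value can differ between kernel versions and architectures, so take this as an approximation):

/* include/linux/bpf.h (approximate): upper bound on the number of BPF
 * links (programs) attached to a single trampoline, 38 on x86.
 */
enum {
    BPF_MAX_TRAMP_LINKS = 38,
};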

eBPF maps point of view

For each of the above policies, I see more or less 50 eBPF maps loaded. Most of them have just 1 entry because they are probably not used, but others can take a great amount of memory. The reported memlock for each policy is around 8 MB. The most memory-intensive maps seem to be:

// inner map for each loaded policy with pod selectors
721: hash  name policy_1_map  flags 0x0 
 key 8B  value 1B  max_entries 32768  memlock 2624000B
 pids tetragon(63603)

// Still need to check if this is really needed (?)
764: lru_hash  name socktrack_map  flags 0x0
 key 8B  value 16B  max_entries 32000  memlock 2829696B
 btf_id 947
 pids tetragon(63603)

// map used for overriding the return value
766: hash  name override_tasks  flags 0x0
 key 8B  value 4B  max_entries 32768  memlock 2624000B
 btf_id 949
 pids tetragon(63603)

As you may imagine, also in this case having 500 deployments in the cluster could lead to significant memory usage on the node (8 MB * 500 = 4 GB).

Summary

With this issue, we just want to highlight the current scalability limitations we are facing. I would love your feedback on this. Do you see any mistakes in this analysis? I'm pretty new to Tetragon, so maybe I missed something and there is a way to overcome some of the above limitations that I didn't consider. If you confirm these are real limitations and you are interested in supporting this use case, we can maybe discuss possible ideas to address them.

Thank you for your time!

Andreagit97 avatar Oct 14 '25 16:10 Andreagit97

Hey 👋 thanks for opening this issue. Let me give you a first answer to some of the topics here.

On programs

Of course, the number of progs grows linearly with the number of policies (and thus with the number of k8s workloads in our use case). As the number of policies grows, we hit the following limits:

Indeed, because those are generic sensors, they are wired to be able to perform many different things, and thus you'll see them systematically attached to the hooking points specified by the policy. However, the generic programs should be written with the idea in mind that the checks to see whether a policy applies or not should be done early, to minimize overhead on "non-matching events" (like workloads that don't match the k8s labels, for example). All of that to say that this is the downside of the programs being generic and programmed by map values (but at the same time they provide great programmability).

We have debug commands that might help you there, check out tetra debug progs --help.

With more concrete analysis of a specific use case, we should be able to improve efficiency if we can spot "abnormal" overhead. Maybe also in the number of progs.

On maps

For each of the above policies, I see more or less 50 eBPF maps loaded. Most of them have just 1 entry because they are probably not used

Indeed, there's been an effort previously to correctly resize unused maps before loading them (see for example https://github.com/cilium/tetragon/pull/2546, https://github.com/cilium/tetragon/pull/2551, https://github.com/cilium/tetragon/pull/2555, https://github.com/cilium/tetragon/pull/2563 or https://github.com/cilium/tetragon/pull/2692, etc.). It is actually tricky to completely remove the unused maps; Cilium gave it a go and found a way (see https://github.com/cilium/cilium/pull/40416), but for Tetragon we mostly resize them to size 1, which quite often makes them negligible in memory use. We could eventually do like cilium/cilium, but one could argue that we would mostly gain by reducing the number of maps used by one program to avoid reaching the limit (64), rather than by saving actual memory.
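As a rough illustration of that resize-before-load idea (Tetragon itself does this from its Go loader, so the libbpf C calls, object path, and map name below are just assumptions for the sketch):

#include <bpf/libbpf.h>

/* Sketch: shrink a map that is unused for the current configuration to a
 * single entry before loading the object, so the kernel allocates almost
 * nothing for it. "sensor.bpf.o" and "unused_feature_map" are placeholders.
 */
int load_with_shrunk_map(void)
{
    struct bpf_object *obj = bpf_object__open_file("sensor.bpf.o", NULL);
    if (!obj)
        return -1;

    struct bpf_map *m = bpf_object__find_map_by_name(obj, "unused_feature_map");
    if (m)
        bpf_map__set_max_entries(m, 1); /* must be done before load */

    return bpf_object__load(obj);
}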

but others can take a great amount of memory.

We also have an equivalent command for maps and their memory use that might help you there, check out tetra debug maps --help. I also did a bit of research on that to dig into BPF memory use that could be useful to you.

The reported memlock for each policy is around 8 MB.

Because the size of a map must be statically set at load time, sometimes we just use arbitrary constants that we think will fit most use cases. But many times it's impossible to find a size that fits all, so the usual solution is to provide the ability to resize those maps with config flags. We also had the idea of proposing "sets of sizes" for maps, fitting the use cases of people running a small/medium/large number of workloads/policies for example, which would avoid them having to set each map's size one by one. Anyway, having users trying to scale Tetragon would greatly help in fine-tuning those numbers and reducing overall BPF map memory use, and we could investigate the specific maps you mentioned in a follow-up.

mtardy avatar Oct 14 '25 16:10 mtardy

cc @kkourt as I know (with other people) he's been digging into scaling the number of policies recently as well.

mtardy avatar Oct 14 '25 16:10 mtardy

Thank you very much for the quick feedback!

I agree with you that the objective should be to reduce the number of eBPF maps and potentially the number of eBPF programs.

However, this raises a question about the current model: is using generic sensors truly the right way to model the use case of one distinct policy per workload?

Currently, we need to create a dedicated policy for each workload because each requires specifying different values (e.g., allowed binary paths), but the enforcement logic itself is identical across all these policies. The entire process would be significantly simpler if there was a way to define a common enforcement skeleton referenced by all workloads, where each individual workload only supplies the required configuration values.

A possible abstraction to achieve this separation could be as follows:

[!NOTE] Please note this is not a proposal, but just an easy way to explain the idea. For example, instead of using 2 new CRDs, we could use some fields in the existing TracingPolicy CR, like the options one

We could have two custom resources:

  1. ForEachWorkloadPolicy: Defines the shared, unique enforcement logic (the "skeleton"). This resource is deployed once.
apiVersion: cilium.io/v1alpha1
kind: ForEachWorkloadPolicy
metadata:
  name: "block-not-allowed-process"
spec:
  # This is the hook we want to instrument once
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0 
        operator: "NotEqual" 
        values: [] # The actual values will be supplied by the 'Values' CRD
      matchActions:
      - action: Override
        argError: -1
  2. ForEachWorkloadPolicyValues: Deploys the configuration values and selects the specific pods/cgroups to which they apply.
apiVersion: cilium.io/v1alpha1
kind: ForEachWorkloadPolicyValues
metadata:
  name: "block-not-allowed-process-my-deployment-1" 
spec:
  # Reference to the policy (a unique ID, not just name, should be used for robustness)
  refPolicy: "block-not-allowed-process" 
  # Select the pods in the workload
  podSelector:
    matchLabels:
      app: "my-deployment-1"
  values: 
    - "/usr/bin/sleep"
    - "/usr/bin/cat"
    - "/usr/bin/my-server-1"
---
apiVersion: cilium.io/v1alpha1
kind: ForEachWorkloadPolicyValues
metadata:
  name: "block-not-allowed-process-my-deployment-2" 
spec:
  refPolicy: "block-not-allowed-process" 
  podSelector:
    matchLabels:
      app: "my-deployment-2"
  values: 
    - "/usr/bin/ls"
    - "/usr/bin/my-server-2"

On the eBPF side, the ForEachWorkloadPolicy would be responsible for loading and attaching a unique eBPF program. Each ForEachWorkloadPolicyValues resource would then populate a map to associate a given cgroup ID with its unique set of filters.

__attribute__((section("fmod_ret/security_bprm_creds_for_exec"), used)) long
per_workload_fmodret_override(void *ctx)
{
    // pseudo code to explain the idea

    // 1. Get the cgroup ID of the current process
    __u64 cgroupid = tg_get_current_cgroup_id();

    // 2. Retrieve filters associated with that cgroup ID
    // The map will be populated by each new `ForEachWorkloadPolicyValues`
    // map[cgroupid] -> filters_map_id

    // 3. Perform the enforcement based on the retrieved filters
    if (match){
      return 0;
    }
    // The error is defined once in the policy definition CRD
    return error; 
}

While I would prefer to leverage the existing generic sensors model, I believe it would be genuinely difficult to achieve this level of logic/value separation and resource optimization using that model. I'm very curious to hear your opinions on this. Do you think it is possible to achieve something similar with the current generic sensors model?

Andreagit97 avatar Oct 15 '25 10:10 Andreagit97

The entire process would be significantly simpler if there was a way to define a common enforcement skeleton referenced by all workloads, where each individual workload only supplies the required configuration values.

We could probably do this by extending the existing policy filter, which already stores cgroup->policy mappings, to include another mapping for cgroupid->values?
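A rough sketch of the shape that extra mapping could take (assuming the usual vmlinux.h/bpf_helpers.h includes; names and sizes are invented, and the real policy_filter.h layout is different):

/* Inner map spec: the per-workload values (e.g. allowed binary paths). */
struct workload_values {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 64);
    __uint(key_size, 256); /* e.g. a binary path */
    __uint(value_size, 1);
};

/* cgroup id -> per-workload values, next to the existing cgroup->policy map. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
    __uint(max_entries, 1024);
    __type(key, __u64);
    __array(values, struct workload_values);
} cgroup_values SEC(".maps");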

dwindsor avatar Oct 15 '25 20:10 dwindsor

For the memory utilization, the bpf.BPF_F_NO_PREALLOC option should reduce the memory used by eBPF maps a lot, but apparently Tetragon only enables bpf.BPF_F_NO_PREALLOC on some of the maps. I wonder if this is on purpose?

I understand that bpf.BPF_F_NO_PREALLOC might not be ideal due to https://github.com/torvalds/linux/commit/94dacdbd5d2d, but it doesn't explain why those flags are still there.

@kkourt @mtardy not sure if you know some background?

holyspectral avatar Oct 16 '25 14:10 holyspectral

For the memory utilization, the bpf.BPF_F_NO_PREALLOC option should reduce the memory used by eBPF maps a lot, but apparently Tetragon only enables bpf.BPF_F_NO_PREALLOC on some of the maps. I wonder if this is on purpose?

I understand that bpf.BPF_F_NO_PREALLOC might not be ideal due to torvalds/linux@94dacdbd5d2d, but it doesn't explain why those flags are still there.

@kkourt @mtardy not sure if you know some background?

This is a tool that can be leveraged for reducing memory use of maps but:

  1. It doesn't apply to all the maps
  2. We could argue that it mostly delays memory consumption instead of reducing it

I would use it in the last steps of trying to tune memory use of those tbh. If you think you will gain memory long term by having NO_PREALLOC, your map is certainly just too large and could be resized.

I think if we want to tackle the BPF map memory use problem again we should:

  1. Verify if it's an actual issue.
  2. Check which maps are the biggest consumers.
  3. (hopefully we have already fixed this) If a map is not used because the feature is not enabled, resize it to 1 to minimize impact.
  4. Resize the maps to a better size, add a flag for resizing, or group that size along with other maps.
  5. Then consider using NO_PREALLOC to optimize startup memory / situations in which we don't want to scale.

mtardy avatar Oct 16 '25 15:10 mtardy

@mtardy thanks for the prompt and detailed response! Yes it probably doesn't fit all the maps. Our focus for now is the policy_filter_maps and its inner map.

In our use case, we would like to have policies defined via TracingPolicyNamespaced and TracingPolicy CRs. The problem is that we don't really know how many there will be until those CRs are created, because it varies between clusters. Without the policy_filter_maps being dynamically allocatable, we would end up with many rounds of tuning to find a good maximum size for the map, and with over-provisioning.

That's why I think BPF_F_NO_PREALLOC might be a good fit here. I hope this makes sense, and we would love to contribute to improving the scalability of the policy_filter_maps.
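For illustration, the flag is set per map at definition time, so entries would only be allocated when CRs actually create them (the map name and sizes below are placeholders, not the real policy filter layout):

/* Placeholder example: a hash map that opts out of preallocation, so its
 * entries are allocated on insertion rather than at load time.
 */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 32768);
    __uint(map_flags, BPF_F_NO_PREALLOC);
    __type(key, __u64);  /* cgroup id */
    __type(value, __u8); /* membership marker */
} policy_cgroups_example SEC(".maps");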

holyspectral avatar Oct 16 '25 15:10 holyspectral

@Andreagit97 thanks for posting this issue! The scalability problem that you are describing is definitely something we are aware of, and have been thinking about :)

I'll try to find some time to write more of my thoughts down, but before that I wanted to note two things.

First, we can decouple the policy specification from the underlying implementation of the BPF programs. For what it's worth, the current policies are very close to the BPF code, but they don't have to be. So, in principle, the implementation you are describing could happen with the existing policy scheme (unless I'm missing something).

Second, if we assume generic pod selectors, multiple policies might match the same workload. This means that if we maintain per-workload maps, we would need to combine multiple policies to determine their contents, and we would need to figure out what that means for the map contents when a policy is deleted or added.

kkourt avatar Oct 16 '25 16:10 kkourt

Thank you for the quick feedback!

First, we can decouple the policy specification from the underlying implementation of the BPF programs. For what it's worth, the current policies are very close to the BPF code, but they don't have to be. So, in principle, the implementation you are describing could happen with the existing policy scheme (unless I'm missing something).

Yeah, I agree, if we can reuse the current policy scheme to generate a slightly different eBPF instrumentation that would be great.

Second, if we assume generic pod selectors, multiple policies might match the same workload. This means that if we maintain per-workload maps, we would need to combine multiple policies to determine their contents, and we would need to figure out what that means for the map contents when a policy is deleted or added.

In my above example, I imagine a shared tracing policy (a sort of security profile) where the podSelectors are mutually exclusive. So each workload takes advantage of the shared skeleton and just adds its values, and the eBPF map belongs to the profile (what I called ForEachWorkloadPolicy) rather than to single workloads.

{key: cgroupid, value: "hashset of allowed values"} 
# here each cgroupid has a unique entry because podSelectors are mutually exclusive

So if in my cluster I enforce 3 security profiles (e.g., allowed processes, allowed network connections, allowed file accesses, etc), I imagine the following scenario:

  • [allowed processes] a unique fmod_ret eBPF prog attached on security_bprm_creds_for_exec that does the dispatching according to the cgroup id of the current process
{key: cgroupid-1, value: "/usr/bin/sleep,/usr/bin/cat"}
{key: cgroupid-2, value: "/usr/bin/ls"}
...
  • [allowed file accesses] a unique fmod_ret eBPF prog attached on security_file_open that does the dispatching according to the cgroup id of the current process
  • same for the networking use case

Of course, this is again a high-level picture of the ideal scenario I imagine. Instructing Tetragon to do that is far from easy. But maybe you have something different in mind, where a per-workload policy fits better with the use case.

Andreagit97 avatar Oct 17 '25 12:10 Andreagit97

In my above example, I imagine a shared tracing policy (a sort of security profile) where the podSelectors are mutually exclusive.

I guess my point was that that's not how pod selectors typically work (both from a semantics perspective and also from a user perspective). So if we have this as a requirement, it should be explicit and, IMO, reflected in the syntax.

kkourt avatar Oct 17 '25 13:10 kkourt

Hi all, just a quick update! We are making efforts to minimize the memory footprint per policy (see https://github.com/cilium/tetragon/issues/4210, https://github.com/cilium/tetragon/issues/4204, and others coming soon).

The point is that even if we might reach an acceptable level of memory for each policy (1-2 MB), there is still the open question of the number of eBPF progs attached to the same kernel function. I might be overestimating the issue, but I'm concerned that placing more than 200 programs on a kernel function in the hot path could lead to a system slowdown.

One idea we had to use just one ebpf prog could be the following.

We can introduce a new operator, or a new way to express the ForEachCgroup constraint, in the policy filter.

    - matchArgs:
      - index: 0 
        operator: "EqForEachCgroup" 
        values: [] 
    # Or
    - matchArgs:
      - index: 0 
        forEachCgroup: true
        operator: "Eq" 
        values: [] 
    # Or
    - matchArgs:
      - index: 0 
        operator: "Eq" 
        values: ["*"] 
    # Or whatever...

Let's say we create a policy like this:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "block-not-allowed-process"
spec:
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "EqForEachCgroup"
        values: [] # The actual values will be supplied by other policies deployed later
      matchActions:
      - action: Override
        argError: -1

Let's consider, for now, the case of a linux_binprm arg type. When we evaluate the EqForEachCgroup operator, instead of putting string values into a BPF_MAP_TYPE_ARRAY_OF_MAPS like we do here https://github.com/cilium/tetragon/blob/27c9abe39c448c6f823c607eaad483d8c5717ecb/bpf/process/string_maps.h#L66

we could create string_maps_0, string_maps_1, ... as BPF_MAP_TYPE_HASH_OF_MAPS. This would allow us to use the cgroup ID as the key, and the hashset of strings associated with that cgroup as the value.

{
  outer_key: cgroup_id, 
  outer_value: {
    "/usr/bin/ls": ""
    "/usr/bin/sleep": ""
  }
}

Ideally, the outer_key should be a sort of "workload_id", since multiple cgroups will have the same set of strings to match (e.g., all the pods that belong to the same deployment). So ideally we should have a first map cgroup_id -> workload_id, and then the BPF_MAP_TYPE_HASH_OF_MAPS where the outer_key is "workload_id".

When we evaluate this filter at runtime, instead of using fixed indexes https://github.com/cilium/tetragon/blob/27c9abe39c448c6f823c607eaad483d8c5717ecb/bpf/process/types/basic.h#L725, we get the cgroup ID like we do here https://github.com/cilium/tetragon/blob/27c9abe39c448c6f823c607eaad483d8c5717ecb/bpf/process/policy_filter.h#L62 and we obtain the correct hashset to use for the comparison from our BPF_MAP_TYPE_HASH_OF_MAPS map.
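To make this a bit more concrete, here is a rough BPF C sketch of the two-level lookup (cgroup_id -> workload_id -> hashset of allowed paths). All map, type, and program names are invented for illustration, the path extraction is simplified, and this is not how Tetragon's string_maps are actually implemented:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define PATH_LEN 256

struct path_key {
    char path[PATH_LEN];
};

/* Inner map spec: the hashset of allowed binary paths for one workload. */
struct allowed_paths {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 64);
    __type(key, struct path_key);
    __type(value, __u8);
};

/* cgroup_id -> workload_id: many cgroups (pods) share one set of values. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, __u64);
    __type(value, __u32);
} cgroup_to_workload SEC(".maps");

/* workload_id -> per-workload hashset, populated from user space. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
    __uint(max_entries, 1024);
    __type(key, __u32);
    __array(values, struct allowed_paths);
} workload_values SEC(".maps");

SEC("fmod_ret/security_bprm_creds_for_exec")
int BPF_PROG(enforce_exec, struct linux_binprm *bprm)
{
    struct path_key key = {};
    __u64 cgid = bpf_get_current_cgroup_id();

    /* 1. cgroup id -> workload id */
    __u32 *wid = bpf_map_lookup_elem(&cgroup_to_workload, &cgid);
    if (!wid)
        return 0; /* no values configured for this workload */

    /* 2. workload id -> hashset of allowed paths */
    void *allowed = bpf_map_lookup_elem(&workload_values, wid);
    if (!allowed)
        return 0;

    /* 3. check the candidate binary path (simplified extraction) */
    bpf_probe_read_kernel_str(key.path, sizeof(key.path), bprm->filename);
    if (bpf_map_lookup_elem(allowed, &key))
        return 0; /* allowed */

    return -1; /* matches the argError: -1 used in the policies above */
}

char LICENSE[] SEC("license") = "GPL";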

This first tracing policy only sets up the "skeleton"; at the beginning, the BPF_MAP_TYPE_HASH_OF_MAPS will be completely empty.

We now have to provide the values. At the moment, we haven't found a better way to do that, so the idea is still to use a custom CR:

apiVersion: cilium.io/v1alpha1
kind: ForEachWorkloadPolicyValues
metadata:
  name: "block-not-allowed-process-my-deployment-1" 
spec:
  refPolicy: "block-not-allowed-process" 
  selector:
    matchLabels:
      app: "my-deployment-1"
  values: 
    - "/usr/bin/sleep"
    - "/usr/bin/cat"
    - "/usr/bin/my-server-1"

When this CR is deployed, the BPF_MAP_TYPE_HASH_OF_MAPS will be populated with the right cgroup_id -> hash_set entries. To do that, we should probably reuse the logic of the policyState.
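To make this concrete, here is a rough user-space sketch of what binding a set of values to a cgroup could look like. Tetragon's agent is written in Go (cilium/ebpf), so this libbpf C version and all names in it are purely illustrative:

#include <string.h>
#include <bpf/bpf.h>

/* Sketch: when a ForEachWorkloadPolicyValues CR is applied, create an inner
 * hashset with the allowed paths and bind it to the workload's cgroup id in
 * the outer BPF_MAP_TYPE_HASH_OF_MAPS. All names are invented.
 */
int bind_values_to_cgroup(int outer_map_fd, __u64 cgroup_id,
                          const char **paths, int n_paths)
{
    /* The inner map spec must match the template the outer map was
     * created with (type, key/value size). */
    int inner_fd = bpf_map_create(BPF_MAP_TYPE_HASH, "allowed_paths",
                                  256, 1, 64, NULL);
    if (inner_fd < 0)
        return inner_fd;

    __u8 marker = 1;
    for (int i = 0; i < n_paths; i++) {
        char key[256] = {};
        strncpy(key, paths[i], sizeof(key) - 1);
        bpf_map_update_elem(inner_fd, key, &marker, BPF_ANY);
    }

    /* cgroup_id -> inner map (the value written is the inner map's fd). */
    return bpf_map_update_elem(outer_map_fd, &cgroup_id, &inner_fd, BPF_ANY);
}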

WDYT about this idea? Any idea/suggestion?

Andreagit97 avatar Oct 21 '25 13:10 Andreagit97

I still feel that this does not address my concern in https://github.com/cilium/tetragon/issues/4191#issuecomment-3415576691.

What happens if the user writes:

apiVersion: cilium.io/v1alpha1
kind: ForEachWorkloadPolicyValues
metadata:
  name: "values-1" 
spec:
  refPolicy: "block-not-allowed-process" 
  selector:
    matchLabels:
      app: "my-deployment-1"
  values: 
    - "/usr/bin/sleep"
    - "/usr/bin/cat"
    - "/usr/bin/my-server-1"
apiVersion: cilium.io/v1alpha1
kind: ForEachWorkloadPolicyValues
metadata:
  name: "values-2" 
spec:
  refPolicy: "block-not-allowed-process" 
  selector:
    matchLabels:
      type: "type-2"
  values: 
    - "/usr/bin/sleep"
    - "/usr/bin/cat"
    - "/usr/bin/my-server-2"

kkourt avatar Nov 03 '25 08:11 kkourt

I'll try to find some time to write more of my thoughts down, but before that I wanted to note two things.

Another approach for solving the same issue would be to rely on tail calls. The idea would be that we would still load one program per policy, but we would only attach one program per hook. We can maintain a mapping in a BPF map from workload id -> [policy id], iterate over all policy ids that match a workload, and tail call into the corresponding per-policy program.

@tpapagian has already done an implementation for this, so it's definitely possible.

Note that a benefit of this approach is that it works with existing CRDs without any modification.
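Just to sketch the shape of this approach (this is not the actual implementation mentioned above; all names, sizes, and the continuation mechanism are invented):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_POLICIES        1024
#define MAX_POLICIES_PER_WL 16

struct policy_list {
    __u32 count;
    __u32 ids[MAX_POLICIES_PER_WL];
};

/* workload (cgroup) id -> list of policy ids that match it */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, __u64);
    __type(value, struct policy_list);
} workload_policies SEC(".maps");

/* policy id -> per-policy program, filled when each policy is loaded */
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, MAX_POLICIES);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} policy_progs SEC(".maps");

/* Single program attached to the hook; per-policy programs are only loaded,
 * never attached, and are reached through tail calls. */
SEC("kprobe/security_bprm_creds_for_exec")
int policy_dispatch(struct pt_regs *ctx)
{
    __u64 cgid = bpf_get_current_cgroup_id();
    struct policy_list *pl = bpf_map_lookup_elem(&workload_policies, &cgid);

    if (!pl || pl->count == 0)
        return 0;

    /* Tail calls do not return, so iterating over several matching policies
     * needs each per-policy program to jump to the next one (e.g. via a
     * per-CPU cursor); here we only show the first jump. */
    bpf_tail_call(ctx, &policy_progs, pl->ids[0]);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";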

kkourt avatar Nov 03 '25 08:11 kkourt

As commented on https://github.com/cilium/tetragon/pull/4279#issuecomment-3479398152, I would suggest writing a CFP for this (https://github.com/cilium/design-cfps). We can enumerate the different approaches both from the interface side (CRDs) and from the implementation side. Indeed, I would suggest decoupling the two (design and implementation). For example, we could, in theory, implement your suggestion where the maps are indexed based on the workload without having to introduce a new CRD.

@Andreagit97 would you be interested in co-writing such a CFP? I think it would help make the discussion more concrete and identify the tradeoffs.

kkourt avatar Nov 03 '25 09:11 kkourt

I still feel that this does not address my concern in #4191 (comment).

What happens if the user writes:

...

I suppose that in your example, there is at least one pod that has both labels app: "my-deployment-1" and type: "type-2".

In this case, what happens in the current PoC is that Tetragon logs a warning and overwrites the previous policy with the last one deployed https://github.com/cilium/tetragon/pull/4279/files#diff-4f5ac7f1374ee1c6d614acdce14fefb90671f919074598713db4be746ecbbe1cR78

I have to say that is not the intended usage. The ForEachWorkloadPolicyValues resource just specifies the values for cgroups, so a single cgroup shouldn't have more than one ForEachWorkloadPolicyValues associated with it; otherwise, it would mean that we are associating the same cgroup with multiple values for the same filter.
Let's say cgroup1 is the pod's cgroup involved in the overlap. We would have, at the same time:

cgroup1 -> ["/usr/bin/sleep", "/usr/bin/cat","/usr/bin/my-server-1"]
cgroup1 -> ["/usr/bin/sleep", "/usr/bin/cat","/usr/bin/my-server-2"]

That is kind of a contradiction. But I see your point: this mutual exclusion is probably not clear to the end user, and maybe not flexible enough for use cases different from this one. For sure, if we want to go down this road, we should be more explicit about this mutual exclusion, or at least better highlight the intended usage, both in the CRD and in the documentation.

Andreagit97 avatar Nov 03 '25 16:11 Andreagit97

I'll try to find some time to write more of my thoughts down, but before that I wanted to note two things.

Another approach for solving the same issue would be to rely on tail calls. The idea would be that we would still load one program per policy, but we would only load one program per hook. We can maintain a mapping in a BPF map from workload id -> [policy id], iterate over all policy ids that match a workload and tail call into the corresponding per-policy program.

tpapagian has already done an implementation for this, so it's definitely possible.

Uhm, that sounds really interesting, thank you for pointing this out. Is this something public? If yes, I would love to take a look.

Andreagit97 avatar Nov 03 '25 16:11 Andreagit97

@Andreagit97 would you be interested in co-writing such a CFP? I think it would help make the discussion more concrete and identify the tradeoffs.

Sure, let me take a look at how to do that.

Andreagit97 avatar Nov 03 '25 16:11 Andreagit97

In this case, what happens in the current PoC is that Tetragon logs a warning and overwrites the previous policy with the last one deployed

I would argue that ordering is not a reliable way to disambiguate behavior. For example, one agent might receive the policy CRs in one order, while a different agent on a different node receives them in another order. This would result in two agents having different behaviors, which is undesired behavior IMO.

But I see your point: this mutual exclusion is probably not clear to the end user, and maybe not flexible enough for use cases different from this one. For sure, if we want to go down this road, we should be more explicit about this mutual exclusion, or at least better highlight the intended usage, both in the CRD and in the documentation.

In my opinion, if we indeed go down that road (i.e., the road of templates being mutually exclusive for workloads), we should reflect that in the policy constructs so that it is not possible (or at least really hard) to write a policy with conflicts.

Just to provide some first ideas of what this could look like:

Having something like:

kind: TracingPolicyTemplate
spec:
  podSelector:
      # Specify keys
      matchLabelKeys:
       - "app"
kind: TracingPolicyTemplateValues
spec:
  podSelector:
      # Only equality, and should match whatever is defined in template's `matchLabelKeys`
      matchLabels:
         app: "pizza"

is a step in that direction, but it still leaves the possibility for two policies to have the same labels. One approach there would be to merge the values specified by both policies (although this introduces some challenges in how we maintain the BPF maps).

A more extreme approach would be something like:

kind: TracingPolicyTemplate
metadata:
   name: policy1
spec:
  podSelector:
      # Specify that the (single) label key is "app"
      matchLabelKey: app
kind: TracingPolicyTemplateValues
metadata:
   # the name is <template>-<label-value> so it is guaranteed to be unique
    name: policy1-pizza
spec:

This makes it impossible to write something that has conflicts: because the name is unique and the label value is derived from the name (while the key is fixed by the template), there cannot be two policies that match the same workload.

That being said, if we could address the scalability issue in a way that allows workloads to be matched by multiple policies, then it would serve additional use cases (beyond the mutual-exclusion one) and it would be much closer to what a k8s user would expect.

Is this something public? If yes, I would love to take a look.

Not yet, but we can add the details in the CFP (see below).

@Andreagit97 would you be interested in co-writing such a CFP? I think it would help make the discussion more concrete and identify the tradeoffs.

Sure, let me take a look at how to do that.

Not sure what the proper process would be, but maybe something like https://github.com/kkourt/tetragon-scalability-cfp would work? I've sent an invitation to the repo, and we can, of course, add other folks that are interested in contributing.

If the above approach (that is, repo or co-write CFP) does not work for whatever reason, I'm happy to find an approach that works. My main requirement would be to reach a good understanding of what the different approaches are to address the scalability problem and what the tradeoffs between them are. How we get there is very much up for discussion.

kkourt avatar Nov 05 '25 08:11 kkourt

Not sure what the proper process would be, but maybe something like https://github.com/kkourt/tetragon-scalability-cfp would work? I've sent an invitation to the repo, and we can, of course, add other folks that are interested in contributing.

Thank you for this! I was in the process of creating a PR against the repo https://github.com/cilium/design-cfps/compare/main...Andreagit97:design-cfps:tetragon-workload-policies, but we can use your fork since you created it. How should we interact there? Just pushing on the main branch (with some care, e.g. --force-with-lease), or should we open PRs against the main branch? I was taking inspiration from other CFPs (e.g., https://github.com/cilium/design-cfps/pull/76), and I saw that usually the conversation goes on in PR comments rather than in commits, so I'm not sure about the strategy to follow here to maximize the interaction.

A more extreme approach would be something like:

kind: TracingPolicyTemplate
metadata:
   name: policy1
spec:
  podSelector:
      # Specify that the (single) label key is "app"
      matchLabelKey: app

kind: TracingPolicyTemplateValues
metadata:
    name: policy1-pizza
spec:

This makes it impossible to write something that has conflicts: because the name is unique, and the label key is derived in the name, there cannot be two policies that match the same workload.

Yes, this could be an option. On one side, we gain the mutual exclusion by design; on the other, we require each workload to have some specific labels. For example, if in my cluster I want to enforce a template shared by all the workloads, each one should have the app label with a different value, so there is more effort on the user/automation that needs to create these labels, but this could be a fair price to pay.

That being said, if we could address the scalability issue in a way that allows workloads to be matched by multiple policies, then it would serve additional use cases (beyond the mutual-exclusion one) and it would be much closer to what a k8s user would expect.

I agree; this per-workload policy use-case can be part of a more generic feature. So the mutual exclusion could be just a way to use the feature, but not a compulsory requirement. I'm curious to understand whether the tail call–based approach could offer such flexibility.

Andreagit97 avatar Nov 05 '25 10:11 Andreagit97

Not sure what the proper process would be, but maybe something like https://github.com/kkourt/tetragon-scalability-cfp would work? I've sent an invitation to the repo, and we can, of course, add other folks that are interested in contributing.

Thank you for this! I was in the process of creating a PR against the repo cilium/design-cfps@main...Andreagit97:design-cfps:tetragon-workload-policies, but we can use your fork since you created it. How should we interact there? Just pushing on the main branch (with some care, e.g. --force-with-lease), or should we open PRs against the main branch? I was taking inspiration from other CFPs (e.g., cilium/design-cfps#76), and I saw that usually the conversation goes on in PR comments rather than in commits, so I'm not sure about the strategy to follow here to maximize the interaction.

Maybe working on a PR would be better. Would I be able to do PRs on your branch? If so, maybe:

  • You maintain a PR against cilium/design-cfps
  • Other folk (e.g., myself) can comment on the PR
  • Other folk can raise PRs on your fork with updates that are too big to be made into comments. For example, I can do a PR adding a description for the tail call approach.

Does this work?

kkourt avatar Nov 05 '25 13:11 kkourt

Yes, thank you! I've opened the PR https://github.com/cilium/design-cfps/pull/80 and added you as a collaborator; you should have received the invite. Being a collaborator should be enough to open a PR against the branch

Andreagit97 avatar Nov 05 '25 16:11 Andreagit97

Just a quick recap (the following is tailored to our use case discussed previously in this issue):

Where we started

The 3 main issues we faced until now for our use case:

  1. The maximum number of policies we can create on a node is limited. Due to the eBPF programs we use (fmod_ret), this limit is 38 policies for each node. Even if we change the eBPF program type (kprobe + SIGKILL), there is a hardcoded limit of 128 policies per node inside Tetragon.
  2. Each policy deployed in the cluster requires ~9 MB (on a node with 16 CPUs). This is mainly due to the eBPF maps used by the policy.
  3. Each policy attaches 2 eBPF programs on the same hook in the kernel (kprobe + fmod_ret)

Current situation

  1. We should be able to overcome the number of policy limits with:

    • https://github.com/cilium/tetragon/pull/4244 (one unique fmod_ret prog for our use case)
    • https://github.com/cilium/tetragon/pull/4331
  2. We should reach ~2MB per policy on a machine with 16 CPUs with:

    • https://github.com/cilium/tetragon/pull/4211 (reduce size of socktrack_map)
    • https://github.com/cilium/tetragon/pull/4340 (usage of BPF_F_NO_PREALLOC)
    • https://github.com/cilium/tetragon/pull/4244 (shares the same override_tasks between our policies)

    Unfortunately, there are several maps inside Tetragon that depend on the number of CPUs. Here are some of them:

    • process_call_heap → allocates 25612 bytes for each CPU
    • ratelimit_heap → 356 * ncpu
    • buffer_heap_map → 4356 * ncpu
    • heap → 4108 *ncpu
    • string_postfix_ → 136 * ncpu
    • string_prefix_m → 264 * ncpu
    • tg_ipv6_ext_heap → 16 * ncpu
    • string_maps_heap → 16388 * ncpu
    • data_heap → 32772 * ncpu

    This means that, to a base of ~ 0.6 MB (independent from the number of CPUs), we need to add 84008 bytes (sum of the above) for each CPU.

    • 16 CPUs → 84008 B * 16 + 0.6 MB =~ 1.9 MB
    • 96 CPUs → 84008 B * 96 + 0.6 MB =~ 8.3 MB
  3. With https://github.com/cilium/tetragon/pull/4244, we will have just one eBPF prog for each policy; the fmod_ret prog will be shared among policies. Unfortunately, one eBPF program for each policy is still too much for our use case.

Our Mitigation

Given the memory overhead and the number of eBPF programs, we decided to switch to a custom implementation that injects just one eBPF program for all policies, where each policy just populates a map, binding itself to the involved cgroups (very similar to what we did here https://github.com/cilium/tetragon/pull/4279). Right now, our only use case is enforcing binary execution paths in each container, so one unique fmod_ret prog on security_bprm_creds_for_exec is more than enough. While this solves our needs for now, it's clear that if, in the future, we decide to expand our use cases, we will end up with the same architectural challenges we faced in this issue. For this reason, I believe this CFP is still valid and useful for the future: https://github.com/cilium/design-cfps/pull/80

Andreagit97 avatar Dec 19 '25 16:12 Andreagit97

Just a quick recap (the following is tailored to our use case discussed previously in this issue):

Thanks a lot, it's useful to have that kind of recap; please continue. I see the PRs are slowly making progress, and eventually Tetragon TracingPolicy might scale better with the CFP in progress. Thanks for writing it down.

mtardy avatar Dec 22 '25 15:12 mtardy