Scaling k8s workload-aware tracing policies
Hi all! We would like to use Tetragon to implement per-workload runtime security policies across a Kubernetes cluster. The goal is to establish a "fingerprint" of allowed behavior for every Kubernetes workload (Deployment, StatefulSet, DaemonSet), starting with the strict enforcement of which processes each workload is permitted to spawn.
Let's say in our cluster we have two deployments, my-deployment-1 and my-deployment-2, and we want to enforce the following policies:
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "policy-1"
spec:
  podSelector:
    matchLabels:
      app: "my-deployment-1"
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotEqual"
        values:
        - "/usr/bin/sleep"
        - "/usr/bin/cat"
        - "/usr/bin/my-server-1"
      matchActions:
      - action: Override
        argError: -1
  options:
  - name: disable-kprobe-multi
    value: "1"
---
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "policy-2"
spec:
  podSelector:
    matchLabels:
      app: "my-deployment-2"
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotEqual"
        values:
        - "/usr/bin/ls"
        - "/usr/bin/my-server-2"
      matchActions:
      - action: Override
        argError: -1
  options:
  - name: disable-kprobe-multi
    value: "1"
Let's see what Tetragon injects into the kernel today.
eBPF prog point of view
The above two policies result in the following eBPF programs being attached to the security_bprm_creds_for_exec function:
- generic_kprobe_event (from policy-1) -> other generic kprobes called via tail call
- generic_kprobe_event (from policy-2) -> other generic kprobes called via tail call
- generic_fmodret_override (from policy-1)
- generic_fmodret_override (from policy-2)
Of course, the number of progs will grow linearly with the number of policies (and so k8s workloads in our use case). When the number of policies grows, we hit the following limits:
- The first issue we face is the number of programs we can attach to the same hook. In particular, we have a limit of 38 progs if we use BPF_MODIFY_RETURN. This type of program relies on the eBPF trampoline and is subject to the BPF_MAX_TRAMP_LINKS limit (38 on x86): https://elixir.bootlin.com/linux/v6.14.11/source/include/linux/bpf.h#L1138 (see the snippet right after this list).
- Let's say we overcome this issue using kprobes + sigkill; we then hit a second limit of 128 policies. This limit is hardcoded in the Tetragon code, https://github.com/cilium/tetragon/blob/47538a07a4e6c51a9cc569f78c42a2cf767c5405/bpf/process/policy_filter.h#L23, probably to keep memory usage under control. We can probably overcome this limit as well by making it configurable.
- The third issue, which I think we cannot overcome today, is performance overhead. The list of attached programs grows linearly with the number of policies we create. If we have 500 workloads in the cluster, we will have 500 programs attached to the same function. This could lead to a noticeable system slowdown when a new process is created. The slowdown could be even more relevant if we extend this behavior to other kernel subsystems (e.g., file system/network operations).
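For reference, the trampoline limit mentioned in the first point is a kernel-side constant; simplified, it looks roughly like this (the exact definition varies by architecture and kernel version):

/* include/linux/bpf.h (simplified): maximum number of trampoline-based
 * programs (fentry/fexit/fmod_ret) attachable to a single kernel function. */
enum {
	BPF_MAX_TRAMP_LINKS = 38, /* lower on some architectures */
};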
eBPF maps point of view
For each of the above policies, I see more or less 50 eBPF maps loaded. Most of them have just 1 entry because they are probably not used, but others can take a great amount of memory. The reported memlock for each policy is around 8 MB. The most memory-intensive maps seem to be:
// inner map for each loaded policy with pod selectors
721: hash name policy_1_map flags 0x0
key 8B value 1B max_entries 32768 memlock 2624000B
pids tetragon(63603)
// Still need to check if this is really needed (?)
764: lru_hash name socktrack_map flags 0x0
key 8B value 16B max_entries 32000 memlock 2829696B
btf_id 947
pids tetragon(63603)
// map used for overriding the return value
766: hash name override_tasks flags 0x0
key 8B value 4B max_entries 32768 memlock 2624000B
btf_id 949
pids tetragon(63603)
As you may imagine, also in this case, having 500 deployments in the cluster could lead to significant memory usage on the node (8 MB * 500 = 4 GB).
Summary
With this issue, we just want to highlight the current limitation in scalability that we are facing. I would love your feedback on this. Do you see any mistakes in this analysis? I'm pretty new to Tetragon, so maybe I missed something, and there is a way to overcome some of the above limitations that I didn't consider. If you confirm these are real limitations and you are interested in supporting this use case, we can maybe discuss possible ideas to address them.
Thank you for your time!
Hey 👋 thanks for opening this issue. Let me give you a first answer to some of the topics here.
On programs
Of course, the number of progs will grow linearly with the number of policies (and so k8s workloads in our use case). When the number of policies grows, we hit the following limits:
Indeed, because those are generic sensors, they are wired to be able to perform many different things, and thus you'll see them systematically attached to the hooking points specified by the policy. However, the generic programs should be written keeping in mind that the checks on whether the policy applies should be done early, to minimize the overhead on "non-matching events" (like workloads that don't match the k8s labels, for example). All of that to say: this is the downside of the programs being generic and programmed by map values (but at the same time they provide great programmability).
We have debug commands that might help you there, check out tetra debug progs --help.
With more concrete analysis of a specific use case, we should be able to improve efficiency if we can spot "abnormal" overhead. Maybe also in the number of progs.
On maps
For each of the above policies, I see more or less 50 eBPF maps loaded. Most of them have just 1 entry because they are probably not used
Indeed, there's been an effort previously to correctly resize unused maps before loading them (see for example https://github.com/cilium/tetragon/pull/2546, https://github.com/cilium/tetragon/pull/2551, https://github.com/cilium/tetragon/pull/2555, https://github.com/cilium/tetragon/pull/2563 or https://github.com/cilium/tetragon/pull/2692, etc.). It is actually tricky to completely remove the unused maps; Cilium gave it a go and found a way (see https://github.com/cilium/cilium/pull/40416), but for Tetragon we mostly resize them to size 1, which quite often makes them negligible in memory use. We could eventually do like cilium/cilium, but one can argue that we would mostly gain by reducing the number of maps used by one program (to avoid reaching the limit of 64) rather than on actual memory used.
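As a small aside for anyone experimenting with this: the resizing is just a pre-load adjustment of max_entries. A minimal libbpf sketch (illustrative only, not Tetragon's actual loader, which goes through cilium/ebpf from Go):

#include <errno.h>
#include <bpf/libbpf.h>

/* Shrink a map we know is unused in the current configuration to a single
 * entry; this must run before bpf_object__load(). */
static int shrink_unused_map(struct bpf_object *obj, const char *map_name)
{
	struct bpf_map *map = bpf_object__find_map_by_name(obj, map_name);

	if (!map)
		return -ENOENT;
	return bpf_map__set_max_entries(map, 1);
}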
but others can take a great amount of memory.
We also have an equivalent command for maps and their memory use that might help you there, check out tetra debug maps --help. I also did a bit of research digging into BPF memory use that could be useful to you.
The reported memlock for each policy is around 8 MB.
Because the size of a map must be statically set at loading time, sometimes we just use arbitrary constants that we think will fit most use cases. But many times it's impossible to find a size that fits all, so the solution usually is to provide the ability to resize those via config flags. We also had the idea of proposing "sets of sizes" for maps, fitting use cases for people running a small/medium/large number of workloads/policies for example, which would avoid having to set each map's size one by one. Anyway, having users trying to scale Tetragon would greatly help this area of fine-tuning those numbers and reducing overall BPF map memory use, and we could investigate the specific maps you mentioned in a follow-up.
cc @kkourt as I know (with other people) he's been digging into scaling the number of policies recently as well.
Thank you very much for the quick feedback!
I agree with you that the objective should be to reduce the number of eBPF maps and potentially the number of eBPF programs.
However, this raises a question about the current model: is using generic sensors truly the right way to model the use case of one distinct policy per workload?
Currently, we need to create a dedicated policy for each workload because each requires specifying different values (e.g., allowed binary paths), but the enforcement logic itself is identical across all these policies. The entire process would be significantly simpler if there was a way to define a common enforcement skeleton referenced by all workloads, where each individual workload only supplies the required configuration values.
A possible abstraction to achieve this separation could be as follows:
[!NOTE] Please note this is not a proposal, but just an easy way to explain the idea. For example, instead of using 2 new CRDs, we could use some fields in the existing TracingPolicy CR, like the options one.
We could have two custom resources:
ForEachWorkloadPolicy: Defines the shared, unique enforcement logic (the "skeleton"). This resource is deployed once.
apiVersion: cilium.io/v1alpha1
kind: ForEachWorkloadPolicy
metadata:
  name: "block-not-allowed-process"
spec:
  # This is the hook we want to instrument once
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotEqual"
        values: [] # The actual values will be supplied by the 'Values' CRD
      matchActions:
      - action: Override
        argError: -1
ForEachWorkloadPolicyValues: Deploys the configuration values and selects the specific pods/cgroups to which they apply.
apiVersion: cilium.io/v1alpha1
kind: ForEachWorkloadPolicyValues
metadata:
  name: "block-not-allowed-process-my-deployment-1"
spec:
  # Reference to the policy (a unique ID, not just the name, should be used for robustness)
  refPolicy: "block-not-allowed-process"
  # Select the pods in the workload
  podSelector:
    matchLabels:
      app: "my-deployment-1"
  values:
  - "/usr/bin/sleep"
  - "/usr/bin/cat"
  - "/usr/bin/my-server-1"
---
apiVersion: cilium.io/v1alpha1
kind: ForEachWorkloadPolicyValues
metadata:
  name: "block-not-allowed-process-my-deployment-2"
spec:
  refPolicy: "block-not-allowed-process"
  podSelector:
    matchLabels:
      app: "my-deployment-2"
  values:
  - "/usr/bin/ls"
  - "/usr/bin/my-server-2"
On the eBPF side, the ForEachWorkloadPolicy would be responsible for loading and attaching a unique eBPF program. Each ForEachWorkloadPolicyValues resource would then populate a map to associate a given cgroup ID with its unique set of filters.
__attribute__((section("fmod_ret/security_bprm_creds_for_exec"), used)) long
per_workload_fmodret_override(void *ctx)
{
	// pseudo code to explain the idea

	// 1. Get the cgroup ID of the current process
	__u64 cgroupid = tg_get_current_cgroup_id();

	// 2. Retrieve the filters associated with that cgroup ID.
	//    The map will be populated by each new `ForEachWorkloadPolicyValues`:
	//    map[cgroupid] -> filters_map_id

	// 3. Perform the enforcement based on the retrieved filters
	if (match) {
		return 0;
	}

	// The error is defined once in the policy definition CRD
	return error;
}
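To make the map mentioned in the pseudocode a bit more concrete, here is what the cgroup-keyed lookup table could look like with libbpf-style BTF map definitions (purely illustrative names and sizes, not Tetragon's actual layout):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Illustrative only: the "map[cgroupid] -> filters_map_id" from the pseudocode.
 * Each ForEachWorkloadPolicyValues would add one entry per matching cgroup. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024); /* max workloads per node, arbitrary here */
	__type(key, __u64);        /* cgroup id */
	__type(value, __u32);      /* id/index of the per-workload filter map */
} cgroup_filters SEC(".maps");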
While I would prefer to leverage the existing generic sensors model, I believe it would be genuinely difficult to achieve this level of logic/value separation and resource optimization using that model. I'm very curious to hear your opinions on this. Do you think it is possible to achieve something similar with the current generic sensors model?
The entire process would be significantly simpler if there was a way to define a common enforcement skeleton referenced by all workloads, where each individual workload only supplies the required configuration values.
We could probably do this by extending the existing policy filter, which already stores cgroup->policy mappings, to include another mapping for cgroupid->values?
For memory utilization, the bpf.BPF_F_NO_PREALLOC option should reduce the memory used by eBPF maps a lot, but apparently Tetragon only enables bpf.BPF_F_NO_PREALLOC on some of the maps. I wonder if this is on purpose?
I understand that bpf.BPF_F_NO_PREALLOC might not be ideal due to https://github.com/torvalds/linux/commit/94dacdbd5d2d, but it doesn't explain why those flags are still there.
@kkourt @mtardy not sure if you know some background?
For memory utilization, the bpf.BPF_F_NO_PREALLOC option should reduce the memory used by eBPF maps a lot, but apparently Tetragon only enables bpf.BPF_F_NO_PREALLOC on some of the maps. I wonder if this is on purpose?
I understand that bpf.BPF_F_NO_PREALLOC might not be ideal due to torvalds/linux@94dacdbd5d2d, but it doesn't explain why those flags are still there.
This is a tool that can be leveraged for reducing memory use of maps but:
- It doesn't apply on all the maps
- We could argue that it mostly delays memory consumption instead of reducing it
I would use it in the last steps of trying to tune memory use of those tbh. If you think you will gain memory long term by having NO_PREALLOC, your map is certainly just too large and could be resized.
I think if we want to tackle the BPF map memory use problem again we should:
- Verify if it's an actual issue.
- Check which maps are the biggest consumers.
- (hopefully we already fixed this) If the map is not used because the feature is not enabled, resize it to 1 to minimize impact.
- Resize the maps to a better size, add a flag for resizing, or group that size along with other maps.
- Then consider using NO_PREALLOC to optimize startup memory / situations in which we don't want to scale.
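For completeness, for hash maps this is just a flag set at map definition time; a minimal BTF-style sketch (illustrative, not one of Tetragon's actual maps):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 32768);
	__uint(map_flags, BPF_F_NO_PREALLOC); /* allocate entries lazily instead of up front */
	__type(key, __u64);
	__type(value, __u32);
} example_policy_map SEC(".maps");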
@mtardy thanks for the prompt and detailed response! Yes it probably doesn't fit all the maps. Our focus for now is the policy_filter_maps and its inner map.
In our use case, we would like to have policies defined via TracingPolicyNamespaced and TracingPolicy CRs. The problem is that we don't really know how many there will be until those CRs are created, because it varies across clusters. Without the policy_filter_maps being dynamically allocatable, we would end up with many rounds of tuning to find a good maximum size for the map, and with over-provisioning.
That's why I think BPF_F_NO_PREALLOC might be a good fit here. I hope this makes sense and we would love to contribute to improve the scalability of the policy_filter_maps.
@Andreagit97 thanks for posting this issue! The scalability problem that you are describing is definitely something we are aware of, and have been thinking about :)
I'll try to find some time to write more of my thoughts down, but before that I wanted to note two things.
First, we can decouple the policy specification from the underlying implementation of the BPF programs. For what it's worth, the current policies are very close to the BPF code, but they don't have to be. So, in principle, the implementation you are describing could happen with the existing policy scheme (unless I'm missing something).
Second, if we assume generic pod selectors, multiple policies might match the same workload. This means that, if we maintain per-workload maps, we would need to combine multiple policies to determine their contents, and we would need to figure out what that means for the map contents when a policy is deleted or added.
Thank you for the quick feedback!
First, we can decouple the policy specification from the underlying implementation of the BPF programs. For what it's worth, the current policies are very close to the BPF code, but they don't have to be. So, in principle, the implementation you are describing could happen with the existing policy scheme (unless I'm missing something).
Yeah, I agree, if we can reuse the current policy scheme to generate a slightly different eBPF instrumentation that would be great.
Second, if we assume generic pod selectors, multiple policies might match the same workload. This means that, if we maintain per-workload maps, we would need to combine multiple policies to determine their contents, and we would need to figure out what that means for the map contents when a policy is deleted or added.
In my above example, I imagine a shared tracing policy (a sort of security profile) where podSelectors are mutually exclusive. So each workload takes advantage of the shared skeleton and just adds its values.
So the eBPF map belongs to the profile (what I called ForEachWorkloadPolicy) rather than to single workloads.
{key: cgroupid, value: "hashset of allowed values"}
# here each cgroupid has a unique entry because podSelectors are mutually exclusive
So if in my cluster I enforce 3 security profiles (e.g., allowed processes, allowed network connections, allowed file accesses, etc), I imagine the following scenario:
- [allowed processes] a unique fmod_ret eBPF prog attached on security_bprm_creds_for_exec that does the dispatching according to the cgroup id of the current process:
{key: cgroupid-1, value: "/usr/bin/sleep,/usr/bin/cat"}
{key: cgroupid-2, value: "/usr/bin/ls"}
...
- [allowed file accesses] a unique fmod_ret eBPF prog attached on security_file_open that does the dispatching according to the cgroup id of the current process
- same for the networking use case
Of course, this is again a high-level picture of the ideal scenario I imagine. Instructing Tetragon to do that is far from easy. But maybe you have something different in mind, where a per-workload policy fits better with the use case.
In my above example, I imagine a shared tracing policy (a sort of security profile) where podSelectors are mutually exclusive.
I guess my point was that that's not how pod selectors typically work (both from a semantics perspective and also from a user perspective). So if we have this as a requirement, it should be explicit and, IMO, reflected in the syntax.
Hi all, just a quick update! We are making efforts to minimize the memory footprint per policy (see https://github.com/cilium/tetragon/issues/4210, https://github.com/cilium/tetragon/issues/4204, and others coming soon).
The point is that even if we might reach an acceptable level of memory for each policy (1/2 MB), there is still the open question of the number of eBPF progs attached to the same kernel function. I might be overestimating the issue, but I'm concerned that placing more than 200 programs on a kernel function in the hot path could lead to a system slowdown.
One idea we had for using just one eBPF prog is the following.
We can introduce a new operator or a new way to express the ForEachCgroup constraint in the policy filter.
- matchArgs:
  - index: 0
    operator: "EqForEachCgroup"
    values: []
# Or
- matchArgs:
  - index: 0
    forEachCgroup: true
    operator: "Eq"
    values: []
# Or
- matchArgs:
  - index: 0
    operator: "Eq"
    values: ["*"]
# Or whatever...
Let's say we create a policy like this:
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "block-not-allowed-process"
spec:
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "EqForEachCgroup"
        values: [] # The actual values will be supplied by other policies deployed later
      matchActions:
      - action: Override
        argError: -1
Let's consider, for now, the case of a linux_binprm arg type. When we evaluate the EqForEachCgroup operator, instead of putting string values into a BPF_MAP_TYPE_ARRAY_OF_MAPS like we do here https://github.com/cilium/tetragon/blob/27c9abe39c448c6f823c607eaad483d8c5717ecb/bpf/process/string_maps.h#L66,
we could create string_maps_0, string_maps_1, ... as BPF_MAP_TYPE_HASH_OF_MAPS. This would allow us to use the cgroup id as the key and, as the value, the hashset of strings associated with that cgroup.
{
  outer_key: cgroup_id,
  outer_value: {
    "/usr/bin/ls": ""
    "/usr/bin/sleep": ""
  }
}
Ideally, the outer_key should be a sort of "workload_id", since multiple cgroups will have the same set of strings to match (e.g., all the pods that belong to the same deployment). So ideally we should have a first map cgroup_id -> workload_id and then the BPF_MAP_TYPE_HASH_OF_MAPS where the outer_key is "workload_id".
When we evaluate this filter at runtime instead of using fixed indexes https://github.com/cilium/tetragon/blob/27c9abe39c448c6f823c607eaad483d8c5717ecb/bpf/process/types/basic.h#L725 we get the cgroupid like we do here https://github.com/cilium/tetragon/blob/27c9abe39c448c6f823c607eaad483d8c5717ecb/bpf/process/policy_filter.h#L62
and we obtain the correct hashset to use for the comparison using our BPF_MAP_TYPE_HASH_OF_MAPS map.
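Putting the pieces above together, the per-exec lookup could look roughly like the sketch below (CO-RE style, assuming vmlinux.h; all map names, sizes, and the path handling are illustrative placeholders, and real code would reuse Tetragon's existing string-matching machinery rather than a raw read of bprm->filename):

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

#define MAX_PATH_LEN 256

struct path_key {
	char path[MAX_PATH_LEN];
};

/* cgroup id -> workload id, so all pods of the same Deployment share one entry */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096);
	__type(key, __u64);
	__type(value, __u32);
} cgroup_to_workload SEC(".maps");

/* inner map: hash "set" of allowed binary paths for one workload */
struct allowed_paths {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 64);
	__type(key, struct path_key);
	__type(value, __u8);
};

/* outer map: workload id -> allowed paths */
struct {
	__uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
	__uint(max_entries, 1024);
	__type(key, __u32);
	__array(values, struct allowed_paths);
} workload_filters SEC(".maps");

SEC("fmod_ret/security_bprm_creds_for_exec")
int BPF_PROG(enforce_exec, struct linux_binprm *bprm)
{
	struct path_key key = {};
	__u64 cgroupid = bpf_get_current_cgroup_id();
	__u32 *wid;
	void *allowed;

	wid = bpf_map_lookup_elem(&cgroup_to_workload, &cgroupid);
	if (!wid)
		return 0; /* no values installed for this cgroup: allow */

	allowed = bpf_map_lookup_elem(&workload_filters, wid);
	if (!allowed)
		return 0;

	/* stand-in for Tetragon's linux_binprm argument handling */
	bpf_probe_read_kernel_str(key.path, sizeof(key.path),
				  BPF_CORE_READ(bprm, filename));

	if (bpf_map_lookup_elem(allowed, &key))
		return 0; /* allowed path */
	return -1;        /* corresponds to the policy's `argError: -1` */
}

char LICENSE[] SEC("license") = "GPL";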
This first tracing policy only sets up the "skeleton"; at the beginning, the BPF_MAP_TYPE_HASH_OF_MAPS will be completely empty.
We now have to provide values. At the moment, we haven't found a better way to do that, so the idea is still to use a custom CR:
apiVersion: cilium.io/v1alpha1
kind: ForEachWorkloadPolicyValues
metadata:
  name: "block-not-allowed-process-my-deployment-1"
spec:
  refPolicy: "block-not-allowed-process"
  selector:
    matchLabels:
      app: "my-deployment-1"
  values:
  - "/usr/bin/sleep"
  - "/usr/bin/cat"
  - "/usr/bin/my-server-1"
When this CR is deployed, the BPF_MAP_TYPE_HASH_OF_MAPS will be populated with the right cgroup_id -> hash_set entries. To do that, we should probably reuse the logic of the policyState.
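For illustration, populating a hash-of-maps from userspace is roughly the following dance (a libbpf C sketch with made-up names and no fd cleanup; Tetragon itself would do this from its Go agent via cilium/ebpf):

#include <string.h>
#include <linux/types.h>
#include <bpf/bpf.h>

/* Create an inner "allow set", fill it with the values from one
 * ForEachWorkloadPolicyValues, and bind it to a cgroup id in the outer
 * BPF_MAP_TYPE_HASH_OF_MAPS. */
static int install_values(int outer_fd, __u64 cgroup_id,
			  const char **paths, int npaths)
{
	__u8 one = 1;
	char key[256]; /* must match the inner map's key size on the BPF side */
	int i, err;

	int inner_fd = bpf_map_create(BPF_MAP_TYPE_HASH, "allowed_paths",
				      sizeof(key), sizeof(one), 64, NULL);
	if (inner_fd < 0)
		return inner_fd;

	for (i = 0; i < npaths; i++) {
		memset(key, 0, sizeof(key));
		strncpy(key, paths[i], sizeof(key) - 1);
		err = bpf_map_update_elem(inner_fd, key, &one, BPF_ANY);
		if (err)
			return err;
	}

	/* the outer map stores the inner map by fd at update time */
	return bpf_map_update_elem(outer_fd, &cgroup_id, &inner_fd, BPF_ANY);
}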
WDYT about this idea? Any idea/suggestion?
I still feel that this does not address my concern in https://github.com/cilium/tetragon/issues/4191#issuecomment-3415576691.
What happens if the user writes:
apiVersion: cilium.io/v1alpha1
kind: ForEachWorkloadPolicyValues
metadata:
  name: "values-1"
spec:
  refPolicy: "block-not-allowed-process"
  selector:
    matchLabels:
      app: "my-deployment-1"
  values:
  - "/usr/bin/sleep"
  - "/usr/bin/cat"
  - "/usr/bin/my-server-1"
---
apiVersion: cilium.io/v1alpha1
kind: ForEachWorkloadPolicyValues
metadata:
  name: "values-2"
spec:
  refPolicy: "block-not-allowed-process"
  selector:
    matchLabels:
      type: "type-2"
  values:
  - "/usr/bin/sleep"
  - "/usr/bin/cat"
  - "/usr/bin/my-server-2"
I'll try to find some time to write more of my thoughts down, but before that I wanted to note two things.
Another approach for solving the same issue would be to rely on tail calls. The idea would be that we would still load one program per policy, but we would only load one program per hook. We can maintain a mapping in a BPF map from workload id -> [policy id], iterate over all policy ids that match a workload and tail call into the corresponding per-policy program.
@tpapagian has already done an implementation for this, so it's definitely possible.
Note that a benefit of this approach is that it works with existing CRDs without any modification.
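For readers less familiar with tail calls, here is a rough sketch of the general shape such a dispatcher could take (this is not the actual implementation; all names, sizes, and the workload-id resolution are placeholders, and a real version would reset the cursor in a dedicated entry program):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_POLICIES_PER_WORKLOAD 8

struct policy_list {
	__u32 ids[MAX_POLICIES_PER_WORKLOAD]; /* 0 means "no policy" */
};

/* workload id -> list of matching policy ids (maintained from userspace) */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u32);
	__type(value, struct policy_list);
} workload_policies SEC(".maps");

/* policy id -> per-policy program (one program loaded per policy) */
struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1024);
	__type(key, __u32);
	__type(value, __u32);
} policy_progs SEC(".maps");

/* Per-CPU cursor: bpf_tail_call() never returns, so each per-policy program
 * tail-calls back into this dispatcher, which advances the cursor. */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u32);
} cursor SEC(".maps");

SEC("kprobe/security_bprm_creds_for_exec")
int dispatcher(struct pt_regs *ctx)
{
	__u32 zero = 0, wid = 0; /* wid: resolved from the current cgroup in real code */
	struct policy_list *pl = bpf_map_lookup_elem(&workload_policies, &wid);
	__u32 *cur = bpf_map_lookup_elem(&cursor, &zero);
	__u32 i, policy_id;

	if (!pl || !cur)
		return 0;
	i = *cur;
	if (i >= MAX_POLICIES_PER_WORKLOAD)
		return 0;
	policy_id = pl->ids[i];
	if (!policy_id)
		return 0;       /* no more policies match this workload */
	*cur = i + 1;

	/* jump into the program loaded for this policy; it tail-calls back here */
	bpf_tail_call(ctx, &policy_progs, policy_id);
	return 0;               /* reached only if the tail call failed */
}

char LICENSE[] SEC("license") = "GPL";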
As commented on https://github.com/cilium/tetragon/pull/4279#issuecomment-3479398152, I would suggest writing a CFP for this (https://github.com/cilium/design-cfps). We can enumerate the different approaches both from the interface side (CRDs) but also the implementation. Indeed, I would suggest decoupling the two (design and implementation). For example, we could, in theory, implement your suggestion where the maps are indexed based on the workload without having to introduce a new CRD.
@Andreagit97 would you be interested in co-writing such a CFP? I think it would help make the discussion more concrete and identify the tradeoffs.
I still feel that this does not address my concern in #4191 (comment).
What happens if the user writes:
...
I suppose that in your example, there is at least one pod that has both labels app: "my-deployment-1" and type: "type-2".
In this case, what happens in the current PoC is that Tetragon logs a warning and overwrites the previous policy with the last one deployed https://github.com/cilium/tetragon/pull/4279/files#diff-4f5ac7f1374ee1c6d614acdce14fefb90671f919074598713db4be746ecbbe1cR78
I have to say that this is not the intended usage. The ForEachWorkloadPolicyValues resource just specifies the values for cgroups, so a single cgroup shouldn't have more than one ForEachWorkloadPolicyValues associated with it; otherwise, it would mean that we are associating the same cgroup with multiple values for the same filter.
Let's say cgroup1 is the pod's cgroup involved in the overlap.
We would have at the same time
cgroup1 -> ["/usr/bin/sleep", "/usr/bin/cat","/usr/bin/my-server-1"]
cgroup1 -> ["/usr/bin/sleep", "/usr/bin/cat","/usr/bin/my-server-2"]
That is kind of a contradiction. But I see your point: this mutual exclusion is probably not clear to the end user, and maybe not flexible enough for use cases different from this one. For sure, if we want to go down this road, we should be more explicit about this mutual exclusion, or at least better highlight the intended usage, both in the CRD and in the documentation.
I'll try to find some time to write more of my thoughts down, but before that I wanted to note two things.
Another approach for solving the same issue would be to rely on tail calls. The idea would be that we would still load one program per policy, but we would only load one program per hook. We can maintain a mapping in a BPF map from workload id -> [policy id], iterate over all policy ids that match a workload and tail call into the corresponding per-policy program.
tpapagian has already done an implementation for this, so it's definitely possible.
Uhm, that sounds really interesting, thank you for pointing this out. Is this something public? If yes, I would love to take a look.
@Andreagit97 would you be interested in co-writing such a CFP? I think it would help make the discussion more concrete and identify the tradeoffs.
Sure, let me take a look at how to do that.
In this case, what happens in the current PoC is that Tetragon logs a warning and overwrites the previous policy with the last one deployed
I would argue that ordering is not a reliable way to disambiguate behavior. For example, one agent might receive the policy CRs in one order, while a different agent on a different node receives them in another order. This would result in two agents having different behaviors, which is undesirable IMO.
But I see your point: this mutual exclusion is probably not clear to the end user, and maybe not flexible enough for use cases different from this one. For sure, if we want to go down this road, we should be more explicit about this mutual exclusion, or at least better highlight the intended usage, both in the CRD and in the documentation.
In my opinion, if we indeed go down that road (i.e., the road of templates being mutually exclusive for workloads), we should reflect that in the policy constructs so that it is not possible (or at least really hard) to write a policy with conflicts.
Just to provide some first ideas of what this could look like:
Having something like:
kind: TracingPolicyTemplate
spec:
  podSelector:
    # Specify keys
    matchLabelKeys:
    - "app"

kind: TracingPolicyTemplateValues
spec:
  podSelector:
    # Only equality, and should match whatever is defined in template's `matchLabelKeys`
    matchLabels:
      app: "pizza"
is a step in that direction, but it still leaves the possibility for two policies to have the same labels. One approach there would be to merge the values specified by both policies (although this introduces some challenges in how we maintain the BPF maps).
A more extreme approach would be something like:
kind: TracingPolicyTemplate
metadata:
  name: policy1
spec:
  podSelector:
    # Specify that the (single) label key is "app"
    matchLabelKey: app
kind: TracingPolicyTemplateValues
metadata:
  # the name is <template>-<label-value> so it is guaranteed to be unique
  name: policy1-pizza
spec:
This makes it impossible to write something that has conflicts: because the name is unique and encodes the label value, there cannot be two policies that match the same workload.
That being said, if we could address the scalability issue in a way that we allow workloads to be matched by multiple policies, then it would serve additional use-cases (than the mutual exclusion one) and it would be much closer to what a k8s user would expect.
Is this something public? If yes, I would love to take a look.
Not yet, but we can add the details in the CFP (see below).
@Andreagit97 would you be interested in co-writing such a CFP? I think it would help make the discussion more concrete and identify the tradeoffs.
Sure, let me take a look at how to do that.
Not sure what the proper process would be, but maybe something like https://github.com/kkourt/tetragon-scalability-cfp would work? I've sent an invitation to the repo, and we can, of course, add other folks that are interested in contributing.
If the above approach (that is, repo or co-write CFP) does not work for whatever reason, I'm happy to find an approach that works. My main requirement would be to reach a good understanding of what the different approaches are to address the scalability problem and what the tradeoffs between them are. How we get there is very much up for discussion.
Not sure what the proper process would be, but maybe something like https://github.com/kkourt/tetragon-scalability-cfp would work? I've sent an invitation to the repo, and we can, of course, add other folks that are interested in contributing.
Thank you for this! I was in the process of creating a PR against the repo https://github.com/cilium/design-cfps/compare/main...Andreagit97:design-cfps:tetragon-workload-policies, but we can use your fork since you created it. How should we interact there? Should we just push to the main branch (with some decent criteria, e.g. --force-with-lease), or should we open PRs against the main branch? I was taking inspiration from other CFPs (e.g., https://github.com/cilium/design-cfps/pull/76), and I saw that usually the conversation happens in PR comments rather than in commits, so I'm not sure about the strategy to follow here to maximize interaction.
A more extreme approach would be something like:
kind: TracingPolicyTemplate
metadata:
  name: policy1
spec:
  podSelector:
    # Specify that the (single) label key is "app"
    matchLabelKey: app

kind: TracingPolicyTemplateValues
metadata:
  # the name is <template>-<label-value> so it is guaranteed to be unique
  name: policy1-pizza
spec:
This makes it impossible to write something that has conflicts: because the name is unique and encodes the label value, there cannot be two policies that match the same workload.
Yes, this could be an option. On one side, we gain the mutual exclusion by design; on the other, we require each workload to have some specific labels. For example, if in my cluster I want to enforce a template shared by all the workloads, each one should have the app label with a different value. So there is more effort on the user/automation that needs to create these labels, but this could be a fair price to pay.
That being said, if we could address the scalability issue in a way that we allow workloads to be matched by multiple policies, then it would serve additional use-cases (than the mutual exclusion one) and it would be much closer to what a k8s user would expect.
I agree; this per-workload policy use case can be part of a more generic feature. Mutual exclusion could be something required by only some use cases, but it shouldn't be enforced as a strict requirement. I'm curious to know whether the tail call-based approach could offer such flexibility.
Not sure what the proper process would be, but maybe something like https://github.com/kkourt/tetragon-scalability-cfp would work? I've sent an invitation to the repo, and we can, of course, add other folks that are interested in contributing.
Thank you for this! I was in the process of creating a PR against the repo https://github.com/cilium/design-cfps/compare/main...Andreagit97:design-cfps:tetragon-workload-policies, but we can use your fork since you created it. How should we interact there? Should we just push to the main branch (with some decent criteria, e.g. --force-with-lease), or should we open PRs against the main branch? I was taking inspiration from other CFPs (e.g., cilium/design-cfps#76), and I saw that usually the conversation happens in PR comments rather than in commits, so I'm not sure about the strategy to follow here to maximize interaction.
Maybe working on a PR would be better. Would I be able to do PRs on your branch? If so, maybe:
- You maintain a PR against cilium/design-cfps
- Other folks (e.g., myself) can comment on the PR
- Other folks can raise PRs on your fork with updates that are too big to be made into comments. For example, I can do a PR adding a description of the tail call approach.
Does this work?
Yes, thank you! I've opened the PR https://github.com/cilium/design-cfps/pull/80 and added you as a collaborator; you should have received the invite. Being a collaborator should be enough to open a PR against the branch.
Just a quick recap (the following is tailored to our use case discussed previously in this issue):
Where we started
The 3 main issues we faced until now for our use case:
- The maximum number of policies we can create on a node is limited. Due to the eBPF programs we use (fmod_ret), this limit is 38 policies for each node. Even if we change the EBPF program type (kprobe+sigkill), there is a hardcoded limit of 128 policies per node inside Tetragon.
- Each policy deployed in the cluster requires ~9 MB (on a node with 16 CPUs). This is mainly due to the eBPF maps used by the policy.
- Each policy attaches 2 eBPF programs on the same hook in the kernel (kprobe + fmod_ret)
Current situation
- We should be able to overcome the limits on the number of policies with:
- https://github.com/cilium/tetragon/pull/4244 (one unique fmod_ret prog for our use case)
- https://github.com/cilium/tetragon/pull/4331
- We should reach ~2 MB per policy on a machine with 16 CPUs with:
- https://github.com/cilium/tetragon/pull/4211 (reduce size of socktrack_map)
- https://github.com/cilium/tetragon/pull/4340 (usage of BPF_F_NO_PREALLOC)
- https://github.com/cilium/tetragon/pull/4244 (shares the same override_tasks map between our policies)
Unfortunately, there are several maps inside Tetragon that depend on the number of CPUs (an illustrative per-CPU map definition is sketched at the end of this recap). Here are some of them:
- process_call_heap → allocates 25612 bytes for each CPU
- ratelimit_heap → 356 * ncpu
- buffer_heap_map → 4356 * ncpu
- heap → 4108 *ncpu
- string_postfix_ → 136 * ncpu
- string_prefix_m → 264 * ncpu
- tg_ipv6_ext_heap → 16 * ncpu
- string_maps_heap → 16388 * ncpu
- data_heap → 32772 * ncpu
This means that, on top of a base of ~0.6 MB (independent of the number of CPUs), we need to add 84008 bytes (the sum of the above) for each CPU:
- 16 CPUs → 84008 B * 16 + 0.6 MB =~ 1.9 MB
- 96 CPUs → 84008 B * 96 + 0.6 MB =~ 8.3 MB
- With https://github.com/cilium/tetragon/pull/4244, we will have just one eBPF prog for each policy, since the fmod_ret prog will be shared among policies. Unfortunately, one eBPF program per policy is still too much for our use case.
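To make the per-CPU dependence above concrete: most of those maps are per-CPU scratch buffers, and the kernel allocates one copy of the value per possible CPU. A minimal illustrative definition (not one of Tetragon's actual maps):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Illustrative only: a per-CPU scratch heap. Memlock grows roughly as
 * value_size * nr_possible_cpus * max_entries (plus per-entry overhead),
 * which is why the heaps listed above scale with the CPU count. */
struct scratch_event {
	char buf[25600]; /* same ballpark as process_call_heap's ~25 KB per CPU */
};

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct scratch_event);
} example_heap SEC(".maps");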
Our Mitigation
Given the memory overhead and the number of eBPF programs, we decided to switch to a custom implementation that injects just one eBPF program for all policies, where each policy just populates a map, binding itself to the involved cgroups (very similar to what we did in https://github.com/cilium/tetragon/pull/4279). Right now, our only use case is enforcing binary execution paths on each container, so one unique fmod_ret prog on security_bprm_creds_for_exec is more than enough. While this solves our needs for now, it's clear that if, in the future, we decide to expand our use cases, we will end up with the same architectural challenges we faced in this issue. For this reason, I believe this CFP is still valid and useful for the future: https://github.com/cilium/design-cfps/pull/80
Just a quick recap (the following is tailored to our use case discussed previously in this issue):
Thanks a lot, it's useful to have that kind of recap; please continue. I see the PRs are slowly making progress, and eventually Tetragon TracingPolicy might scale better with the CFP in progress. Thanks for writing it down.