CFP-41634: Datapath Plugins
Summary
Allow operators of Cilium to integrate their own BPF programs into the Cilium datapath to enable custom BPF-based observability, encapsulation, and routing solutions.
Motivation
As a cloud provider, we operate thousands of Cilium-enabled clusters in an environment where other teams and even savvy customers want to deploy their own BPF-based solutions alongside Cilium for various purposes. Their motivations are varied, but tend to fall into several categories:
- Custom transparent proxies
- Custom observability
- Custom encapsulation and routing
Cilium isn't the only component we manage, and while it forms the core of our datapath, we need to combine it with other systems. Up until now, we've needed to steer other teams and customers away from using BPF at all in their solutions, as there was no good way to deploy it and too many places where Cilium could get in the way.
BPF continues to increase in popularity and others want to use it to solve their problems. This proposal seeks to make this easier.
...
[continued in CFP]
cc @cilium/loader @cilium/sig-datapath
Timo and I from @cilium/loader have taken a look at this. I think this is a good step in the right direction, but we see a few issues with the proposed approach.
I think the biggest is the atomicity of plugin attachment. The problem is that there is no way to attach multiple TCX programs atomically. So you have to first load and attach Cilium's program, then the pre/post hook. So now packets may have already been processed by Cilium's program before the plugins are attached. That could be problematic for the plugins as well as for Cilium in the case where the return value does not block the request anymore. This is something to consider on startup as well as on upgrade.
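To make that window concrete, sequential attachment with cilium/ebpf looks roughly like this (just a sketch; cilProg, postHook, and ifindex are placeholder names, not the actual loader code):
package main

import (
    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/link"
)

// attachSequentially illustrates why the attachment is not atomic: the two
// TCX attach operations are separate syscalls, so packets can traverse
// cilProg in the gap between them without ever hitting postHook.
func attachSequentially(ifindex int, cilProg, postHook *ebpf.Program) error {
    cil, err := link.AttachTCX(link.TCXOptions{
        Interface: ifindex,
        Program:   cilProg,
        Attach:    ebpf.AttachTCXIngress,
        Anchor:    link.Head(),
    })
    if err != nil {
        return err
    }
    // Window: cilProg is already processing packets here, but the post hook
    // below is not attached yet, so those packets bypass the plugin.
    _, err = link.AttachTCX(link.TCXOptions{
        Interface: ifindex,
        Program:   postHook,
        Attach:    ebpf.AttachTCXIngress,
        Anchor:    link.AfterLink(cil),
    })
    return err
}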
What we propose is that we look into a solution based on global functions and global function replacement with BPF_PROG_TYPE_EXT.
The way that would work is that we craft a "dispatch" program which will be the program to get attached to a netdev. This program defines global functions which are no-ops by default.
void cilium_pre_1(struct __ctx_buff *ctx) {
    return;
}

int cilium_pre_2(struct __ctx_buff *ctx) {
    return TC_ACT_UNSPEC;
}

void cilium_post_1(struct __ctx_buff *ctx, int ret) {
    return;
}

int cilium_post_2(struct __ctx_buff *ctx, int ret) {
    return ret;
}

__section_entry
static int cilium_dispatch(struct __ctx_buff *ctx) {
    int ret;

    cilium_pre_1(ctx);
    ret = cilium_pre_2(ctx);
    if (ret != TC_ACT_UNSPEC)
        return ret;
    ret = cil_from_container(ctx);
    cilium_post_1(ctx, ret);
    ret = cilium_post_2(ctx, ret);
    return ret;
}
The global function replacement mechanism lets these global functions cilium_{pre,post}_{1,2} be replaced by BPF programs which are loaded separately.
This has a few advantages. The first is that the replacement of these global functions can happen after loading Cilium's program(s) but before attaching. So we would load first, send the program ID/FD + BTF IDs to the plugin, let the plugin attach its hooks, and only then do we attach. This means there is no chance of a packet missing the plugin on initial install, and it makes atomic replacement of Cilium + plugins possible for existing containers.
This method also means that we do not have to use a map to transfer return codes anymore. These global functions can take up to 5 arguments, return code being one of them, but this could be expanded to include other info. There would also be no need to rewrite programs to add the epilogue.
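As a rough sketch of that sequence with cilium/ebpf (the plugin-side half; the object path, program name, and the cilium_pre_2 hook name are placeholders taken from the example above):
package main

import (
    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/link"
)

// attachPreHook loads an EXT program against one of the dispatcher's global
// functions and attaches it via a bpf_link. The dispatcher argument is the
// already-loaded Cilium program, resolved from the ID/FD Cilium handed over.
func attachPreHook(dispatcher *ebpf.Program) (link.Link, error) {
    spec, err := ebpf.LoadCollectionSpec("plugin_hooks.o")
    if err != nil {
        return nil, err
    }

    // Tell the verifier which program and global function this EXT program
    // replaces; its prototype must match the no-op in the dispatcher.
    hook := spec.Programs["my_pre_hook"]
    hook.AttachTarget = dispatcher
    hook.AttachTo = "cilium_pre_2"

    coll, err := ebpf.NewCollection(spec)
    if err != nil {
        return nil, err
    }
    defer coll.Close()

    // Target and function were already supplied at load time, so nil/"" here.
    return link.AttachFreplace(nil, "", coll.Programs["my_pre_hook"])
}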
Using global functions also gives us more control over what a certain hook point can or cannot do. In my example the cilium_pre_1 and cilium_post_1 hooks cannot influence the return code, which is useful for monitoring-style plugins. But you can also imagine changing this such that cilium_post_1 is still called when cilium_pre_2 takes over the packet (skipping cil_from_container), so that a monitoring plugin can also observe what the hooks did. Not to mention the potential for us to hook Hubble into this so we can just log that a plugin took control over a packet.
We could even think about adding these hook points into more internal parts of Cilium, not just pre and post hooks. Not something we are suggesting we do right away but the plugin attachment process for that would be similar.
Global function replacement also works with XDP and, to the best of my knowledge, cgroup programs as well (so it could be used for our socket LB logic). So it is also a more generic solution in that respect.
Having a global function call to a no-op is not free. So when not planning on using any plugins we would simply load + attach the normal root program as we currently do. We should be able to switch the cil_... programs between global and static as we wish, as long as their only argument is the context.
At some point in the future we can look into generating the dispatch programs with exactly the amount of hooks we require (in case of multiple plugins or different plugins).
For the cilium <-> plugin IPC protocol I think gRPC over a unix socket is a good approach. Using a shared BPFFS and pins has a few downsides. Having to make sure that Cilium and the plugin share the same BPFFS is an unnecessary complication. It also assumes any BPF resource is actually pinned in the first place. Lastly, pins are subject to race conditions. I would prefer to just share BPF IDs, which are always unique to the exact resource and do not require pinning or a shared BPFFS.
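For illustration, on the receiving side an ID alone is enough to get a handle with cilium/ebpf (a sketch; the surrounding RPC plumbing is omitted):
package main

import (
    "fmt"

    "github.com/cilium/ebpf"
)

// programFromID turns a BPF program ID received over the IPC channel into a
// usable handle, with no pinning and no shared bpffs involved.
func programFromID(id uint32) (*ebpf.Program, error) {
    prog, err := ebpf.NewProgramFromID(ebpf.ProgramID(id))
    if err != nil {
        // The ID is only valid while something still holds a reference to
        // the program, which is the lifecycle question raised below.
        return nil, fmt.Errorf("resolving plugin program %d: %w", id, err)
    }
    return prog, nil
}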
@dylandreimerink Thanks for taking a look.
I think the biggest is the atomicity of plugin attachment.
Yeah, this is a drawback with the multi-attachment approach. (I wish there was a way to atomically update a set of attachments through TCX)
The way that would work is that we craft a "dispatch" program which will be the program to get attached to a netdev. This program defines global functions which are no-ops by default.
I have a few concerns with this approach:
- ~In your example, how would you ensure that the post-Cilium programs are run if cil_from_container tail calls? I considered an approach not too dissimilar to this early on, but discarded it because of this limitation. Similarly, how would you ensure that the rest of the programs run if the pre-Cilium hook tail calls? One thought is that you could do a similar exit point instrumentation trick as used in the current CFP to allow for easier chaining.~ Edit: Nevermind, it occurs to me now that this should actually work even if the original function tail calls.
- Combining BPF-to-BPF with tail calls enforces a stack size constraint of 256 bytes, down from 512. It seems like this approach would place greater constraints on Cilium and require some refactoring to bring stack sizes down across the board. I've done some experiments around this before and always hit this wall.
For the cilium <-> plugin IPC protocol I think gRPC over a unix socket is a good approach. Using a shared BPFFS and pins has a few downsides. Having to make sure that Cilium and the plugin share the same BPFFS is an unnecessary complication. It also assumes any BPF resource is actually pinned in the first place. Lastly, pins are subject to race conditions. I would prefer to just share BPF IDs, which are always unique to the exact resource and do not require pinning or a shared BPFFS.
One thing to consider is program lifecycle management. My original idea was to just return a program ID instead of doing the pin path dance, because as you point out it adds another point of coordination between Cilium and the plugin. But there's a problem: the plugin itself can't close the file descriptor keeping the program alive before it sends its response; it has to hold its reference to the program so it doesn't get cleaned up before Cilium takes ownership over its lifecycle. So, how do you deal with this?
- One option is to complicate the gRPC interface to add program lifecycle handling (ProgramPut/ProgramRelease), but this creates a stateful interface where Cilium has to keep track of which programs it's "acked" and which it hasn't to avoid resource leaks.
- Another option is to enforce some requirements on the plugin itself and maybe rethink what LoadSKBProgram is meant to do:
  - A plugin maintains its own pins inside BPFFS but only returns program IDs to Cilium. LoadSKBProgram might load a program if it's not already loaded or simply return the ID of the program it's already loaded for that attachment point.
  - But then you need some extra lifecycle management hooks for the plugin, and Cilium must explicitly tell the plugin when it's time to unpin/unload a plugin (UnloadSKBProgram?).
Either way, this introduces more coordination between Cilium and the plugin to manage state. With the pin path stuff, Cilium takes over the program lifecycle after receiving a response from the plugin, so no additional coordination is needed.
I have a few concerns with this approach: Combining BPF-to-BPF with tail calls enforces a stack size constraint of 256 bytes down from 512. It seems like this approach would place greater constraints on Cilium and require some refactoring to bring stack sizes down across the board. I've done some experiments around this before and always hit this wall.
That is true, but as far as I understand it the stack limit only applies to the program and its sub-programs when BPF-to-BPF functions and tail calls are mixed. It does not apply to the tail call targets, where most of our stack usage is. I am working on a POC of my suggestion to see if my understanding is correct. So as long as our entrypoint uses less than 256 bytes, we should not hit the issue for the rest. Worst case we have to tail call from the dispatch into the entrypoint.
Another consideration is the kernel version needed to allow mixing of tail calls and functions. It's 5.10 on x86_64, but 6.0 on ARM64, so plugin support would only work on newer kernels. But given TCX is 6.6 and above, I don't think that is an issue.
Lastly, after drawing it out and giving it some more thought my initial suggestion needs some slight modification. Since cil_... is very likely to tail call we can't just place the post hooks after it in the dispatch program. We will still need to add an epilogue to all programs like in your initial suggestion. However, it would tail call into a common "cil_exit" program which would then call the post hooks.
I end up with this
We would still have the per-CPU map to transfer the return value (and perhaps other info) between cil_epilogue and cil_exit, but that would now be something internal. That way we can change the details or migrate away if we need to.
Long term, the plan for Cilium is to migrate from tail calls to BPF-to-BPF functions. Currently the 6.0 kernel requirement on ARM64 is holding us back from migrating, but with 5.10 becoming EOL on 31 Dec 2026, I expect that to be a concern for at most one or two more Cilium releases, after which we can hopefully change the internals to make it all nice and neat, without modification to the plugin contract.
One thing to consider is program lifecycle management. My original idea was to just return a program ID instead of doing the pin path dance, because as you point out it adds another point of coordination between Cilium and the plugin. But there's a problem: the plugin itself can't close the file descriptor keeping the program alive before it sends its response; it has to hold its reference to the program so it doesn't get cleaned up before Cilium takes ownership over its lifecycle. So, how do you deal with this?
Right. That is a good point, at least for the TCX approach. With the global function replacement, Cilium would load, send over the program ID and BTF IDs needed for the plugin to load + attach the EXT programs. Attachment works via a BPF link which the plugin would have ownership over, being able to pin it or keep hold of its FD. So you can make a choice in the plugin as to what happens when the plugin process exits. The plugin can detach whenever it wishes by removing the link or technically even swap out a program if it so wishes. And Cilium can replace its program whenever it wishes.
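For example (continuing the cilium/ebpf sketch from earlier; the pin path is arbitrary), the plugin's two choices boil down to:
package main

import "github.com/cilium/ebpf/link"

// keepOrDrop sketches the plugin's ownership choices for its freplace link:
// pin it so the attachment outlives the plugin process, or close it to detach.
func keepOrDrop(l link.Link, keep bool) error {
    if keep {
        return l.Pin("/sys/fs/bpf/my-plugin/pre-hook-link")
    }
    return l.Close()
}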
I would have to check what happens when we remove the last reference to the dispatch program. I presume the EXT program holds no reference, so it detaches, but I am not sure what that does to the link.
Like I mentioned, I hope to put together a rough POC to see if this even works, borrowing some parts from yours if you don't mind.
That is true, but as far as I understand it the stack limit only applies to the program and its sub-programs when BPF-to-BPF functions and tail calls are mixed. It does not apply to the tail call targets, where most of our stack usage is. I am working on a POC of my suggestion to see if my understanding is correct. So as long as our entrypoint uses less than 256 bytes, we should not hit the issue for the rest. Worst case we have to tail call from the dispatch into the entrypoint.
Yeah, fair point. This may end up being a non-issue in that case.
Another consideration is the kernel version needed to allow mixing of tail calls and functions. It's 5.10 on x86_64, but 6.0 on ARM64, so plugin support would only work on newer kernels. But given TCX is 6.6 and above, I don't think that is an issue.
Yeah. I'll have to double check on RHEL-based kernels as well here, since they're always a special case. RHEL 9 should be fine, but I'm not sure about RHEL 8 kernels (a persistent thorn in our side but unfortunately something I need to support). TCX isn't supported in either; my original plan was to use legacy TC to achieve the same effect. Hopefully, the approach you proposed would work on all these platforms, but I need to do some experimentation to be sure.
Lastly, after drawing it out and giving it some more thought my initial suggestion needs some slight modification. Since cil_... is very likely to tail call we can't just place the post hooks after it in the dispatch program. We will still need to add an epilogue to all programs like in your initial suggestion. However, it would tail call into a common "cil_exit" program which would then call the post hooks.
Yep, this would work.
Right. That is a good point, at least for the TCX approach. With the global function replacement, Cilium would load, send over the program ID and BTF IDs needed for the plugin to load + attach the EXT programs. Attachment works via a BPF link which the plugin would have ownership over, being able to pin it or keep hold of its FD. So you can make a choice in the plugin as to what happens when the plugin process exits. The plugin can detach whenever it wishes by removing the link or technically even swap out a program if it so wishes. And Cilium can replace its program whenever it wishes.
Yeah, the contract would look different with this approach, obviating the need for the plugin to return a reference to a program. So, this should also be a non-issue if we go in this direction.
I would have to check what happens when we remove the last reference to the dispatch program. I presume the EXT program holds no reference, so it detaches, but I am not sure what that does to the link.
Like I mentioned, I hope to put together a rough POC to see if this even works, borrowing some parts from yours if you don't mind.
Sure. Overall, I like the approach if it works. In the meantime, I need to check on the RHEL kernels to see what level of support they offer for these features. Hopefully the answer is "everything works". Worst case, I might have to consider some alternatives just for these platforms to make plugins work.
[jrife_google_com@rocky-8 ebpf]$ git diff
diff --git a/link/tracing_test.go b/link/tracing_test.go
index a53445c..fb7ebda 100644
--- a/link/tracing_test.go
+++ b/link/tracing_test.go
@@ -10,7 +10,7 @@ import (
)
func TestFreplace(t *testing.T) {
- testutils.SkipOnOldKernel(t, "5.10", "freplace")
+ /* testutils.SkipOnOldKernel(t, "5.10", "freplace") */
file := testutils.NativeFile(t, "../testdata/freplace-%s.elf")
spec, err := ebpf.LoadCollectionSpec(file)
[jrife_google_com@rocky-8 ebpf]$ sudo go test ./link/... -run Freplace -v
=== RUN TestFreplace
=== RUN TestFreplace/link/pinning
=== RUN TestFreplace/link/update
link_test.go:181: tracing update: not supported
=== RUN TestFreplace/link/info
=== RUN TestFreplace/from_fd
--- PASS: TestFreplace (0.05s)
--- PASS: TestFreplace/link/pinning (0.00s)
--- SKIP: TestFreplace/link/update (0.00s)
--- PASS: TestFreplace/link/info (0.00s)
--- PASS: TestFreplace/from_fd (0.00s)
PASS
ok github.com/cilium/ebpf/link (cached)
[jrife_google_com@rocky-8 ebpf]$ uname -r
4.18.0-553.72.1.el8_10.x86_64
[jrife_google_com@rocky-8 ebpf]$
I think the RHEL 4.18 kernel has support for BPF_PROG_TYPE_EXT. I ran the TestFreplace test case on a rocky-8 instance and it seems to work. This also appears to be in the kernel's source.
I need to check on the RHEL kernels to see what level of support they offer for these features.
I think the RHEL 4.18 kernel has support for BPF_PROG_TYPE_EXT. I ran the TestFreplace test case on a rocky-8 instance and it seems to work.
Alright, that is good to know. I am not sure if it matters that much; I hope it's acceptable to have a slightly higher kernel requirement for plugin support.
I have been playing around with this a bit. In my testing it seems tail call and global function mixing works. However, I did find what I think is a deal breaker. We have an interesting mechanism in Cilium where we can do "local delivery", meaning that when a pod tries to talk to another pod on the same node we will actually tail call into the policy program of the destination pod to perform ingress policy before redirecting the packet. But since that policy program is loaded as part of the destination pod's program, it would call the post hooks of that program before the packet is even redirected.
Even after a migration from tail calls to BPF-to-BPF functions we would have this side effect. The reason for doing it this way is that we attach our BPF programs only to the veth in the host NS, to prevent containers from removing BPF programs on the container-side veth to bypass policy. Once Netkit becomes the default, I don't think we will have a need for this anymore, and the nice initial version can work.
So for the time being we still need a way to work around it. I have considered a few ways to go about it, and what I settled on is an approach that is in-between the TCX and PROG_EXT design.
See, the TCX-based approach has the advantage that no matter where you tail call in Cilium, execution always returns to the TCX hook after it. I still think that PROG_EXT is the way to go in terms of how plugins attach though. This tail calling into different collections is only really an issue for the LXC and host programs. So the simpler approach still works for XDP and cgroup programs. PROG_EXT still allows for hook points inside datapath code (once we have converted to functions). And it allows us to move internal details around.
So, what I came up with as a hybrid approach is the following.
We go with the dispatcher model, but only for the pre hooks. The dispatcher does a tail call into the entrypoint so it can still keep its 512 byte stack (for now). We use the logic you proposed to store the return value in a per-CPU map (ipb for inter-program-buffer) and return the UNSPEC return code. We then have the cil_post program, essentially also a dispatcher, which calls the post global functions.
The difference being that Cilium takes ownership over all of it, the programs, attachment, and the per-CPU map. Cilium is free to change these to better implementations once we can. The plugin interface will be the global functions.
Now this still leaves the atomic attaching of TCX programs. I think I also have a workable solution for that. What we do is we make a boolean global variable with which we can bypass the programs. I called it program_enable in the above diagram. What will happen is the dispatcher checks this value, and if false returns UNSPEC and also updates the inter-program-buffer to notify the post hook to do the same. That way we never get issues where a packet skipped Cilium but still ended up in the post hook. This also works for the short time where the dispatcher is attached but the post hook is not yet.
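A sketch of the agent-side flip (assuming a recent cilium/ebpf where writable globals are exposed through Collection.Variables, and a non-const program_enable variable in the dispatcher object):
package main

import (
    "fmt"

    "github.com/cilium/ebpf"
)

// enableDispatch flips the bypass switch once both the dispatcher and the
// post program are attached; until then both return UNSPEC unconditionally
// and the old programs keep handling traffic.
func enableDispatch(coll *ebpf.Collection) error {
    v, ok := coll.Variables["program_enable"]
    if !ok {
        return fmt.Errorf("dispatcher object has no program_enable variable")
    }
    // program_enable must live in .data (not declared const) so that it
    // remains writable after load.
    return v.Set(uint32(1))
}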
We then combine that with a slight change in how we do the attaching, going from this.
To
These draw out every distinct phase in which a packet can make it through the pipeline.
It should be noted that during endpoint creation the pod is not yet marked "ready" until we are done. So we do not need to worry about packets during the startup phase. We mostly worry about the upgrade.
So, when we are upgrading, instead of replacing the links we would attach our new programs ahead of the old ones while the new ones are still disabled. Once both the dispatch and post program are attached, we flip the switch. After that we can safely remove the old programs.
Plugins need not be aware of this, again, since Cilium owns these bits. And at some future date if we can get rid of the separate post program we can just switch over without plugins being any the wiser.
Did I miss anything? What do you think of the above?
Did I miss anything? What do you think of the above?
Feel free to correct me if I'm missing something obvious, but...
Lastly, after drawing it out and giving it some more thought my initial suggestion needs some slight modification. Since cil_... is very likely to tail call we can't just place the post hooks after it in the dispatch program. We will still need to add an epilogue to all programs like in your initial suggestion. However, it would tail call into a common "cil_exit" program which would then call the post hooks.
I don't think this is true. bpf_tail_call() only unwinds the current stack frame, so in theory it shouldn't matter if the inner function tail calls. The return value of the initial call should be the return value of the last tail call in the chain.
I put together this small PoC to double check:
__section_entry
int cilium_pre_tail(struct __ctx_buff *ctx __maybe_unused) {
    printk("\tcilium_pre_tail(ctx)\n");
    return TC_ACT_UNSPEC;
}

__section_entry
int cilium_post_tail(struct __ctx_buff *ctx __maybe_unused) {
    printk("\tcilium_post_tail(ctx)\n");
    return -5;
}

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(key_size, sizeof(__u32));
    __uint(max_entries, 2);
    __array(values, int());
} call_map __section(".maps") = {
    .values = {
        [0] = &cilium_pre_tail,
        [1] = &cilium_post_tail,
    },
};

static __noinline int cilium_pre(const struct __ctx_buff *ctx) {
    volatile int ret = TC_ACT_UNSPEC;

    tail_call_static(ctx, call_map, 0);
    return ret;
}

static __noinline int cilium(const struct __ctx_buff *ctx __maybe_unused) {
    return TC_ACT_OK;
}

static __noinline int cilium_post(const struct __ctx_buff *ctx, int cil_ret) {
    volatile int ret = cil_ret;

    tail_call_static(ctx, call_map, 1);
    return ret;
}

static int cilium_dispatch(const struct __ctx_buff *ctx) {
    int ret;

    ret = cilium_pre(ctx);
    printk("cilium_pre(ctx) returned %d\n", ret);
    if (ret != TC_ACT_UNSPEC)
        return ret;
    ret = cilium(ctx);
    printk("cilium(ctx) returned %d\n", ret);
    ret = cilium_post(ctx, ret);
    printk("cilium_post(ctx) returned %d\n", ret);
    return ret;
}
You have to be careful that the compiler doesn't try to optimize and pull the constants into the calling function (i.e. need to mark int ret as volatile), but in this example you can see that the call to cilium_pre returns inside cilium_dispatch even though it tail calls to cilium_pre_tail. The return value actually originates from the tail call target. This is clearer to see in the result of cilium_post where -5 is returned.
<...>-1076989 [012] b..11 85727.838885: bpf_trace_printk: cilium_pre_tail(ctx)
<...>-1076989 [012] b..11 85727.838897: bpf_trace_printk: cilium_pre(ctx) returned -1
<...>-1076989 [012] b..11 85727.838898: bpf_trace_printk: cilium(ctx) returned 0
<...>-1076989 [012] b..11 85727.838898: bpf_trace_printk: cilium_post_tail(ctx)
<...>-1076989 [012] b..11 85727.838899: bpf_trace_printk: cilium_post(ctx) returned -5
You should just be able to set up a harness like this and replace the cilium_pre, cilium, and cilium_post calls at load time. Even if cilium_pre, cilium, or cilium_post tail calls, you still return to the calling context, cilium_dispatch, and can inspect whatever the return value was inside cilium_post.
In practice, the "template" would look a bit more like this:
__section("freplace/cilium_pre")
static __noinline int cilium_pre(const struct __ctx_buff *ctx) {
volatile int ret = TC_ACT_UNSPEC;
return ret;
}
__section("freplace/cilium")
static __noinline int cilium(const struct __ctx_buff *ctx __maybe_unused) {
volatile int ret = 0;
return ret;
}
__section("freplace/cilium_post")
static __noinline int cilium_post(const struct __ctx_buff *ctx, int cil_ret) {
volatile int ret = cil_ret;
return ret;
}
__section_entry
int cilium_dispatch(const struct __ctx_buff *ctx) {
int ret;
ret = cilium_pre(ctx);
if (ret != TC_ACT_UNSPEC)
return ret;
ret = cilium(ctx);
return cilium_post(ctx, ret);
}
I have been playing around with this a bit. In my testing it seems tail call and global function mixing works. However, I did find what I think is a deal breaker. We have an interesting mechanism in Cilium where we can do "local delivery", meaning that when a pod tries to talk to another pod on the same node we will actually tail call into the policy program of the destination pod to perform ingress policy before redirecting the packet. But since that policy program is loaded as part of the destination pod's program, it would call the post hooks of that program before the packet is even redirected.
So, I think this scenario is a non-issue. We wouldn't need to instrument any of the exit points, since things should "just work". For example, if a host program tail calls into an lxc collection, e.g. cil_lxc_policy, when cil_lxc_policy or some program it tail calls into exits, control returns to the original dispatcher program and you'd end up running the post-Cilium hook associated with that host interface attachment point.
I don't think this is true. bpf_tail_call() only unwinds the current stack frame, so in theory it shouldn't matter if the inner function tail calls. The return value of the initial call should be the return value of the last tail call in the chain.
:exploding_head: I was truly under the impression that because it reuses the stack frame it would never return execution to the caller. Not entirely sure where I got that idea. I think it's because in x86 the return address is on the stack, but of course in eBPF it's just something implicit.
Anyway, that makes this a lot easier. Thank you!
Anyway, that makes this a lot easier.
Indeed it does. It also means no special handling depending on program type. I think this should work well as a generic mechanism for instrumenting specific entrypoints. I'll update the CFP to describe the approach.
Given this is fairly generic, it also makes me rethink other elements of the design:
While the original design focuses only on instrumenting TC/TCX attachment points, this new mechanism lends itself well to instrumentation of any and all attachments (XDP, cgroup hooks, etc.). I can see the actual logic for interacting with a plugin living deeper in the stack (e.g. pkg/bpf/) and making the gRPC interface a bit more generic. I'm imagining something more like an InstrumentCollection request where the loader sends a request containing the set of programs in the collection that are instrumentable and the plugin optionally loads the pre/post hooks for each program in that collection. I'd have to think a bit more about the finer details and sequencing, but that would be the gist of it. We'd probably just serialize and propagate some of the CollectionOptions through the request as well in case the plugin wants to use the same options to configure its own programs.
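Purely to illustrate the shape I have in mind (every type and field name below is hypothetical, not a concrete proto proposal):
package pluginapi

// Hypothetical message shapes for an InstrumentCollection-style exchange.
type InstrumentCollectionRequest struct {
    CiliumVersion string            // semver, so a plugin can pick compatible BPF
    NodeConfig    map[string]string // serialized subset of CollectionOptions / node config
    Attachment    string            // what this collection serves, e.g. an interface name
    Programs      []string          // instrumentable entrypoints in the collection
}

type RequestedHook struct {
    Program string // entrypoint to instrument, e.g. "cil_from_container"
    Kind    string // "pre" or "post"
}

type InstrumentCollectionResponse struct {
    Hooks []RequestedHook // hooks the plugin wants for this collection
}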
I'm imagining something more like an InstrumentCollection request where the loader sends a request containing the set of programs in the collection that are instrumentable and the plugin optionally loads the pre/post hooks for each program in that collection.
Yea. Not sure if it would be better to do it per collection or per program. Cilium conditionally chooses to attach or not attach programs. For example, we only attach a cil_to_netdev program under certain configurations, but it's still always there in the collection.
While the original design focuses only on instrumenting TC/TCX attachment points, this new mechanism lends itself well to instrumentation of any and all attachments (XDP, cgroup hooks, etc.). I can see the actual logic for interacting with a plugin living deeper in the stack (e.g. pkg/bpf/) and making the gRPC interface a bit more generic.
Yes exactly. I think the mechanism is pretty clear now. Remaining is the contract and API. The relationship between Cilium and a plugin is necessarily going to be very cooperative. For every hook point we should make clear what we expect plugins to possibly do or not do. For example, in a PRE hook we might expect a plugin to take control over a packet (returning an OK/SHOT verdict), at which point Cilium does not care about what happens with the packet, but it should not modify the ctx->mark while still returning UNSPEC since Cilium may have assumptions about aspects of the packet/skb.
I think the contract part is to write something to that effect down in something like a spec document, and the API part is to have some programmatic way to communicate which hook points have which rules.
I guess this touches on a broader concern of mine. I know the CFP mentions cross-version compatibility as being a non-goal as well as "Supporting a plugin ecosystem" being a non-goal; however, I am not entirely sure those can be ignored. I think forwards and backwards compatibility is something we should think about now since it's very hard to add later on.
Same goes for the ecosystem part. I understand you likely have a specific use case where in your environment there will only be your own plugin, but once this is part of the project, other people will want to make plugins too, and at some point a user will want to use two plugins at once. If we don't consider that then it's going to be painful to add later. What happens when 2 plugins want to use the same hook point?
We'd probably just serialize and propagate some of the CollectionOptions through the request as well in case the plugin wants to use the same options to configure its own programs.
Keeping the above in mind, what happens on updates? When a setting disappears or is renamed, or a map changes shape. I figure that if there is a Cilium release where any of these change, you would first need to update the plugin, and then tell a user to upgrade the plugin daemon set first. The new plugin version can then load multiple variations of the eBPF code depending on the Cilium version. And then Cilium can be upgraded. So we need to include some way to communicate Cilium version or something similar so the plugin can do the right thing.
If instead of straight propagation we include things explicitly in the gRPC protobuf, then we can use an API version and gRPC backwards/forwards compatibility features to deal with changes over time.
Perhaps something similar is also in order for BPF level guarantees. In theory you can use the semantic version number to switch between different paths in your plugin, but it would likely be better if we can communicate capabilities and contract info programmatically.
Yes exactly. I think the mechanism is pretty clear now. Remaining is the contract and API. The relationship between Cilium and a plugin is necessarily going to be very cooperative. For every hook point we should make clear what we expect plugins to possibly do or not do. For example, in a PRE hook we might expect a plugin to take control over a packet (returning an OK/SHOT verdict), at which point Cilium does not care about what happens with the packet, but it should not modify the ctx->mark while still returning UNSPEC since Cilium may have assumptions about aspects of the packet/skb.
I think the contract part is to write something to that effect down in something like a spec document, and the API part is to have some programmatic way to communicate which hook points have which rules.
Beyond the obvious high-level contract (i.e. "a PRE hook may return -1 to continue onto the Cilium program; otherwise, execution of the pipeline terminates" or "a POST hook runs after the Cilium program; it may access Cilium's intended verdict by xyz and either forward it or override it...") I struggle to think of a set of restrictions about what a particular hook point ought to do or not do with a packet or other context that applies universally. Your example about returning UNSPEC and ctx->mark seems like generally good guidance, but I can't say with confidence that this should always be the case (it's hard to anticipate what a future plugin might want to do). It would be nice if we could enumerate a set of rules that guarantees a plugin isn't breaking Cilium or compromising security in some way, but to me it seems impossible to extricate the plugin from Cilium datapath internals or draw a clear boundary between what a plugin can touch and do vs what Cilium can touch and do. I'd like to hear more about this if you see some obvious constraints you'd like to see enforced on plugin programs in various contexts, but from where I'm standing this seems hard.
I guess this touches on a broader concern of mine. I know the CFP mentions cross-version compatibility as being a non-goal as well as "Supporting a plugin ecosystem" being a non-goal; however, I am not entirely sure those can be ignored. I think forwards and backwards compatibility is something we should think about now since it's very hard to add later on.
I'm all for building in a stronger contract for forward/backwards compatibility from the start. Mainly, I did not want to burden Cilium or place more restrictions than necessary on how the datapath can change in the future. In the past, discussions around datapath extension have stalled out due in no small part to concerns around maintenance cost or the restrictions it would impose. However, stronger stability guarantees would no doubt be helpful.
Same goes for the ecosystem part. I understand you likely have a specific use case where in your environment there will only be your own plugin, but once this is part of the project, other people will want to make plugins too, and at some point a user will want to use two plugins at once. If we don't consider that then it's going to be painful to add later. What happens when 2 plugins want to use the same hook point?
The classic "we have one plugin/extension/attachment, yes. what about second plugin/extension/attachment?" question. I guess I can imagine a scenario where you might have two components that attach hooks to disjoint sets of hooks in the Cilium datapath that are manged by different teams. The harder question to answer is what behavior and interaction should look like for two plugins that want to attach programs to the same hook point; who decides on the relative attachment order? It's not a simple matter of one plugin having globally higher priority. In some contexts, one plugin's programs might need to run before another's, and in other contexts the opposite might be true. Does a plugin itself possess enough knowledge to determine this order or does this need to be something user configurable? I was hoping to sidestep this can of worms for now, but I can see how it might affect the plugin contract down the line so yeah it might be worth thinking about.
Keeping the above in mind, what happens on updates? When a setting disappears or is renamed, or a map changes shape. I figure that if there is a Cilium release where any of these change, you would first need to update the plugin, and then tell a user to upgrade the plugin daemon set first. The new plugin version can then load multiple variations of the eBPF code depending on the Cilium version. And then Cilium can be upgraded. So we need to include some way to communicate Cilium version or something similar so the plugin can do the right thing.
Yes, this is what I had in mind. At some level a plugin needs to be aware of internal Cilium implementation changes, so you would adjust+test a plugin to be compatible with a new Cilium release ahead of time. You deploy the new plugin version that supports Cilium versions n-1 and n then upgrade Cilium from n-1 to n. If you upgrade Cilium first then in theory any gRPC requests to the old plugin version (and therefore any datapath regenerations) should fail as long as plugins enforce a "supported Cilium version" range check. I guess this is what you're saying basically.
If instead of straight propagation we include things explicitly in the gRPC protobuf, then we can use an API version and gRPC backwards/forwards compatibility features to deal with changes over time.
I can see a case for mirroring config fields in the gRPC proto vs just passing something raw through to insulate plugins from internal config renames and such, but I don't know if an API version is sufficient; I think you would need to provide the actual Cilium version to the plugin no matter what. My thought here is that there may be aspects of Cilium's behavior that change between versions that must be taken into account by the plugin. I don't want to create a scenario where all config needs to be manually plumbed through into a gRPC field, but I'm sure we can make common things (e.g. endpoint IP or endpoint ID) part of the actual proto and have some stronger guarantees around how those are represented.
Perhaps something similar is also in order for BPF level guarantees. In theory you can use the semantic version number to switch between different paths in your plugin, but it would likely be better if we can communicate capabilities and contract info programmatically.
I'm not quite sure what you mean by "capabilities and contract info" here. Like map structure?
Beyond the obvious high-level contract (i.e. "a PRE hook may return -1 to continue onto the Cilium program; otherwise, execution of the pipeline terminates" or "a POST hook runs after the Cilium program; it may access Cilium's intended verdict by xyz and either forward it or override it...") I struggle to think of a set of restrictions about what a particular hook point ought to do or not do with a packet or other context that applies universally.
Yea, fair.
It would be nice if we could enumerate a set of rules that guarantees a plugin isn't breaking Cilium or compromising security in some way, but to me it seems impossible to extricate the plugin from Cilium datapath internals or draw a clear boundary between what a plugin can touch and do vs what Cilium can touch and do. I'd like to hear more about this if you see some obvious constraints you'd like to see enforced on plugin programs in various contexts, but from where I'm standing this seems hard.
I think this was wishful thinking on my part. It would likely be really hard to even define all the things Cilium relies on.
I guess the scenario I am worried about is that during a development cycle Cilium breaks a plugin. Let's assume the plugin can implement a workaround, but that workaround does not function on the old version. Now the plugin needs a way to pick which version of the BPF to load. Perhaps simply passing the Cilium semver would work? But that might get messy if we backport something.
If the burden is going to be on the plugin to keep up with Cilium, we should at least give the plugin some info to deal with such changes.
At some level a plugin needs to be aware of internal Cilium implementation changes, so you would adjust+test a plugin to be compatible with a new Cilium release ahead of time. You deploy the new plugin version that supports Cilium versions n-1 and n then upgrade Cilium from n-1 to n. If you upgrade Cilium first then in theory any gRPC requests to the old plugin version (and therefore any datapath regenerations) should fail as long as plugins enforce a "supported Cilium version" range check. I guess this is what you're saying basically.
Yes, exactly.
The classic "we have one plugin/extension/attachment, yes. what about second plugin/extension/attachment?" question. I guess I can imagine a scenario where you might have two components that attach hooks to disjoint sets of hooks in the Cilium datapath that are manged by different teams. The harder question to answer is what behavior and interaction should look like for two plugins that want to attach programs to the same hook point;
For the disjoint sets it's indeed easy. For some hooks it's easy regardless, the observation hooks for example, where order does not matter since none of them should modify the context or change the return code.
For hooks that can modify the return code and/or modify the context it's more difficult. I think there are three possibilities when you have plugins on the same hook:
- Plugins are compatible (any order works)
- Plugins are incompatible (is never going to work)
- Plugins are order dependent (for example, one plugin modifying a packet and another adding encapsulation, which must run such that the modification happens at the right layer)
I think a TCX style of relative ordering would work the best. We make plugins responsible for coexisting and figuring out the attachment order, not the user. In theory if both plugins respect each other, the order in which we give them a chance to attach shouldn't matter. This becomes slightly harder if only one of the plugins knows about the other. I am inclined to just use the name/identifier of a plugin to determine the order in which we give plugins a chance to relatively attach. But I am open to better ideas.
In terms of mechanism to allow a variable set of plugins we can make a directory for these. The agent can then watch the directory and open connections to unix sockets added there. This should also allow for rolling updates.
What I was thinking is that we might generate the dispatcher program on the fly with cilium/ebpf instead of having a C version. Then we can make hook points on demand. So if we have 3 observation plugins, we make 3 hook points. If we have a plugin which only wants to do a PRE hook we have no observation hooks. Should be better performance as well, not having a bunch of unused hookpoints. It would be a slightly more complex interaction, before we load Cilium's program we would first ask all plugins which hook points they want (and do the relative attachment).
So you would get something like:
- Cilium parses ELF to CollectionSpec
- [for each plugin] Cilium communicates CollectionSpec, which hooks other plugins are requesting, and asks which hooks at which relative attachment this plugin wants.
- Cilium generates dispatch program + no-op hooks according to plugin requests
- Cilium loads spec to Collection
- [for each plugin] Cilium sends info needed to attach to the hook points requested by that plugin
- Cilium attaches dispatch programs
I'm not quite sure what you mean by "capabilities and contract info" here. Like map structure?
Nevermind that. Does not make sense given the above.
I guess the scenario I am worried about is that during a development cycle Cilium breaks a plugin. Let's assume the plugin can implement a workaround, but that workaround does not function on the old version. Now the plugin needs a way to pick which version of the BPF to load. Perhaps simply passing the Cilium semver would work? But that might get messy if we backport something.
If the burden is going to be on the plugin to keep up with Cilium, we should at least give the plugin some info to deal with such changes.
My feeling is that providing the semver is a good starting point. On top of that, we should provide the plugin with information on how Cilium is currently configured (e.g. BPF host routing mode vs legacy kernel routing), since those might all factor into what attachments a plugin provides and how those are configured. The PoC passes a LocalNodeConfig field to the plugin through the request. This or something like it could contain the semver and other relevant node config.
I think a TCX style of relative ordering would work the best.
Yes, I think something like the TCX interface is good for expressing ordering intent.
We make plugins responsible for coexisting and figuring out the attachment order, not the user. In theory if both plugins respect each other, the order in which we give them a chance to attach shouldn't matter. This becomes slightly harder if only one of the plugins knows about the other. I am inclined to just use the name/identifier of a plugin to determine the order in which we give plugins a chance to relatively attach. But I am open to better ideas.
I think the decision here hinges on who is developing plugin A, who is developing plugin B, who is trying to deploy these plugins together, and what the relationship is between these parties. Basically, what are the personas involved and what are their capabilities? Broadly speaking, I see two extremes:
- Scenario 1: All In House: This is similar to my situation. We have lots of teams internally who may want to develop and integrate their own datapath plugins, but we have tight coordination and one group who controls what gets deployed together and how those components are configured. There would be a limited set of configurations that are known up front and we could easily encode the rules about who should be attached before/after whom into the logic of the plugins themselves. We would verify that those configurations work before deploying to production and work with internal teams to make adjustments as necessary. You have a potentially quadratic number of combinations as you add plugins, but in practice this would probably be fairly constrained.
- Scenario 2: Off The Shelf Assembler: I've deployed Cilium and am shopping around GitHub for some Cilium plugins that solve some niche use cases I have. I want to deploy two such plugins I've found and make them work together, but those plugins were never developed with each other in mind. When I deploy them, things don't quite work since the attachment order is backwards from what is needed to make them work. I need to patch the plugins to be aware of each other.
I'm less inclined to care about cases like scenario 2 (I think those using datapath plugins would likely have fairly advanced requirements and be developing their own plugins anyway), but even in scenario 1 it might be nice to be able to describe attachment ordering constraints in some way that is decoupled from the plugins themselves. To sum up, I see three options:
- Cilium provides the context to a plugin in each request describing all programs currently attached to the attachment point, the plugin decides where in this list to insert its own program, and the response contains an index indicating the spot in that list it wants to place it.
- Cilium provides no context to the plugin about what's already attached. The plugin's response just contains a static set of ordering constraints that Cilium needs to enforce as it asks each plugin if it wants to add an attachment to that attachment point. Ex: { "anchor": "before", "plugin": "some-other-plugin" }. Cilium would need to collect the full set of constraints before assembling the final program chain.
- A set of static scheduling constraints similar to those described in option 2 are instead defined in a CRD or set of CRDs, e.g., CiliumDatapathAttachmentPolicy:
  ...
  spec:
    rules:
    - attachmentType: "skb"
      plugin: "my-plugin"
      before: "other-plugin"
  ...
Option three seems the most flexible, but I could see the case for starting with option two and adding support later for a CRD-driven policy approach if needed.
In terms of mechanism to allow a variable set of plugins we can make a directory for these. The agent can then watch the directory and open connections to unix sockets added there. This should also allow for rolling updates.
I wonder if we could just have Cilium create the socket and have plugins connect to it. This might be simpler.
What I was thinking is that we might generate the dispatcher program on the fly with cilium/ebpf instead of having a C version.
Yep, this is what I was thinking as well.
Then we can make hook points on demand. So if we have 3 observation plugins, we make 3 hook points. If we have a plugin which only wants to do a PRE hook we have no observation hooks. Should be better performance as well, not having a bunch of unused hookpoints. It would be a slightly more complex interaction, before we load Cilium's program we would first ask all plugins which hook points they want (and do the relative attachment).
So you would get something like:
- Cilium parses ELF to CollectionSpec
- [for each plugin] Cilium communicates CollectionSpec, which hooks other plugins are requesting, and asks which hooks at which relative attachment this plugin wants.
- Cilium generates dispatch program + no-op hooks according to plugin requests
- Cilium loads spec to Collection
- [for each plugin] Cilium sends info needed to attach to the hook points requested by that plugin
- Cilium attaches dispatch programs
Yeah agree, thinking about this more last week it seems like we need two phases of interaction with plugins as you outline: phase one determines which programs in your collection need to be replaced with dispatcher programs and phase two actually does the attachment. I do wonder if the plugin interface could be made slightly cleaner by rolling all of this into one request/phase by bringing back the BPFFS pinning thing. That way, the response from the plugin in the first request provides both a reference to the loaded program and constraints around where it should be placed in the list instead of needing to have a second round of requests to finalize everything. I want to keep the gRPC interface as simple to implement as possible, but I don't feel too strongly either way. Also, as stated above, I think there are some alternatives on how we express the attachment "scheduling constraints" with a CRD being on the extreme end of configurability.
Regarding bpffs: there are other ways to plumb fds between processes: pidfd_getfd(2) and SCM_RIGHTS. Since communication would already take place over a unix socket, the latter feels like the obvious choice. I assume this approach works for all kernel resources that can be represented by an fd.
pidfd_getfd requires special ptrace permissions and I haven't investigated how it would realistically obtain the Cilium agent's pidfd. Plugins would also need to get notified of agent restarts somehow, which the unix socket could provide out of the box.
I'd steer clear of bpffs in any case, it's not worth the pain.
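A minimal sketch of the sending side with golang.org/x/sys/unix (the single payload byte is just a carrier; wiring this up next to gRPC would need its own small side channel):
package main

import (
    "net"

    "golang.org/x/sys/unix"
)

// sendFD passes a BPF program (or map/link) file descriptor to the peer as
// SCM_RIGHTS ancillary data; the receiver gets its own fd referencing the
// same kernel object, so no pin and no shared bpffs are needed.
func sendFD(conn *net.UnixConn, fd int) error {
    oob := unix.UnixRights(fd)
    _, _, err := conn.WriteMsgUnix([]byte{0}, oob, nil)
    return err
}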
My feeling is that providing the semver is a good starting point. On top of that, we should provide the plugin with information on how Cilium is currently configured (e.g. BPF host routing mode vs legacy kernel routing), since those might all factor into what attachments a plugin provides and how those are configured. The PoC passes a LocalNodeConfig field to the plugin through the request. This or something like it could contain the semver and other relevant node config.
Yes, let's do that. Pass the semver + node config + endpoint config, with the fields being explicit in protobuf so we can manage changes over time more explicitly.
I think the decision here hinges on who is developing plugin A, who is developing plugin B, who is trying to deploy these plugins together, and what the relationship is between these parties. Basically, what are the personas involved and what are their capabilities?
Yes agreed. I think you are also right on the mark with the personas. I think option 1 would work for all-in-house or fully public, where there is full control or transparency and the possibility of collaboration.
When I deploy them, things don't quite work since the attachment order is backwards from what is needed to make them work. I need to patch the plugins to be aware of each other.
Perhaps a bit naively, I was hoping that a situation like this could be solved by a PR or an issue. Perhaps the more difficult situations would be when combining public and in-house, since an open source plugin will never be aware of proprietary plugins.
Option 2 of communicating constraints is more work, but I do kind of like the idea personally. It makes it possible for plugins to properly co-exist if only one of them is aware of the other. Option 3 might be over the top. At that point we are burdening users with low level implementation details. While it would allow someone to configure 2 plugins that don't know each other and have them attach properly, I am not sure if it's worth it. So I am inclined to look in the direction of option 2; like you mentioned, we should be able to add option 3 in the future if we somehow do end up wanting that.
I wonder if we could just have Cilium create the socket and have plugins connect to it. This might be simpler.
I do like the idea of Cilium being the "server", serving connections, at least on the socket level. In gRPC terms we can still have Cilium be the client making requests.
How would we handle cilium agent lifecycle in this model? When the agent starts it will start to reload its endpoint programs. When every plugin has its own socket we can simply let the agent wait until it has a successful connection to each socket. But with the single socket, it's up to the plugins to establish the connection, so how will the agent know when all plugins have established their connections?
It could be solved by adding info about plugins out of band, like listing plugins to wait for in cilium options. But that seems painful when adding or removing plugins.
Yeah agree, thinking about this more last week it seems like we need two phases of interaction with plugins as you outline: phase one determines which programs in your collection need to be replaced with dispatcher programs and phase two actually does the attachment. I do wonder if the plugin interface could be made slightly cleaner by rolling all of this into one request/phase by bringing back the BPFFS pinning thing.
I don't think a single request+response would work. For the PROG_EXT programs you need to provide the program ID and BTF ID of the program and function you intend to replace when you load the PROG_EXT, so the verifier can assert its prototype is compatible. So we need to load the dispatcher into the kernel first and send the IDs to the plugin before it can load its programs.
What we can consider is who manages the lifecycle of the plugin's programs. My initial idea was to let the plugin manage its own links, making the plugin responsible for pinning them. But once the plugin has loaded the PROG_EXT programs it could also transfer them to Cilium and let Cilium do the attachment and link management.
I have to agree with Timo that BPFFS does not seem like an ideal way to transfer ownership. If I recall correctly iproute2 uses SCM_RIGHTS to transfer map file descriptors. Presumably pidfd_getfd(2) also works. I am not sure if that has advantages over transferring the program IDs, given we also turn IDs into FDs just the same. In any case, I also think something purely over the unix socket is preferable to also needing to share the BPFFS.
In past messages you mentioned the reason for BPFFS was so plugins do not have to hold the FD open. That is something we would have to consider when doing the transfer over the unix socket as well. After the second request-response, Cilium would have to send a final ACK of some sort that it has a copy of the FD, so the plugin can close its FD without the program being released.
Personally I would go for the following:
- Cilium -> plugin, send settings, ask for hook points + constraints
- plugin -> Cilium, send hook points + constraints, ask for [prog ID + BTF type ID]s
- Cilium -> plugin, send [prog ID + BTF type ID]s, ask for program IDs
- plugin -> Cilium, send program IDs, ask for ACK
- Cilium -> plugin, ACK (allow plugin to release FDs)
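To sketch what a single exchange could carry (all names below are invented for illustration; in proto terms each step would probably be a oneof):
package pluginapi

// NegotiateMsg is a hypothetical envelope covering the five steps above.
type NegotiateMsg struct {
    // 1: Cilium -> plugin: settings (semver, node/endpoint config).
    Settings map[string]string
    // 2: plugin -> Cilium: requested hook points plus ordering constraints.
    Hooks       []string
    Constraints []string // e.g. "before:some-other-plugin"
    // 3: Cilium -> plugin: dispatcher program ID and BTF type IDs to replace.
    DispatcherProgID uint32
    TargetBTFIDs     map[string]uint32
    // 4: plugin -> Cilium: loaded EXT program IDs per hook point.
    ExtProgIDs map[string]uint32
    // 5: Cilium -> plugin: ACK, after which the plugin may release its FDs.
    Ack bool
}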
Not sure though how you neatly model that in gRPC. Does a bi-directional stream make sense or just a series of unary RPC calls?
I'd steer clear of bpffs in any case, it's not worth the pain.
Sure.
Option 2 of communicating constraints is more work, but I do kind of like the idea personally. It makes it possible for plugins to properly co-exist if only one of them is aware of the other. Option 3 might be over the top. At that point we are burdening users with low level implementation details. While it would allow someone to configure 2 plugins that don't know each other and have them attach properly, I am not sure if it's worth it. So I am inclined to look in the direction of option 2; like you mentioned, we should be able to add option 3 in the future if we somehow do end up wanting that.
Yes, I think option two is a good starting point. Implementation is slightly trickier on the Cilium side, but I think it simplifies things for plugins themselves. The basic idea would be to build a DAG using constraints as edges and programs as nodes, do a topographical sort, and have some kind of cycle detection to discover conflicting constraints. I also like this because it feels more in line with other k8s APIs where constraints are specified declaratively.
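For concreteness, a minimal sketch of that constraint ordering, assuming before[a] lists the hooks that must run after a; the names are invented:

package ordering

import "fmt"

// Order returns hook names in an order that satisfies before[a] = {b, ...}
// ("a must run before b"), or an error if the constraints form a cycle.
func Order(hooks []string, before map[string][]string) ([]string, error) {
    const (
        unvisited = iota
        inProgress
        done
    )
    state := make(map[string]int, len(hooks))
    var postorder []string

    var visit func(h string) error
    visit = func(h string) error {
        switch state[h] {
        case done:
            return nil
        case inProgress:
            return fmt.Errorf("conflicting constraints: cycle through %q", h)
        }
        state[h] = inProgress
        for _, next := range before[h] {
            if err := visit(next); err != nil {
                return err
            }
        }
        state[h] = done
        postorder = append(postorder, h)
        return nil
    }

    for _, h := range hooks {
        if err := visit(h); err != nil {
            return nil, err
        }
    }
    // Reversing the DFS post-order yields a valid topological order.
    for i, j := 0, len(postorder)-1; i < j; i, j = i+1, j-1 {
        postorder[i], postorder[j] = postorder[j], postorder[i]
    }
    return postorder, nil
}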
How would we handle Cilium agent lifecycle in this model? When the agent starts it will start to reload its endpoint programs. When every plugin has its own socket we can simply let the agent wait until it has a successful connection to each socket. But with the single socket, it's up to the plugins to establish the connection, so how will the agent know when all plugins have established their connections?
It could be solved by adding info about plugins out of band, like listing plugins to wait for in Cilium options. But that seems painful when adding or removing plugins.
However you slice it, I think we need some authoritative list of "installed" plugins to live somewhere, whether that's a Cilium flag containing a list of plugin names or a set of CRDs (similar to the CSIDriver CRD that registers a k8s CSI driver). Without this, I think the startup sequencing gets tricky. So, I guess I'll amend my previous comment and suggest taking a page or two from how CSI is modeled:
- To "register" a plugin you need to add the name of that plugin to a flag passed to Cilium,
--datapath-plugins=a,b,.... The CRD thing seems like overkill unless each plugin needs some configuration in the future. - Cilium creates a subdirectory by that name in some well-known parent directory (
/var/lib/cilium/datapath-plugins/a,/var/lib/cilium/datapath-plugins/b, ...). - Cilium creates a socket in that directory, e.g.,
cdpp.sock. - Plugins mount this directory using a
hostPathvolume and connect to that socket. - Datapath regeneration fails for an endpoint if one of the registered plugins doesn't respond or isn't connected. On initial startup Cilium could wait for connections to be established to each plugin to make sure we don't get a lot of endpoint generation failures at the start.
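For illustration, a rough sketch of the Cilium-side setup under this scheme, using the flag and paths from the list above; whether Cilium or the plugin ultimately acts as the gRPC server is still open, so the listener direction here is just one possibility.

package registration

import (
    "fmt"
    "net"
    "os"
    "path/filepath"

    "google.golang.org/grpc"
)

// ListenForPlugins creates one directory and one unix socket per registered
// plugin (e.g. from --datapath-plugins=a,b) and serves gRPC on each.
func ListenForPlugins(plugins []string) ([]*grpc.Server, error) {
    var servers []*grpc.Server
    for _, name := range plugins {
        dir := filepath.Join("/var/lib/cilium/datapath-plugins", name)
        if err := os.MkdirAll(dir, 0o755); err != nil {
            return nil, err
        }
        sock := filepath.Join(dir, "cdpp.sock")
        _ = os.Remove(sock) // drop a stale socket left over from a previous run
        l, err := net.Listen("unix", sock)
        if err != nil {
            return nil, fmt.Errorf("listen on %s: %w", sock, err)
        }
        srv := grpc.NewServer()
        // The (hypothetical) plugin coordination service would be registered here.
        go func() { _ = srv.Serve(l) }()
        servers = append(servers, srv)
    }
    return servers, nil
}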
(I will respond to the rest of your comment tomorrow when I have some more time)
I don't think a single request+response would work. For the PROG_EXT programs you need to provide the program ID and BTF ID of the program and function you intend to replace when you load the PROG_EXT, so the verifier can assert its prototype is compatible. So we need to load the dispatcher into the kernel before and send the IDs to the plugin before it can load its programs.
Ignoring the program ownership transfer problem for a second, my thought was that Cilium would just make a request to each plugin up front in which each plugin loads any programs it wants to load and sends a response containing ordering constraints, program references (IDs), and info about what each program is meant to instrument. That would give Cilium enough context to decide if it needs to replace any entrypoints with dispatcher functions before it loads the collection. Cilium wouldn't pass any prog IDs or BTF type IDs to the plugin; it would just receive names of functions in a collection that a plugin wants to instrument and later translate these to BTF IDs itself. Roughly:
for each plugin:
1. Cilium -> plugin, send settings + attachment context (e.g. iface name(s), ...), ask for hook programs + constraints
2. plugin -> Cilium, send [program ID + attachment point (program name, pre/post)]s
for any program in collection that needs instrumentation by at least one plugin:
generate the dispatcher program spec and replace the original in the collection
load the collection
do program replacement, mapping names to BTF IDs, create links, ...
However, I actually don't like this as much as I originally did for two reasons:
- You can't really ignore the ownership problem, so you would still need a second round of communication with the plugins to ack and allow them to release references. So really there's still two phases. I also don't like needing the explicit ack/release in the interface, as this seems like a recipe for resource leaks. For example, what if Cilium shuts down between the first phase and second phase? I'd rather not have plugins worry about the FD lifecycles and needing to cleanup in cases like this.
- It might be useful for a plugin to insert Cilium programs into a PROG_ARRAY. An interface more like what you describe below could be useful for that eventually.
What we can consider is who manages the lifecycle of the plugin's programs. My initial idea was to let the plugin manage its own links, making the plugin responsible for pinning them. But once the plugin has loaded the PROG_EXT programs it could also transfer them to Cilium and let Cilium do the attachment and link management.
I have to agree with Timo that BPFFS does not seem like an ideal way to transfer ownership. If I recall correctly iproute2 uses SCM_RIGHTS to transfer map file descriptors. Presumably pidfd_getfd(2) also works. I am not sure if that has advantages over transferring the program IDs, given we also turn IDs into FDs just the same. In any case, I also think something purely over the unix socket is preferable to also needing to share the BPFFS.
In past messages you mentioned the reason for BPFFS was so plugins do not have to hold the FD open. That is something we would have to consider when doing the transfer over unix socket as well. After the second request-response Cilium would have to send a final ACK of some sort that it has a copy of the FD so the plugin can close its FD without the program being released.
Personally I would go for the following:
- Cilium -> plugin, send settings, ask for hook points + constraints
- plugin -> Cilium, send hook points + constraints, ask for [prog ID + BTF type ID]s
- Cilium -> plugin, send [prog ID + BTF type ID]s, ask for program IDs
- plugin -> Cilium, send program IDs, ask for ACK
- Cilium -> plugin, ACK (allow plugin to release FDs)
Not sure though how you neatly model that in gRPC. Does a bi-directional stream make sense or just a series of unary RPC calls?
As I mentioned above, I would prefer if we could avoid plugins needing to worry about FDs outliving a gRPC request and requiring an explicit ACK. I can see this getting messy as with the unexpected Cilium shutdown case. Instead, I wonder if we can do something more like this:
- Cilium -> plugin, send settings, ask for hook points + constraints
- plugin -> Cilium, send hook points + constraints, ask for [link IDs + program name]s
- Cilium -> plugin, send [link ID + program name]s
- plugin -> Cilium, send OK
After step two, Cilium has enough information to decide which programs to replace with dispatchers in the collection and how to generate those dispatchers. It would then load the collection and create links for each of the hook points. Initially, these links would point to dummy programs. In the second round of requests, the plugins call BPF_LINK_GET_FD_BY_ID to get an FD to the links, call BPF_LINK_UPDATE to replace the programs attached to the links, and close the link FD before responding. The program lifecycles would then be tied to that of the links which are always owned by Cilium. This adds more complexity and boilerplate for the plugins, but I think that could be helped with some kind of shim library. I'm also not sure if BPF_LINK_UPDATE would work in this context. If it does, this seems nice since any FD the request opens is closed before it responds.
However you slice it, I think we need some authoritative list of "installed" plugins to live somewhere, whether that's a Cilium flag containing a list of plugin names or a set of CRDs (similar to the CSIDriver CRD that registers a k8s CSI driver). Without this, I think the startup sequencing gets tricky. So, I guess I'll amend my previous comment and suggest taking a page or two from how CSI is modeled:
- To "register" a plugin you need to add the name of that plugin to a flag passed to Cilium, `--datapath-plugins=a,b,...`. The CRD thing seems like overkill unless each plugin needs some configuration in the future.
- Cilium creates a subdirectory by that name in some well-known parent directory (`/var/lib/cilium/datapath-plugins/a`, `/var/lib/cilium/datapath-plugins/b`, ...).
- Cilium creates a socket in that directory, e.g., `cdpp.sock`.
- Plugins mount this directory using a `hostPath` volume and connect to that socket.
- Datapath regeneration fails for an endpoint if one of the registered plugins doesn't respond or isn't connected. On initial startup Cilium could wait for connections to be established to each plugin to make sure we don't get a lot of endpoint generation failures at the start.
If we go with explicitly providing a list of installed plugins, I think the single socket approach is more elegant. The multiple unix sockets would only be necessary as an implicit way of communicating the installed plugins.
The "waiting for plugins" on startup also brought up some other questions. Mainly around what happens when a plugin does not connect or respond to requests in time. Presumably we would have some sort of timeout after which we have to make a decision: fail endpoint creation or continue without the plugin. If a plugin is crucial for connectivity or security then failure is preferable. For observation plugins, ignoring them would be. Perhaps that would also be something to include in the explicit registration part --datapath-plugins=a:crucial,b:optional (not sure what a nice format is for the flags).
whether that's a Cilium flag containing a list of plugin names or a set of CRDs (similar to the CSIDriver CRD that registers a k8s CSI driver).
Perhaps a CRD would be the way to go. I imagine that would be easier for users, since the custom resource would be part of the helm chart of the plugin, and not Helm values to modify in Cilium's chart.
my thought was that Cilium would just make a request to each plugin up front in which each plugin loads any programs it wants to load and sends a response containing ordering constraints, program references (IDs), and info about what each program is meant to instrument.
Yea, that only works with the static dispatcher, with a pre-determined number of hook functions. Because the dispatcher must be loaded before you can load the PROG_EXT programs. You cannot load the PROG_EXT programs first and then create/load the dispatcher. That is because you need to pass attach_btf_id and attach_prog_fd when loading the PROG_EXT programs.
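To make that ordering constraint concrete, here is roughly what loading a hook could look like with cilium/ebpf: the dispatcher must already be in the kernel so that its FD and the target function name can be resolved into attach_prog_fd/attach_btf_id at load time. The object file and program names below are placeholders.

package hookload

import (
    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/link"
)

// AttachHook loads a PROG_EXT program and attaches it over a global function
// in the already-loaded dispatcher identified by dispatcherID.
func AttachHook(dispatcherID ebpf.ProgramID) (link.Link, error) {
    // The plugin only receives the dispatcher's program ID; turning that into
    // an FD is done locally (and requires CAP_SYS_ADMIN).
    dispatcher, err := ebpf.NewProgramFromID(dispatcherID)
    if err != nil {
        return nil, err
    }
    defer dispatcher.Close()

    spec, err := ebpf.LoadCollectionSpec("plugin_hooks.o") // placeholder object
    if err != nil {
        return nil, err
    }
    // Resolve the attach target before load: the dispatcher program plus the
    // name of the global function in it that this program replaces.
    hook := spec.Programs["my_pre_hook"] // placeholder program name
    hook.AttachTarget = dispatcher
    hook.AttachTo = "cilium_pre_2"

    coll, err := ebpf.NewCollection(spec)
    if err != nil {
        return nil, err
    }
    defer coll.Close()

    // The target was fixed at load time, so no extra arguments are needed here.
    // The returned link holds its own reference to the program, so closing the
    // collection's FDs afterwards is fine.
    return link.AttachFreplace(nil, "", coll.Programs["my_pre_hook"])
}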
I also don't like needing the explicit ack/release in the interface, as this seems like a recipe for resource leaks. For example, what if Cilium shuts down between the first phase and second phase? I'd rather not have plugins worry about the FD lifecycles and needing to cleanup in cases like this.
That is a fair point.
It might be useful for a plugin to insert Cilium programs into a PROG_ARRAY. An interface more like what you describe below could be useful for that eventually.
That might work. So then in the second request Cilium would send the ID of a PROG_ARRAY, the plugin puts the file descriptors in there for transfer before sending a response. Cilium takes them from there and attaches them.
So:
- Cilium -> plugin, send settings, ask for hook points + constraints
- plugin -> Cilium, send hook points + constraints, ask for [prog ID + BTF type ID]s
- Cilium -> plugin, send [prog ID + BTF type ID]s, prog_array ID + hook-to-index mapping
- plugin -> Cilium, send OK (PROG_EXT programs in designated map indexes)
This way the plugin would only have to load the programs once it receives 3, and does not have to worry about holding onto the FD or pinning it once written to the map and before sending 4.
On the Cilium side we would not pin this PROG_ARRAY map, so if Cilium restarts after sending 3 but before 4, the map gets cleaned up along with the not yet attached programs.
We would have to try out if you can indeed store PROG_EXT programs in PROG_ARRAY maps.
In the second round of requests, the plugins call BPF_LINK_GET_FD_BY_ID to get an FD to the links, call BPF_LINK_UPDATE to replace the programs attached to the links, and close the link FD before responding. The program lifecycles would then be tied to that of the links which are always owned by Cilium. This adds more complexity and boilerplate for the plugins, but I think that could be helped with some kind of shim library. I'm also not sure if BPF_LINK_UPDATE would work in this context. If it does, this seems nice since any FD the request opens is closed before it responds.
Also an interesting idea, but I don't think it's allowed by the kernel. If I read this correctly then BPF_LINK_UPDATE is optional per link type. And bpf_struct_ops_link_lops does not implement it.
If we go with explicitly providing a list of installed plugins, I think the single socket approach is more elegant. The multiple unix sockets would only be necessary as an implicit way of communicating the installed plugins.
Yeah true, although I think you might need some kind of Register or GetDriverInfo call in the gRPC interface to identify the driver, since the identity of the plugin isn't implicit in this case. For now, let's aim for a single unix socket. If we discover some reason why this is more difficult than having multiple sockets then we can make an adjustment. I suspect this should work though.
The "waiting for plugins" on startup also brought up some other questions. Mainly around what happens when a plugin does not connect or respond to requests in time. Presumably we would have some sort of timeout after which we have to make a decision: fail endpoint creation or continue without the plugin. If a plugin is crucial for connectivity or security then failure is preferable. For observation plugins, ignoring them would be. Perhaps that would also be something to include in the explicit registration part --datapath-plugins=a:crucial,b:optional (not sure what a nice format is for the flags).
For optional plugins, there's some more nuance as well: once connectivity is restored, do you want to regenerate endpoints or not? I can see this being another configuration option for a driver.
Perhaps a CRD would be the way to go. I imagine that would be easier for users, since the custom resource would be part of the helm chart of the plugin, and not Helm values to modify in Cilium's chart.
Yeah, agree. I also think this lends itself well to adding more driver-specific config options like the "required vs optional" and "trigger regeneration on connect" settings.
With this, we'd also need to consider what happens when drivers are registered or unregistered (one of these CRDs is created or deleted) while Cilium is running. I think the exact implementation logic can wait for a PR, but suffice it to say these events would need to trigger endpoint regeneration.
That might work. So then in the second request Cilium would send the ID of a PROG_ARRAY, the plugin puts the file descriptors in there for transfer before sending a response. Cilium takes them from there and attaches them.
So:
- Cilium -> plugin, send settings, ask for hook points + constraints
- plugin -> Cilium, send hook points + constraints, ask for [prog ID + BTF type ID]s
- Cilium -> plugin, send [prog ID + BTF type ID]s, prog_array ID + hook-to-index mapping
- plugin -> Cilium, send OK (PROG_EXT programs in designated map indexes)
This way the plugin would only have to load the programs once it receives 3, and does not have to worry about holding onto the FD or pinning it once written to the map and before sending 4.
On the Cilium side we would not pin this PROG_ARRAY map, so if Cilium restarts after sending 3 but before 4, the map gets cleaned up along with the not yet attached programs.
We would have to try out if you can indeed store PROG_EXT programs in PROG_ARRAY maps.
Unfortunately, I don't think this will be possible. Browsing through the code I spotted this check which would prevent insertion of such a program.
static void *prog_fd_array_get_ptr(struct bpf_map *map,
                                   struct file *map_file, int fd)
{
        struct bpf_prog *prog = bpf_prog_get(fd);
        bool is_extended;

        if (IS_ERR(prog))
                return prog;

        if (prog->type == BPF_PROG_TYPE_EXT ||
            !bpf_prog_map_compatible(map, prog)) {
                bpf_prog_put(prog);
                return ERR_PTR(-EINVAL);
        }
        /* ... */
This was added fairly recently in this patchset:
Moreover, an extension program should not be tailcalled. As such, return
-EINVAL if the program has a type of BPF_PROG_TYPE_EXT when adding it to a
prog_array map.
Also an interesting idea, but I don't think it's allowed by the kernel. If I read this correctly then BPF_LINK_UPDATE is optional per link type. And bpf_struct_ops_link_lops does not implement it.
I believe bpf_tracing_link_lops would be the relevant struct bpf_link_ops, but the fact remains the same. It looks like update() is unfortunately not implemented :(.
That's certainly frustrating. I will think on this a bit more later today to see if there might be better alternatives.
One variation that I'm fairly sure doesn't work, but that I'll outline for posterity, is one in which the plugin itself does the program replacement in phase two:
- Cilium -> plugin, send settings, ask for hook points + constraints
- plugin -> Cilium, send hook points + constraints, ask for [prog ID + BTF type ID]s
- Cilium -> plugin, send [prog ID + BTF type ID + hook mapping]s
- plugin -> Cilium, send OK (subprograms replaced)
Once the plugin closes the FD for any links it creates at the end of the request, it would detach the hooks it just attached. In other words, the only reference held to the hooks would be the links themselves. The plugin could pin the links, but then you're back to square one with a stateful plugin interface that needs additional lifecycle operations for garbage cleanup.
Since this thread is getting long, I wanted to summarize the approaches considered so far for Cilium-plugin coordination:
Option 1: Plugin Simply Loads Programs
The plugin simply loads programs and returns their IDs. It doesn't release any FDs until the third phase when Cilium acknowledges that it has taken ownership of the programs.
- Cilium -> plugin, send settings, ask for hook points + constraints
- plugin -> Cilium, send hook points + constraints, ask for [prog ID + BTF type ID]s
- Cilium -> plugin, send [prog ID + BTF type ID]s, ask for program IDs
- plugin -> Cilium, send program IDs, ask for ACK
- Cilium -> plugin, ACK (allow plugin to release FDs)
Pros:
- Simple gRPC interface maybe?
Cons:
- Resource leaks and tricky coordination: a plugin needs to keep track of open FDs somewhere and handle cases where Cilium is killed before acknowledging a request.
- Implementation could get tricky: bidirectional gRPC or multiple unary requests. Either way there's a lot to consider regarding ephemeral state for the plugin. Plugins would need to consider what the right timeout is before freeing FDs.
Option 2: Plugin Places Programs Into a BPF_MAP_TYPE_PROG_ARRAY
Cilium creates a PROG_ARRAY and passes its ID to the plugin in the second request. The plugin inserts the PROG_EXT programs into this map.
- Cilium -> plugin, send settings, ask for hook points + constraints
- plugin -> Cilium, send hook points + constraints, ask for [prog ID + BTF type ID]s
- Cilium -> plugin, send [prog ID + BTF type ID]s, prog_array ID + hook-to-index mapping
- plugin -> Cilium, send OK (PROG_EXT programs in designated map indexes)
Pros:
- Would avoid any need for explicit ACKing for the plugin to free FDs that it creates or other stateful lifecycle operations for plugins.
Cons:
- The kernel rejects inserting BPF_PROG_TYPE_EXT programs into a PROG_ARRAY (see the prog_fd_array_get_ptr check quoted earlier).
Option 3: Cilium Creates Links And Plugin Updates Their Program
Cilium creates the dispatcher programs and creates links itself with dummy programs. The plugins replace the programs associated with the links using BPF_LINK_UPDATE.
- Cilium -> plugin, send settings, ask for hook points + constraints
- plugin -> Cilium, send hook points + constraints, ask for [link IDs + program name]s
- Cilium -> plugin, send [link ID + program name]s
- plugin -> Cilium, send OK
Pros:
- Would avoid any need for explicit ACKing for the plugin to free FDs that it creates or other stateful lifecycle operations for plugins.
Cons:
- It's not possible to update the program for a tracing link: BPF_LINK_UPDATE is not implemented for tracing links.
Option 4: Plugin Pins Links To BPFFS
Before the second request, Cilium designates a pin path for each hook. Plugins load programs for each hook and pin them to the designated path. These pins are inherently ephemeral.
- Cilium -> plugin, send settings, ask for hook points + constraints
- plugin -> Cilium, send hook points + constraints, ask for [prog ID + BTF type ID]s
- Cilium -> plugin, send [prog ID + BTF type ID ]s, program-to-pin-path mapping
- plugin -> Cilium, send OK (programs pinned to designated pin paths)
Pros:
- Avoids any need for explicit ACKing for the plugin to free FDs that it creates or other stateful lifecycle operations for plugins.
Cons:
- Plugins need to share the same BPFFS as Cilium which complicates configuration.
- Race conditions were mentioned above, though I'm not sure if there is a specific scenario which applies here where this would be a concern.
Option 5: Plugin Replaces Programs Itself
The plugin itself does the program replacement.
- Cilium -> plugin, send settings, ask for hook points + constraints
- plugin -> Cilium, send hook points + constraints, ask for [prog ID + BTF type ID]s
- Cilium -> plugin, send [prog ID + BTF type ID + hook mapping]s
- plugin -> Cilium, send OK (subprograms replaced)
Pros:
- In theory it avoids any need for explicit ACKing for the plugin to free FDs that it creates or other stateful lifecycle operations for plugins.
Cons:
- In actuality, once the plugin closes the FD for any links it creates at the end of the request, it would detach the hooks it just attached. In other words, the only reference held to the hooks would be the links themselves. The plugin could pin the links, but then you're back to square one with a stateful plugin interface that needs additional lifecycle operations for garbage cleanup.
Option 6: Put The Plugin In Charge Of Program Lifecycles
The plugin is expected to pin its own programs or links. There is no coordination about pin paths between Cilium and the plugin. It's OK if the plugin closes FDs after it pins the program.
Attachment:
- Cilium -> plugin, send settings, ask for hook points + constraints
- plugin -> Cilium, send hook points + constraints, ask for [prog ID + BTF type ID]s
- Cilium -> plugin, send [prog ID + BTF type ID ]s
- plugin -> Cilium, send OK (program or link IDs)
Later, when an endpoint is destroyed:
- Cilium -> plugin, send unregister/unload
- plugin -> Cilium, send OK
Additional lifecycle hooks might look something like:
// Used for full synchronization after restarts: Cilium lists endpoints known by plugin and either
// registers or unregisters them.
rpc ListEndpoints()
// Notify the plugin that a new endpoint exists. Add elements to maps, etc.
rpc RegisterEndpoint()
// Notify the plugin that an endpoint no longer exists. Remove elements from maps, unpin programs, etc.
rpc UnregisterEndpoint()
Pros:
- Avoids any need for explicit ACKing for the plugin to free FDs.
Cons:
- This opens up a box of questions around plugins and program/link lifecycle management. Cilium must match any load operation with an eventual unload or sync operation with the plugin that allows it to clean up garbage. This must work even through restarts.
- If a plugin deletes the pin to a link it could unload the program from the hook. A well-behaved plugin shouldn't do this, but it opens up more opportunity for errors.
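To illustrate the statefulness option 6 implies, a plugin might pin the link for each endpoint under a BPFFS path it owns and drop the pin when told the endpoint is gone; a minimal sketch with hypothetical paths and handler names:

package lifecycle

import (
    "fmt"
    "path/filepath"

    "github.com/cilium/ebpf/link"
)

const pluginPinDir = "/sys/fs/bpf/my-plugin" // plugin-owned, not coordinated with Cilium

// OnAttach pins the freplace link the plugin created for an endpoint so it
// survives the plugin closing its FDs or restarting.
func OnAttach(endpointID uint64, l link.Link) error {
    return l.Pin(filepath.Join(pluginPinDir, fmt.Sprintf("ep-%d", endpointID)))
}

// OnUnregisterEndpoint handles the UnregisterEndpoint call: dropping the last
// reference to the link detaches the hook.
func OnUnregisterEndpoint(endpointID uint64) error {
    pin := filepath.Join(pluginPinDir, fmt.Sprintf("ep-%d", endpointID))
    l, err := link.LoadPinnedLink(pin, nil)
    if err != nil {
        return err
    }
    defer l.Close()
    return l.Unpin()
}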
Comparison
It's unfortunate that options 2, 3, and 5 don't work, since those would be ideal. Of the options discussed so far, the only viable ones are 1 (Plugin Simply Loads Programs), 4 (Plugin Pins Links To BPFFS), and 6 (Put The Plugin In Charge Of Program Lifecycles). Each of these has its own set of drawbacks, so the question comes down to which is the least annoying.
Option 4 requires BPFFS coordination between the plugin and Cilium which complicates configuration. I agree this is less than ideal and would rather deal with purely IDs, but it at least maintains a somewhat stateless interface between Cilium and the plugin (the pin paths are ephemeral and it doesn't really matter if things fail in the middle since Cilium could just clear these out). Options 2, 3, 4, and 5 all share something in common: they aim to decouple the lifecycle of a program from that of its FD during a request by referencing that program from something Cilium owns. In the case of 2, 3, and 5 those would be actual maps, programs, or links, while in the case of option 4 Cilium "owns" the pin path in the sense that it controls its lifecycle. I don't currently see another mechanism we could use to decouple the lifecycle of PROG_EXT programs during the request to the plugin.
Option 1 invites resource leaks IMO. This concern would extend to other mechanisms for transferring ownership over a file descriptor like pidfd_getfd and SCM_RIGHTS, since you need an ack somewhere. These other mechanisms also create additional interface and coordination complexity. Regardless of mechanism, plugins would have to be very careful to keep track of what FDs are "in flight" and have some fallback mechanism in case Cilium dies and recovers in the middle of an endpoint regeneration sequence. Timeouts could ensure unACKed FDs are eventually cleaned up, but I worry that such timeouts could be tricky to manage in practice (is the system just slow or is it dead?).
Option 6 opens up some similar questions as option 1 regarding statefulness. With this approach, plugins are inherently stateful; it's expected that they maintain program pins or other state for any programs that are part of an active endpoint. Cilium knows this and needs to ensure state synchronization across both plugin and Cilium restarts. I actually think this route has certain advantages in terms of what kinds of plugins can be developed, but adds a lot more complexity to Cilium to keep the state in sync.
My feeling is that despite the drawbacks of BPFFS, option 4 creates a clearer contract between Cilium and plugins than option 1. Sorry to bring up the program pin thing again. I'm willing to discuss other options as well; I just wanted to make my concerns clear about some of these other approaches and weigh things side by side.
Thank you for the summary.
@ti-mo and I spoke about this offline. Wanted to give a rundown of where we stand. (Timo please correct me if I say anything wrong)
Option 1: Plugin Simply Loads Programs
The plugin simply loads programs and returns their IDs. It doesn't release any FDs until the third phase when Cilium acknowledges that it has taken ownership of the programs.
- Cilium -> plugin, send settings, ask for hook points + constraints
- plugin -> Cilium, send hook points + constraints, ask for [prog ID + BTF type ID]s
- Cilium -> plugin, send [prog ID + BTF type ID]s, ask for program IDs
- plugin -> Cilium, send program IDs, ask for ACK
- Cilium -> plugin, ACK (allow plugin to release FDs)
Pros:
- Simple gRPC interface maybe?
Cons:
- Resource leaks and tricky coordination: a plugin needs to keep track of open FDs somewhere and handle cases where Cilium is killed before acknowledging a request.
- Implementation could get tricky: bidirectional gRPC or multiple unary requests. Either way there's a lot to consider regarding ephemeral state for the plugin. Plugins would need to consider what the right timeout is before freeing FDs.
We acknowledge this makes the gRPC API and the implementation more complex. As you mention, it would require more logic at the plugin side to coordinate the FD lifetimes, especially during edge cases. We were wondering if something like an SDK / client library could help. This SDK / client library would take care of the unix socket / connection, gRPC and FD lifecycle. The plugin logic could take the form of 2 callbacks, the second of which returns an ebpf.Collection; the SDK / client library will manage the lifecycle complexity.
I don't think it's that big of a burden to maintain on the Cilium side, since we would likely want an example client for CI anyway.
Such an SDK would be in Go, so that does not really help anyone wanting to do Rust or C plugins. Though I hope that is no deal breaker.
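Purely as a sketch of what such an SDK could look like from the plugin author's side (all names here are invented): two callbacks, with the library owning the socket, the gRPC exchange, and the FD lifetimes.

package cdppsdk

import "github.com/cilium/ebpf"

// HookRequest and ReplaceTarget mirror the information exchanged in the first
// and second phases of the protocol.
type HookRequest struct {
    Settings map[string]string
}

type HookResponse struct {
    Hooks       []string // hook points the plugin wants
    Constraints []string // ordering constraints relative to other plugins
}

type ReplaceTarget struct {
    Hook      string
    ProgID    ebpf.ProgramID // dispatcher program to extend
    BTFTypeID uint32         // global function to replace
}

// Plugin is implemented by the plugin author; the library drives it.
type Plugin interface {
    // Hooks is the first callback: declare hook points and constraints.
    Hooks(req HookRequest) (HookResponse, error)
    // Load is the second callback: load the PROG_EXT programs for the given
    // targets and hand the resulting collection to the library, which then
    // owns the FDs and finishes the exchange with Cilium.
    Load(targets []ReplaceTarget) (*ebpf.Collection, error)
}

// A Run(ctx, socketPath, plugin) entry point would hide the connection and
// lifecycle handling; its implementation is omitted in this sketch.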
Option 4: Plugin Pins Links To BPFFS
Before the second request, Cilium designates a pin path for each hook. Plugins load programs for each hook and pin them to the designated path. These pins are inherently ephemeral.
- Cilium -> plugin, send settings, ask for hook points + constraints
- plugin -> Cilium, send hook points + constraints, ask for [prog ID + BTF type ID]s
- Cilium -> plugin, send [prog ID + BTF type ID ]s, program-to-pin-path mapping
- plugin -> Cilium, send OK (programs pinned to designated pin paths)
Pros:
- Avoids any need for explicit ACKing for the plugin to free FDs that it creates or other stateful lifecycle operations for plugins.
Cons:
- Plugins need to share the same BPFFS as Cilium which complicates configuration.
- Race conditions were mentioned above, though I'm not sure if there is a specific scenario which applies here where this would be a concern.
I acknowledge the pros of using BPFFS, specifically allowing plugins to be more stateless. The reason we have resisted this approach is that it essentially introduces a second channel of communication, with its own possibility of failures.
First, the sharing of the BPFFS, which is inconvenient but not insurmountable, given there already has to be coordination about mounts for the unix socket.
Cilium loads programs, specifically endpoint programs, concurrently. So it's not unreasonable that on startup there will be many concurrent plugin interactions. The BPFFS is shared by multiple threads in a plugin process as well as by multiple plugins. So we would need to establish rules / a protocol on the FS layer about directory usage or naming conventions to avoid stomping on each other. The race condition mentioned might happen when one thread loads its programs and pins them, then, before the gRPC message is sent, a second thread overwrites the pins, such that the same path points to a different program, something that would be hard to debug. These can likely be mitigated by using a directory per plugin and a directory per collection being loaded, but it requires plugins to do that properly and not to interfere with each other on the FS.
The concern for resource leakage is also not entirely solved. If a plugin crashes, or is forcibly killed, between pinning the programs and sending the gRPC message, the pins will linger forever. Cilium did not get the message, and thus will not know it's safe to clean up the pins. The plugin, after restarting, will not know it needs to do the cleanup unless it persists that data somehow. The same goes if Cilium restarts after programs are pinned but before it processed the gRPC reply. The plugin will have handed it off, and the pins will remain until system restart. You could make a case that with pins it's even worse since they stick around until the system restarts; with FDs, the death of a process will cause the programs to be collected.
I am just not convinced its better than the other options.
Sorry to bring up the program pin thing again
No worries, it's not a banned topic of conversation. We appreciate you working with us to look for alternatives as well, even though not all of them are possible.
Option 6: Put The Plugin In Charge Of Program Lifecycles
The plugin is expected to pin its own programs or links. There is no coordination about pin paths between Cilium and the plugin. It's OK if the plugin closes FDs after it pins the program.
Attachment:
- Cilium -> plugin, send settings, ask for hook points + constraints
- plugin -> Cilium, send hook points + constraints, ask for [prog ID + BTF type ID]s
- Cilium -> plugin, send [prog ID + BTF type ID ]s
- plugin -> Cilium, send OK (program or link IDs)
Later, when an endpoint is destroyed:
- Cilium -> plugin, send unregister/unload
- plugin -> Cilium, send OK
Additional lifecycle hooks might look something like:
// Used for full synchronization after restarts: Cilium lists endpoints known by plugin and either
// registers or unregisters them.
rpc ListEndpoints()
// Notify the plugin that a new endpoint exists. Add elements to maps, etc.
rpc RegisterEndpoint()
// Notify the plugin that an endpoint no longer exists. Remove elements from maps, unpin programs, etc.
rpc UnregisterEndpoint()
Pros:
- Avoids any need for explicit ACKing for the plugin to free FDs.
Cons:
- This opens up a box of questions around plugins and program/link lifecycle management. Cilium must match any load operation with an eventual unload or sync operation with the plugin that allows it to clean up garbage. This must work even through restarts.
- If a plugin deletes the pin to a link it could unload the program from the hook. A well-behaved plugin shouldn't do this, but it opens up more opportunity for errors.
This option is in some sense the easiest because no transfer of ownership happens. But as you point out, now we need to share information about Cilium program lifecycles in order to allow the plugin to clean up. That is more work for both Cilium and plugins. Sounds worse than option 1 or 4 to me personally.
Regarding BPFFS, I'll preface this by outlining the specific scheme I had in mind to clear up any ambiguity before addressing some of your specific concerns below:
Request Handling
- Cilium maintains a directory within BPFFS for all pins related to in-flight plugin requests, e.g., `/sys/fs/bpf/cilium/operations`. It's possible this could be further subdivided by plugin if it provides any benefit, e.g., `/sys/fs/bpf/cilium/operations/plugin-a`.
- On a regeneration event, Cilium generates a UUID for the operation and creates a subdirectory within this directory, e.g., `/sys/fs/bpf/cilium/operations/6a950f84-d8c0-468f-a8e5-c30545fd1ca5`.
- In the second phase of plugin coordination, Cilium sends a mapping of pin path -> hook to the plugin in the request, e.g., `{ "mappings": [ { "pin": "/sys/fs/bpf/cilium/operations/6a950f84-d8c0-468f-a8e5-c30545fd1ca5/pre_cil_from_container", "hook": "pre_cil_from_container" }, ... ] }`
- The plugin loads the hook programs and pins them to the relevant paths before responding to the request (see the sketch after this list).
- Cilium deletes the subdirectory, `/sys/fs/bpf/cilium/operations/6a950f84-d8c0-468f-a8e5-c30545fd1ca5`, after consuming all pins provided by plugins in that round of coordination.
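From the plugin's side, the handoff then reduces to pinning the loaded program where Cilium told it to; a small cilium/ebpf sketch under those assumptions:

package pinops

import "github.com/cilium/ebpf"

// PinHook pins an already-loaded PROG_EXT hook program to the path Cilium
// designated for this operation, e.g.
// /sys/fs/bpf/cilium/operations/<uuid>/pre_cil_from_container.
func PinHook(hook *ebpf.Program, pinPath string) error {
    if err := hook.Pin(pinPath); err != nil {
        // If Cilium already cancelled the operation and removed the directory,
        // the pin fails; just release the program.
        hook.Close()
        return err
    }
    // The pin keeps the program alive after the FD is closed, so the plugin
    // holds no state once it has responded to the request.
    return hook.Close()
}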
Garbage Cleanup
- If an in-flight request times out or fails in some way other than an explicit nack from the plugin, Cilium can't be sure if the operation is still in progress, but it needs to cancel the operation and ensure that any future pin attempts for that request will fail. So, it deletes the parent directory for the request, `/sys/fs/bpf/cilium/operations/6a950f84-d8c0-468f-a8e5-c30545fd1ca5`. Any future attempts to create pins in this directory will fail and in-flight operations will eventually stop trying.
- If for some reason Cilium shuts down and restarts in the middle of this, it clears out the parent directory for all plugin pin operations on startup, `/sys/fs/bpf/cilium/operations`, which would have a similar effect.
Cilium loads programs, specifically endpoint programs, concurrently. So it's not unreasonable that on startup there will be many concurrent plugin interactions. The BPFFS is shared by multiple threads in a plugin process as well as by multiple plugins. So we would need to establish rules / a protocol on the FS layer about directory usage or naming conventions to avoid stomping on each other.
I think this would be largely mitigated with the scheme outlined above. Each operation would have its own directory within which it operates. Plugins never determine where pins are created, they just listen to what Cilium tells them to do. In a sense, this is no more complex than the PROG_ARRAY or link approaches outlined above where Cilium sends the plugin some kind of mapping between slot/handle and hook; it's just that the handles in this case are file paths.
The race condition mentioned might happen when one thread loads its programs and pins them, then, before the gRPC message is sent, a second thread overwrites the pins, such that the same path points to a different program, something that would be hard to debug. These can likely be mitigated by using a directory per plugin and a directory per collection being loaded, but it requires plugins to do that properly and not to interfere with each other on the FS.
Keeping the scheme above in mind, the scenario you describe here (two concurrent threads writing to the same pin paths) shouldn't be possible unless plugins are doing something egregious.
These can likely be mitigated by using a directory per plugin and a directory per collection being loaded, but it requires plugins to do that properly and not to interfere with each other on the FS.
Looking at this from the plugin perspective, this doesn't seem significantly more complex to understand and implement correctly than the PROG_ARRAY with an index -> hook mapping or the link -> hook mapping approaches we considered before. In all of these cases, you would need to correctly associate a program to some handle provided to you in the request with an additional BPF operation, whether that be a map insertion, link update, or pin creation.
The concern for resource leakage is also not entirely solved. If a plugin crashes, or is forcibly killed, between pinning the programs and sending the gRPC message, the pins will linger forever. Cilium did not get the message, and thus will not know it's safe to clean up the pins.
In this scenario either Cilium would eventually deem the request failed (via request timeout or otherwise) or Cilium would shut down before that point and clear out the plugin request pin path directories on startup anyway.
The plugin, after restarting, will not know it needs to do the cleanup unless it persists that data somehow. The same goes if Cilium restarts after programs are pinned but before it processed the gRPC reply. The plugin will have handed it off, and the pins will remain until system restart. You could make a case that with pins it's even worse since they stick around until the system restarts; with FDs, the death of a process will cause the programs to be collected.
Just to reiterate, Cilium would do cleanup when it restarts. The plugin restarting isn't such a problem as outlined in the last point. I suppose you could argue that if Cilium is stopped and never restarted then pins could stick around, but that's true today of other pins as well.
I am just not convinced its better than the other options.
Judging from our previous discussion, I think we both agree that the BPF_MAP_TYPE_PROG_ARRAY or BPF links approaches would be ideal if they worked. To me, the BPFFS approach is a way to maintain that same interaction model between Cilium and plugin, ensuring plugins can remain relatively stateless. Personally, I feel that the extra coordination you'd need in option 1 to maintain FD lifecycles is worse and shifts complexity to the plugin and its interface instead of that complexity living in Cilium. IMO the pin path approach actually gives Cilium more control over lifecycles and garbage collection. As you mention above, this would require heavier client library support and raise the barrier to entry for other languages, etc. Mount point coordination between Cilium and the plugin is a slight wart, but nothing terrible to manage (similar requirements exist for systems like CSI, e.g.).
If you're still not convinced, then I'll concede and we can try to mitigate the drawbacks of option 1 as best as possible.
I think the request handling as described addresses my concerns with the BPFFS approach. Having these temporary and unique directories solves collision and cleanup concerns I had. I can personally see it working this way.
Just a quick update on where I'm at: I just got back from KubeCon where I talked about this during the Cilium dev summit. This week and next week I'll try to distill down all that we've discussed and update the CFP itself. Hopefully that will put us in a good place to iron out some of the smaller details and let us merge this some time soon.
OK, I went ahead and updated the CFP. It should capture most of what we've discussed, but let me know if you feel anything is missing. One thing that occurred to me w.r.t. passing program or map IDs to a plugin is that translation of IDs into FDs requires CAP_SYS_ADMIN on the part of the plugin. This may be worth considering in the future as we think about how Cilium shares programs/maps with a plugin. For basic use cases, plugins are just loading and pinning their own programs.
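For reference, the translation itself is a one-liner with cilium/ebpf, but the underlying BPF_PROG_GET_FD_BY_ID command is privileged:

package sharing

import "github.com/cilium/ebpf"

// OpenSharedProgram turns a program ID that Cilium shares over the API into a
// local FD. This fails with a permission error unless the plugin process has
// CAP_SYS_ADMIN.
func OpenSharedProgram(id ebpf.ProgramID) (*ebpf.Program, error) {
    return ebpf.NewProgramFromID(id)
}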