CFP-42453: Oracle Cloud Infrastructure (OCI) Cloud Provider Design

Open trungng92 opened this issue 1 month ago • 5 comments

Cilium Issue Link

As discussed during the Cilium Weekly Community meeting, this is a CFP that starts a discussion on various possible integrations with Cilium and will ultimately help determine which solution works best for OCI.

trungng92, Oct 28 '25 14:10

I think @antonipp summarized the state pretty well 👍 The one thing I would add around option (2) is just that it has been a struggle to find people willing to help maintain IPAM code. The cloud SDKs are large dependencies that we have minimal understanding of, and I see that as a maintenance burden and risk for the project as we try to keep them up to date. I'm not excited about the idea of adding yet another cloud SDK.

joestringer, Nov 04 '25 02:11

> On the topic of the highlighted text from this thread, what do you mean by "support model"?

Thanks for the replies. What @antonipp discussed is what I was looking for. Essentially, in a solution where there's an in-tree OCI integration with Cilium, who becomes responsible for the OCI integration? Cilium? (Of course I wouldn't want to put additional burden on your team 🙂) OCI? The community?

> I think @antonipp summarized the state pretty well 👍 The one thing I would add around option (2) is just that it has been a struggle to find people willing to help maintain IPAM code. The cloud SDKs are large dependencies that we have minimal understanding of, and I see that as a maintenance burden and risk for the project as we try to keep them up to date. I'm not excited about the idea of adding yet another cloud SDK.

In a typical integration between two services (Service A and Service B), ideally:

  • If Service A adds a new feature, they validate that the feature works with Service B
  • And if Service B adds a new feature, they validate that the feature works with Service A

But I know this doesn't always happen in the real world (as @antonipp gave good examples for). Also, it's not a 1:1 model; it's a 1:many model. I can understand that testing every cloud provider for every feature would be difficult for Isovalent, which could lead to features getting stale or left behind for certain cloud providers.

I am strongly leaning towards attaching a CIDR block and updating the v1.Node.spec.podCIDR field (option 1) in CCM, as it is the simplest solution to maintain (from OCI's point of view, as long as Isovalent maintains the contract of processing the podCIDR field) and will work for a majority of our customers.
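
To make option 1 concrete, here is a minimal sketch of the patch body a CCM could send to set the pod CIDR on a Node. It only builds and validates the JSON merge patch using the Go standard library. The helper name `podCIDRPatch` and the CIDR value are hypothetical, and how OCI's CCM would actually submit the patch (for example through client-go) is not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/netip"
)

// podCIDRPatch is a hypothetical helper (not part of any CCM or Cilium API).
// It builds a JSON merge-patch body that sets spec.podCIDR and spec.podCIDRs
// on a Kubernetes Node object.
func podCIDRPatch(cidr string) ([]byte, error) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, fmt.Errorf("invalid pod CIDR %q: %w", cidr, err)
	}
	// Reject inputs such as 10.244.3.7/24 that have host bits set.
	if prefix != prefix.Masked() {
		return nil, fmt.Errorf("pod CIDR %q has host bits set", cidr)
	}
	patch := map[string]any{
		"spec": map[string]any{
			"podCIDR":  prefix.String(),
			"podCIDRs": []string{prefix.String()},
		},
	}
	return json.Marshal(patch)
}

func main() {
	// Example value only; a real CCM would use the CIDR block it attached
	// to the instance.
	body, err := podCIDRPatch("10.244.3.0/24")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

On the Cilium side, this approach relies on Cilium continuing to read the pod CIDR from the Node resource, which is the contract mentioned above.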

The main drawback is that it isn't compatible with a multi-NIC solution. At the same time though, there don't seem to be any "easy" solutions for multi-NIC use cases, and any multi-NIC solution will require a deeper in-tree or out-of-tree integration. Aside from cloud providers implementing multi-NIC, does Cilium have any expectations around NIC usage? Is the default behavior/expectation that Cilium will just use the default IP route as specified by the OS?

trungng92, Nov 04 '25 19:11

Cc: @rbtr

xmulligan, Nov 13 '25 00:11

> The main drawback is that it isn't compatible with a multi-NIC solution. At the same time though, there don't seem to be any "easy" solutions for multi-NIC use cases, and any multi-NIC solution will require a deeper in-tree or out-of-tree integration. Aside from cloud providers implementing multi-NIC, does Cilium have any expectations around NIC usage? Is the default behavior/expectation that Cilium will just use the default IP route as specified by the OS?

We've also been exploring multi-NIC, and so far we are only offering it via AzCNI because we didn't want to dump a bunch of implementation-specific code in Cilium and there was no standard contract for it.

But isn't this the promise of DRA(NET) and NRI?

rbtr, Nov 19 '25 04:11

I don't have experience with NRI, but we have looked into DRA a bit.

For anyone who needs background on Dynamic Resource Allocation, the idea is that you can add "attributes" to devices (e.g. NICs), and then in your pod you can request devices that meet specified attribute requirements.
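
As a rough illustration of that matching idea, here is a toy model in Go. It is not the Kubernetes DRA API: the `Device` type, the attribute keys (`subnet`, `rdma`), and the VNIC names are all made up for the example, and real DRA requests express selection differently.

```go
package main

import "fmt"

// Device is a toy stand-in for a device (e.g. a NIC) advertised with
// attributes. It is not a Kubernetes type.
type Device struct {
	Name       string
	Attributes map[string]string
}

// matches reports whether the device carries every required attribute
// with the required value.
func matches(d Device, required map[string]string) bool {
	for key, want := range required {
		if got, ok := d.Attributes[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// selectDevices returns the names of all devices that satisfy the
// request, in input order.
func selectDevices(devices []Device, required map[string]string) []string {
	var names []string
	for _, d := range devices {
		if matches(d, required) {
			names = append(names, d.Name)
		}
	}
	return names
}

func main() {
	// Hypothetical inventory of two VNICs on a node.
	devices := []Device{
		{Name: "vnic0", Attributes: map[string]string{"subnet": "frontend", "rdma": "false"}},
		{Name: "vnic1", Attributes: map[string]string{"subnet": "backend", "rdma": "true"}},
	}
	// A pod asking for any device on the backend subnet.
	fmt.Println(selectDevices(devices, map[string]string{"subnet": "backend"}))
}
```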

It does seem useful for multi-NIC cases, although it might require the consumable capacity feature to be usable (multiple pods connecting to the same VNIC):

https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#consumable-capacity

Perhaps the answer to generic multi-NIC support for Cilium should be to wait for this feature to be available.

trungng92, Nov 25 '25 21:11