cri-resource-manager icon indicating copy to clipboard operation
cri-resource-manager copied to clipboard

pkg/cri: switch to CRI v1 API, add bridging for pre-v1 clients.

Open klihub opened this issue 3 years ago • 13 comments

Switch to talking CRI v1 to the CRI runtime/server. Bridge any pre-v1 (only CRI v1alpha2) client requests, unless this is explicitly disabled on the command line.

klihub avatar Mar 01 '22 13:03 klihub

Codecov Report

Merging #781 (a3c57f5) into master (6b458ca) will not change coverage. The diff coverage is 22.22%.

@@           Coverage Diff           @@
##           master     #781   +/-   ##
=======================================
  Coverage   34.87%   34.87%           
=======================================
  Files          60       60           
  Lines        8874     8874           
=======================================
  Hits         3095     3095           
  Misses       5477     5477           
  Partials      302      302           
Impacted Files Coverage Δ
pkg/cri/resource-manager/cache/container.go 17.33% <4.76%> (ø)
pkg/cri/resource-manager/cache/pod.go 18.15% <33.33%> (ø)
pkg/cri/resource-manager/cache/cache.go 22.01% <45.45%> (ø)
pkg/cri/resource-manager/cache/utils.go 34.69% <100.00%> (ø)

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov-commenter avatar Mar 01 '22 18:03 codecov-commenter

In general, I'm ok with all of this. One separate worry that I have is about "unsafe.Pointer()" cast for practically 100% of data structures: does match in all cases of we have something that might be problematic?

The CRI-O guys landed a few commits in the v1alpha2 CRI implementation to bring it struct-wise identical to v1 with the explicit goal of being able to do this type of near-zero-cost protocol bridging. The CRI-O implementation is essentially doing identical thing to what we do here, and that is the only reason why I dared to go with that instead of a convert-by-copying approach.

klihub avatar Mar 02 '22 17:03 klihub

LGTM. Right now the option is enabled by default, is the plan to disable it by default or remove it, in the future?

Yes, that's exactly what I had in mind: first we have it as opt-out, once we're sure we can expect mostly everybody to talk v1 we make it opt-in, and eventually we remove it altogether.

klihub avatar Mar 02 '22 17:03 klihub

Tested on debian-sid with

  • kubelet v1.19.0 (CRI v1alpha2) + containerd v1.6.0-41-gc523102b5
  • kubelet v1.23.4 (CRI v1) + containerd v1.6.0-41-gc523102b5 ...and it works fine.

Also tried out with containerd v1.4.0, that talks CRI v1alpha2. As promised in the commit message, this is not supposed to work. cri-resmgr output:

F0303 07:32:12.850610    4447 main.go:89] failed to start resource manager: resource-manager: cache synchronization pod query failed: rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService

My only question would be related to this. Merging the PR will prevent using CRI-RM master branch with older container runtime versions, which will make it harder to integrate CRI-RM to certain environments. On the other hand, without this PR, CRI-RM works with old and new kubelet and runtimes as our primary backend runtimes implement compatibility layers.

What do you think, should we add support for v1alpha2 towards CRI runtimes, too? (In this PR or in a separate PR?)

The initial idea was to support bridging in both directions. Then I quite quickly changed my mind regarding bridging from new clients to old runtimes. I think that is neither worth the effort nor necessarily very safe: I'm not sure if clients talking the new protocol version would expect talking to that old a runtime. Also looking at the official EOL chart, I think dropping support for 1.19/older should not be a problem...

If we consider keeping v1alpha2 support around for a longer time, then I think using a dedicated branch/per-release branches/binary/binaries for it could/would probably be the way to go. That branch/those branches would simply have one additional commit swapping v1 for v1alpha2.

WDYT?

klihub avatar Mar 03 '22 14:03 klihub

@askervin @jukkar @kad @marquiz PTAL, I had to rebase this already approved one to resolve a conflict.

klihub avatar Jun 14 '22 10:06 klihub

@kad, @marquiz, @askervin, @jukkar I consider merging this for the 0.7.0 release... WDYT ?

klihub avatar Jun 15 '22 21:06 klihub

@kad, @marquiz, @askervin, @jukkar I consider merging this for the 0.7.0 release... WDYT ?

Okay, after testing that idea against our current e2e test set, it's better to keep this one out for now. Many of our tests pull in old runtimes without support for the v1 protocol.

klihub avatar Jun 16 '22 06:06 klihub

@kad @marquiz @askervin @jukkar @fmuyassarov Once we are 100 % sure that we don't want/need to support any containerd or cri-o version which does not speak CRI v1, this should be good to go in.

klihub avatar Sep 08 '22 11:09 klihub

In our e2e test point of view... I didn't run full tests here, but checked that cri-resmgr can talk to distro's on-the-stock containerd:

distro=debian-sid: ok

distro=ubuntu-22.04: ok

distro=ubuntu-20.04: fail (containerd-1.5.9)

root@vm> cri-resmgr -relay-socket /var/run/cri-resmgr/cri-resmgr.sock -runtime-socket /var/run/containerd/containerd.sock -image-socket /var/run/containerd/containerd.sock -force-config cri-resmgr.cfg   >cri-resmgr.output.txt 2>&1 &
cri-resmgr last output line:
F: [   cri-resmgr   ] failed to start resource manager: resource-manager: failed to query runtime version: rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService

distro=centos-8: ok

distro=fedora: ok

ubuntu-20.04 end-of-life is Aug/2025. What do you think?

askervin avatar Sep 08 '22 12:09 askervin

distro=ubuntu-20.04: fail (containerd-1.5.9) ... ubuntu-20.04 end-of-life is Aug/2025. What do you think?

I'm trying to become a better human being, so I can't/won't tell you that...

But I can see 3 choices here:

  1. We forget about this PR for now.
  2. We add dual-version bridging to the egress side (relay/server) as well.
  3. If we think there is a viable more recent alternative to 20.04, we stop supporting it despite it not being EOL'd.

klihub avatar Sep 08 '22 13:09 klihub

3. If we think there is a viable more recent alternative to 20.04, we stop supporting it despite it not being EOL'd.

Hmm, please do not remove the support for 20.04 yet, I am using it in the VM with github actions.

jukkar avatar Sep 08 '22 13:09 jukkar

  1. If we think there is a viable more recent alternative to 20.04, we stop supporting it despite it not being EOL'd.

Hmm, please do not remove the support for 20.04 yet, I am using it in the VM with github actions.

Okay... I wonder if we has this enabled for our repo:

/hold

klihub avatar Sep 08 '22 14:09 klihub

  1. If we think there is a viable more recent alternative to 20.04, we stop supporting it despite it not being EOL'd.

Hmm, please do not remove the support for 20.04 yet, I am using it in the VM with github actions.

Okay... I wonder if we has this enabled for our repo:

/hold

Hmm. The robots I'm looking for are not these...

klihub avatar Sep 08 '22 14:09 klihub

We'll need to pick this up again and get it eventually merged: v1alpha2 is going the way of the dodo bird.

klihub avatar Nov 04 '22 09:11 klihub