cluster-api-provider-kubevirt

Provide an alternate qemu-guest-agent strategy to check VM's CAPI Sentinel file

BarthV opened this issue on Apr 25, 2023

Context

Following the last syncup meeting (Tuesday, April 18), we are now targeting multiple new features:

  1. short term: bypass the Sentinel file check with a "none" policy
  2. mid term: work on an alternate qemu-guest-agent usage to check the file
  3. long term: make the whole SSH logic optional (for keygen & cloud-init management, ...) with something like a Sentinel file check mode selector in the spec

This issue tracks mid-term item 2.

Feature Description

Referring to https://github.com/kubernetes-sigs/cluster-api-provider-kubevirt/issues/230, we need to check the bootstrap status using something other than SSH, and rely only on the KubeVirt apiserver for everything (scheduling, probing, Sentinel checking, ...).

https://github.com/kubernetes-sigs/cluster-api-provider-kubevirt/issues/233 needs to be implemented before starting work on this issue.

Related Issues

  • https://github.com/kubernetes-sigs/cluster-api-provider-kubevirt/issues/230
  • https://github.com/kubernetes-sigs/cluster-api-provider-kubevirt/issues/233

BarthV avatar Apr 25 '23 10:04 BarthV

Quick update on this issue.

I've been working on it for a couple of days, and I'm now quite confident in saying that the current implementation creates major problems down the road.

I'll try to explain as clearly as I can what is blocking us & what should be changed to implement qemu-guest-agent-based checks in the VM's virt-launcher pod.

What is going on (during the bootstrap check)?

  1. Machine reconciliation is triggered in kubevirt_machinecontroller.go
  2. The reconciler fetches & carries a global context with multiple pieces of compiled information (vm, vmi, kubevirtcluster, kubevirtmachines, ssh keys, ...)
  3. The reconciler builds the infra cluster Kubernetes client
    • Depending on the KubeVirt topology (local or external), the machine controller reuses the existing CAPK controller client (based on controller-runtime) or builds a new one from a kubeconfig (also based on controller-runtime)
  4. The bootstrap check is done in the machine.go implementation, which receives the global context & manipulates this controller-runtime client to configure & trigger an SSH check
  5. The bootstrap result is returned to the reconciler (see the sketch below)
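
As a rough illustration, steps 4-5 boil down to something like this (names are simplified, not the exact CAPK symbols):

```go
package machine

import "context"

// Illustrative-only shape of the reconciler <-> externalMachine interaction;
// names are simplified, not the exact CAPK symbols. Today IsBootstrapped() is
// answered by SSH-ing into the VM and looking for the CAPI Sentinel file
// (/run/cluster-api/bootstrap-success.complete).
type externalMachine interface {
	SupportsCheckingIsBootstrapped() bool
	IsBootstrapped(ctx context.Context) bool
}

// bootstrapped is the verdict the reconciler ultimately wants back in step 5.
func bootstrapped(ctx context.Context, m externalMachine) bool {
	return m.SupportsCheckingIsBootstrapped() && m.IsBootstrapped(ctx)
}
```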

What is the problem?

If we want to check the bootstrap status using an exec command, we need a Kubernetes client that supports such a feature.

But it seems that the controller-runtime client is not able to run exec commands on pods at all (https://github.com/kubernetes-sigs/controller-runtime/issues/1229). So the only (good) way to trigger a pod exec from the code would be to build a new kube client based on k8s.io/client-go (which supports spawning pod exec commands).
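
For reference, a minimal sketch of what a pod exec looks like with client-go's remotecommand SPDY executor (the namespace/pod/container/command arguments are placeholders; in our case it would target the VM's virt-launcher pod):

```go
package machine

import (
	"bytes"
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// execInPod runs a command inside a pod container and returns its stdout.
// This is the capability controller-runtime's client does not expose.
func execInPod(ctx context.Context, cfg *rest.Config, namespace, pod, container string, command []string) (string, error) {
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return "", err
	}

	// Build the exec subresource request, exactly like `kubectl exec` does.
	req := clientset.CoreV1().RESTClient().
		Post().
		Resource("pods").
		Namespace(namespace).
		Name(pod).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: container,
			Command:   command,
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
	if err != nil {
		return "", err
	}

	var stdout, stderr bytes.Buffer
	if err := exec.StreamWithContext(ctx, remotecommand.StreamOptions{Stdout: &stdout, Stderr: &stderr}); err != nil {
		return "", fmt.Errorf("exec failed: %w (stderr: %s)", err, stderr.String())
	}
	return stdout.String(), nil
}
```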

Although it's easy to say, such a change requires extracting the controller-runtime client's configuration in order to stay compatible with local / external setups. This configuration must therefore be extracted when building (or reusing) the infrastructure client (during step 3).

Also, extracting the kube client configuration is not as simple as it looks. If we're reusing the local controller client, it's not just a matter of "parsing a kubeconfig file", and we might need to change multiple files in many packages, with a global impact on CAPK.

The main problem I see is that we're not able to return the information needed to spawn a new client (rest.Config{}) when we're dealing with an internal infra client (https://github.com/kubernetes-sigs/cluster-api-provider-kubevirt/blob/main/pkg/infracluster/infracluster.go#L42-L44)... So we'll have to keep specific code for both the internal & external cases all the way down. This makes the whole configuration & client management a bit more complex.

Why does it have a big impact?

So far, I've identified & tested several things we'd have to implement to support such a change:

  • we need to make the infra cluster configuration available in the controller's global context
  • this configuration cannot be extracted from an existing controller-runtime client, so it has to be captured during the client build stage
  • the machine bootstrap checker should be aware of this configuration, and build an ephemeral kube client using the k8s.io/client-go lib
  • it should then use that client whenever the qemu-guest-agent check is requested

The k8s.io/client-go lib allows building a new REST client from a rest.Config{} struct. We can easily return it during client generation because we're already working with it (https://github.com/kubernetes-sigs/cluster-api-provider-kubevirt/blob/main/pkg/infracluster/infracluster.go#L75-L78).
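
Concretely, the idea would be to reshape GenerateInfraClusterClient along these lines. This is only a hypothetical sketch: the real signature is different, and the kubeconfig secret key name is an assumption.

```go
package infracluster

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// GenerateInfraClusterClient (hypothetical shape): in addition to the
// controller-runtime client, return the *rest.Config it was built from, so
// callers can later spawn a client-go clientset for pod exec.
func GenerateInfraClusterClient(ctx context.Context, ownerClient client.Client, ownerConfig *rest.Config, kubeconfigSecret *corev1.Secret) (client.Client, *rest.Config, error) {
	if kubeconfigSecret == nil {
		// Local/internal topology: reuse the manager's client and its config.
		return ownerClient, ownerConfig, nil
	}

	// External topology: build a rest.Config from the kubeconfig stored in the
	// secret (the "kubeconfig" key is an assumption), then a controller-runtime
	// client on top of it.
	restConfig, err := clientcmd.RESTConfigFromKubeConfig(kubeconfigSecret.Data["kubeconfig"])
	if err != nil {
		return nil, nil, err
	}
	infraClient, err := client.New(restConfig, client.Options{})
	if err != nil {
		return nil, nil, err
	}
	return infraClient, restConfig, nil
}
```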

Building a REST client also requires codecs & a GVK scheme that we may have to carry along the way too (https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/client/apiutil?utm_source=godoc#RESTClientForGVK). A more "standard" client can perfectly well be created using the generic controller-runtime constructor (https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/client#New), but managing schemes & GVKs is still a concern with this method.
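
For example, building that kind of client straight from a rest.Config still means assembling and registering the scheme ourselves. A minimal sketch (registering kubevirt.io/api/core/v1 types is my assumption of what we'd need here):

```go
package infracluster

import (
	"k8s.io/apimachinery/pkg/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/client"

	kubevirtv1 "kubevirt.io/api/core/v1"
)

// newInfraClient builds a "standard" controller-runtime client directly from
// a rest.Config. The scheme has to explicitly carry every GVK we want to
// read: core types plus KubeVirt's VirtualMachine/VirtualMachineInstance.
func newInfraClient(cfg *rest.Config) (client.Client, error) {
	scheme := runtime.NewScheme()
	if err := clientgoscheme.AddToScheme(scheme); err != nil {
		return nil, err
	}
	if err := kubevirtv1.AddToScheme(scheme); err != nil {
		return nil, err
	}
	return client.New(cfg, client.Options{Scheme: scheme})
}
```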

Technical focus

In order to make this work, we'd need several changes:

  • The GenerateInfraClusterClient func, in addition to returning the right controller-runtime client, should also return the k8s.io/client-go/rest rest.Config configuration
  • this configuration should be carried along in the rest of the reconciliation context
  • all controllers should ignore this new configuration except bootstrap_controller, which will pass it to the externalMachine object (r.MachineFactory.NewMachine)
  • the externalMachine implementation will stop relying on an existing controller-runtime client; instead it will take a rest.Config{} as input and build a generic Go client that is able both to check CRD specs & status and to run exec commands, depending on the bootstrap check mode

Since the rest.Config{} is sufficient to create a new client and is also transparent for both local and remote configurations, it seems to be a good pivot to make the bootstrap check a bit more independent from the main controller kube client.
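
In other words, externalMachine could be built around the rest.Config alone and derive both clients from it. Hypothetical shape, the field and constructor names are mine:

```go
package machine

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical externalMachine shape: built from a rest.Config instead of an
// injected controller-runtime client, so it can serve both kinds of requests.
type externalMachine struct {
	crClient  client.Client        // CRD spec/status reads (VM/VMI conditions, ...)
	clientset kubernetes.Interface // pod exec for the qemu-guest-agent check
	config    *rest.Config         // kept around for the SPDY executor
}

func newExternalMachine(cfg *rest.Config, scheme *runtime.Scheme) (*externalMachine, error) {
	crClient, err := client.New(cfg, client.Options{Scheme: scheme})
	if err != nil {
		return nil, err
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	return &externalMachine{crClient: crClient, clientset: clientset, config: cfg}, nil
}
```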

Proposed action items

  1. Probably extract the manager's main client config (in main.go) and load it into the reconciler's global context (see the sketch below)
    • This would allow us to work exclusively with the rest client/config (https://github.com/kubernetes-sigs/cluster-api-provider-kubevirt/blob/main/main.go#L111) whether we're dealing with a local infra cluster or an external one
  2. Extract the rest.Config{} during controller client generation. Update all dependencies & tests
  3. Make the rest.Config{} available in the context
  4. Adapt the machine controller & external machine implementation to use the rest.Config{} for every kube request instead of the existing controller-runtime client
    • this is probably better than maintaining 2 clients for the same infra cluster in the same package
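
For item 1, a minimal sketch of the main.go side (the RestConfig field name on the reconciler is hypothetical):

```go
package main

import (
	"os"

	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical reconciler shape: the existing controller-runtime Client plus
// a new RestConfig field that gets propagated into the reconciliation context.
type kubevirtMachineReconciler struct {
	Client     client.Client
	RestConfig *rest.Config
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}

	// Action item 1: capture the manager's own rest.Config once and hand it to
	// the reconciler, so local and external infra clusters can be handled the
	// same way downstream.
	r := &kubevirtMachineReconciler{
		Client:     mgr.GetClient(),
		RestConfig: mgr.GetConfig(),
	}
	_ = r // the real main.go would now call r.SetupWithManager(mgr) and mgr.Start(...)
}
```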

I've already started a proof of concept (https://github.com/BarthV/cluster-api-provider-kubevirt/tree/guestagentcheck), but the change is huge and I realized that it cannot fit in a single PR.

Let's discuss this plan (or any other possible ones) at the next meeting.

BarthV avatar Jun 05 '23 14:06 BarthV

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 22 '24 03:01 k8s-triage-robot

/remove-lifecycle stale

agradouski avatar Feb 06 '24 16:02 agradouski

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 06 '24 16:05 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jun 05 '24 17:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jul 05 '24 17:07 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 05 '24 17:07 k8s-ci-robot