gcp-compute-persistent-disk-csi-driver icon indicating copy to clipboard operation
gcp-compute-persistent-disk-csi-driver copied to clipboard

Migrate to csi proxy v2 alpha

Open alexander-ding opened this issue 3 years ago • 67 comments
trafficstars

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change /kind bug /kind cleanup /kind design /kind documentation /kind failing-test /kind feature /kind flake

What this PR does / why we need it:

Migrates GCP Compute PD CSI Driver to use CSI Proxy v2. Besides updating the Go package code to use the new library, we also update the deployment files to run the driver as HostProcess containers. See the linked issue.

Which issue(s) this PR fixes:

Fixes #1070

Special notes for your reviewer:

This is a work in progress. We still need to update documentation to reflect the increased minimal Kubernetes version.

Does this PR introduce a user-facing change?:

Migrated Windows CSI node plugin to HostProcess containers

Action required: Ensure that the k8s version is 1.23+ and set the security policies
in the cluster to enable the HostProcess container feature only to system workloads.

alexander-ding avatar Oct 31 '22 19:10 alexander-ding

Hi @alexander-ding. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 31 '22 19:10 k8s-ci-robot

/cc @mauriciopoppe

alexander-ding avatar Oct 31 '22 19:10 alexander-ding

/ok-to-test /test pull-gcp-compute-persistent-disk-csi-driver-e2e-win2019

There's an optional presubmit job for Windows, please look for it in the test-infra codebase

mauriciopoppe avatar Oct 31 '22 19:10 mauriciopoppe

It was not /test pull-gcp-compute-persistent-disk-csi-driver-e2e-win2019 :(

mauriciopoppe avatar Oct 31 '22 19:10 mauriciopoppe

/test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019

alexander-ding avatar Oct 31 '22 20:10 alexander-ding

I haven't run /test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019 in a while so it might have drifted from the CI one, I'd take that one and compare it with the presubmit

mauriciopoppe avatar Oct 31 '22 20:10 mauriciopoppe

/retest

alexander-ding avatar Oct 31 '22 21:10 alexander-ding

The last run https://testgrid.k8s.io/provider-gcp-compute-persistent-disk-csi-driver#windows-2019-presubmit-gcp-compute-persistent-disk-csi-driver is red but here it's green, probably you can check again after removing the TODO

mauriciopoppe avatar Nov 01 '22 01:11 mauriciopoppe

One more thing, I created a few tracking issues a while ago in this repo too, I think we should use those

https://github.com/kubernetes-csi/csi-proxy/issues/217 tracks the issues in this repo

mauriciopoppe avatar Nov 01 '22 16:11 mauriciopoppe

New changes are detected. LGTM label has been removed.

k8s-ci-robot avatar Nov 01 '22 19:11 k8s-ci-robot

/test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019

mauriciopoppe avatar Nov 01 '22 19:11 mauriciopoppe

Thanks for the PR @alexander-ding !

/approve

I'll let @mauriciopoppe give the final lgtm when all the comments are fixed.

mattcary avatar Nov 01 '22 19:11 mattcary

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexander-ding, mattcary

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Nov 01 '22 19:11 k8s-ci-robot

/cc @mattcary requesting a re-review for the latest commit. I switched over to the new hpc image that @mauriciopoppe suggested and had to change a few things in the Makefile and Dockerfile to tag the image. Could you take a look and see if there's anything we would need to fix in automated testing/release pipelines?

alexander-ding avatar Nov 02 '22 16:11 alexander-ding

Thanks for the ping. I think this should be fine. @mauriciopoppe do you agree this will work with our internal louhi pipeline? The BASE_IMAGE will be ignored but we will still tag the manifest. I guess we might end up building identical images multiple times (ie a couple of ltsc, 20h2) where really we should have one image with multiple tags?

mattcary avatar Nov 02 '22 16:11 mattcary

/test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019

alexander-ding avatar Nov 02 '22 17:11 alexander-ding

Thanks for the ping. I think this should be fine. @mauriciopoppe do you agree this will work with our internal louhi pipeline? The BASE_IMAGE will be ignored but we will still tag the manifest. I guess we might end up building identical images multiple times (ie a couple of ltsc, 20h2) where really we should have one image with multiple tags?

I took a look. I think the pipeline will do redundant work but should be fine otherwise. We can refactor the Makefile here and the Louhi pipelines together when the upstream issue is actually fixed.

alexander-ding avatar Nov 02 '22 18:11 alexander-ding

Awesome, thanks for following through.

I'll let Mauricio give the final lgtm, my approve should still be good.

On Wed, Nov 2, 2022 at 11:19 AM Alexander Ding @.***> wrote:

Thanks for the ping. I think this should be fine. @mauriciopoppe https://github.com/mauriciopoppe do you agree this will work with our internal louhi pipeline? The BASE_IMAGE will be ignored but we will still tag the manifest. I guess we might end up building identical images multiple times (ie a couple of ltsc, 20h2) where really we should have one image with multiple tags?

I took a look. I think the pipeline will do redundant work but should be fine otherwise. We can refactor the Makefile here and the Louhi pipelines together when the upstream issue https://github.com/containerd/containerd/issues/7431 is actually fixed.

— Reply to this email directly, view it on GitHub https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/pull/1071#issuecomment-1301041583, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIJCBAHXVNUTWMUCVSHC46DWGKWBNANCNFSM6AAAAAARTNJCOA . You are receiving this because you were mentioned.Message ID: <kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/pull/1071/c1301041583 @github.com>

mattcary avatar Nov 02 '22 18:11 mattcary

/retest

alexander-ding avatar Nov 02 '22 19:11 alexander-ding

Test failed but it seemed like a flake. The test was waiting for the external injector pod to come online, but once it did it complains "container not found". I don't see how this is related to my changes. Also, for some reason the optional windows test has no test history.

I had to run retest earlier for the regular e2e test as well due to a flake. Might be something worth looking into if this keeps happening

alexander-ding avatar Nov 02 '22 19:11 alexander-ding

/test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019

alexander-ding avatar Nov 07 '22 22:11 alexander-ding

/test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019

alexander-ding avatar Nov 17 '22 23:11 alexander-ding

@mattcary we're trying to make changes in the base manifests but are unsure about modifying the overlays here and also the changes in the internal manifests, I thought about a couple of options and wanted to know your thoughts

  • modify the base manifests with v2 and add overlays for low k8s versions that add back the v1 config
  • modify the overlays with v2 versions in the k8s version that support HPC

Is there another option that you suggest?

I'd prefer not to minimize touching the old version manifests, because we don't have much in the way of testing around them I think to confirm that they actually work. So I think it's safer to add overlays for dev, prow, stable-master for v2.

Then maybe when we start rolling off old versions (eg 1.21 and 1.22 should drop soon as they are EOL, 1.23 is going EOL early next year) we can switch and have an overlay for the v1 and have v2 in the base.

I think this also means you should copy stable-master to a new 1.25 version that uses v1, then have stable-master with an overlay for v2.

Does that all make sense?

mattcary avatar Nov 18 '22 00:11 mattcary

@mattcary we're trying to make changes in the base manifests but are unsure about modifying the overlays here and also the changes in the internal manifests, I thought about a couple of options and wanted to know your thoughts

  • modify the base manifests with v2 and add overlays for low k8s versions that add back the v1 config
  • modify the overlays with v2 versions in the k8s version that support HPC

Is there another option that you suggest?

I'd prefer not to minimize touching the old version manifests, because we don't have much in the way of testing around them I think to confirm that they actually work. So I think it's safer to add overlays for dev, prow, stable-master for v2.

Then maybe when we start rolling off old versions (eg 1.21 and 1.22 should drop soon as they are EOL, 1.23 is going EOL early next year) we can switch and have an overlay for the v1 and have v2 in the base.

I think this also means you should copy stable-master to a new 1.25 version that uses v1, then have stable-master with an overlay for v2.

Does that all make sense?

@mattcary to make sure we're on the same page, you're suggesting that we

  1. Keep the base manifests as is
  2. Rename stable-master to stable-1.25
  3. Create a new stable-master that overlays v2 stuff on the base manifest
  4. Modify dev, noauth, and prow overlays to use v2 as well

Is that right?

Note that the windows node overlay for v2 would be pretty involved, and we have to copy that for every overlay that needs v2 (and remove it when we move the base manifest to use v2). If that's okay, I can go ahead with that

alexander-ding avatar Nov 18 '22 03:11 alexander-ding

thanks for the feedback @mattcary

@alexander-ding I think that the steps you wrote look good to me, I think that it's easy to add things because kustomize will just merge the manifests, it's a little bit difficult to delete things though (e.g. the sections in the Spec that mount and use the named pipes) but it's also doable.

mauriciopoppe avatar Nov 22 '22 23:11 mauriciopoppe

  1. Keep the base manifests as is
  2. Rename stable-master to stable-1.25
  3. Create a new stable-master that overlays v2 stuff on the base manifest
  4. Modify dev, noauth, and prow overlays to use v2 as well

Oops, sorry for the lag in reply.

Yes, those 4 steps are what I had in mind.

Is that right?

Note that the windows node overlay for v2 would be pretty involved, and we have to copy that for every overlay that needs v2 (and remove it when we move the base manifest to use v2). If that's okay, I can go ahead with that

Yeah, understood. I think it's important to keep the base simple while we're tweaking the v2 overlay ---we need to be able to easily make changes to the production overlays which mean 1.23-1.25 at this point.

When v2 development gets stable then we can switch over.

mattcary avatar Nov 23 '22 23:11 mattcary

/test pull-gcp-compute-persistent-disk-csi-driver-e2e-win2019

alexander-ding avatar Nov 28 '22 19:11 alexander-ding

@alexander-ding: The specified target(s) for /test were not found. The following commands are available to trigger required jobs:

  • /test pull-gcp-compute-persistent-disk-csi-driver-e2e
  • /test pull-gcp-compute-persistent-disk-csi-driver-kubernetes-integration
  • /test pull-gcp-compute-persistent-disk-csi-driver-sanity
  • /test pull-gcp-compute-persistent-disk-csi-driver-unit
  • /test pull-gcp-compute-persistent-disk-csi-driver-verify

The following commands are available to trigger optional jobs:

  • /test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019

Use /test all to run the following jobs that were automatically triggered:

  • pull-gcp-compute-persistent-disk-csi-driver-e2e
  • pull-gcp-compute-persistent-disk-csi-driver-kubernetes-integration
  • pull-gcp-compute-persistent-disk-csi-driver-unit
  • pull-gcp-compute-persistent-disk-csi-driver-verify

In response to this:

/test pull-gcp-compute-persistent-disk-csi-driver-e2e-win2019

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Nov 28 '22 19:11 k8s-ci-robot

/test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019

alexander-ding avatar Nov 28 '22 19:11 alexander-ding

@mattcary @mauriciopoppe I pushed some changes with the overlays. However, I noticed that the stable overlays all point to specific image tags, but we don't have a fixed version tag for the latest development image.

I did some more readings on the release process for this repository, namely the overlay strategies and go/pdcsi-oss-release-process (linked to by the repository README). I think we should take a step back and talk through the release pipeline.

Strictly speaking, this PR doesn't introduce new functionalities, so it can be considered a patch, not a new minor release. In this case, this PR should only update the development overlays (dev and noauth-debug) and nothing else. Then, we need to merge this PR to master and cherry pick the change into the release 1.8 branch, which is currently in prerelease, and release the updated driver as v1.8.0-rc2.

Once the new prerelease is generated, we can create a separate PR to update the prow-stable-sidecar-rc-master overlay to use the rc2 prerelease and ensure that tests are green.

When ready to release, we can update all the relevant production overlays (stable-1-23, stable-1-24, and stable-master) to be the same as the prow-stable-sidecar-rc-master staging overlay.

Releasing stable-1-25 should be a separate process.

If we decide that this is a big enough change to merit its own minor version, then this release needs to be v1.9.0. We would have to go through the entire pre-release/release process. We should probably wait until v1.8 is stable before merging to master.

Let me know what you think. I'll go ahead and revert to the overlay changes I outlined (only dev and noauth-debug).

alexander-ding avatar Nov 28 '22 21:11 alexander-ding