aws-efs-csi-driver icon indicating copy to clipboard operation
aws-efs-csi-driver copied to clipboard

Skip deallocating Gid when static Gid set

Open kyanar opened this issue 2 years ago • 10 comments

Is this a bug fix or adding new feature? Bug fix

What is this PR about? / Why do we need it? The efs-plugin crashes with a segmentation fault if there was an error creating an access point if the storage class is defined with a fixed uid and gid because it attempts to deallocate the gid using an uninitialised gidAllocator, and does not log the error for the cluster admin to resolve, as the code to log the error occurs after the segfault.

This PR adds a check to see if the "allocated gid" is the default Go int value, and skips attempting to deallocate it if is so.

What testing is done? Is captured by existing test case for fixed uid/gid allocation

kyanar avatar Jul 07 '22 05:07 kyanar

Hi @kyanar. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jul 07 '22 05:07 k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kyanar To complete the pull request process, please assign ashley-wenyizha after the PR has been reviewed. You can assign the PR to them by writing /assign @ashley-wenyizha in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Jul 07 '22 05:07 k8s-ci-robot

/cc @ashley-wenyizha

kyanar avatar Aug 22 '22 00:08 kyanar

This issue is also currently preventing us from being able to leverage the newer versions of the efs-csi-driver for our production environments.

daghaian avatar Aug 30 '22 22:08 daghaian

Having this same issue as well in our environment since we are setting GIDs. Would appreciate it if this could be reviewed and merged please!

MarkSpencerTan avatar Aug 30 '22 22:08 MarkSpencerTan

@MarkSpencerTan @daghaian just so you're aware, if you're running into this issue it's because the EFS CSI driver has already failed to mount your EFS, because the bug is unfortunately in the exception handler, meaning it doesn't get a chance to output a useable error message.

I recommend checking CloudTrail for any AccessDenied errors - in my case I was using a version of the IAM policy that did not confer efs:CreateAccessPoint on the CSI controller service account's IAM role.

kyanar avatar Aug 30 '22 23:08 kyanar

@kyanar Thanks for the suggestion. We did see AccessPointAlreadyExists errors being thrown inside CloudTrail so its plausible something is going on there.

daghaian avatar Aug 31 '22 16:08 daghaian

@kyanar thanks for the suggestions! We tried to figure out what was causing the AccessPointAlreadyExists errors that we were getting, however, it seems to be happening at random times or some sort of race condition and seems like the EFS CSI Driver handles this by just doing a retry and it works again. This is why we've never really seen any issues until we have started setting the Gid...

When the Gid is set and this problem arises, the EFS CSI Driver pods errors out continuously, preventing any further PVCs from being created until the problematic PVCs are cleared out.

I've captured the logs of the efs csi driver pod when this error happens without the Gid being set so you can see what it does normally:

https://gist.github.com/MarkSpencerTan/96775d9b2b3043ce7647693ecce309be#file-gistfile1-txt-L163

Would appreciate it if this fix becomes available since this is causing the efs-csi-driver to be very unstable in cases where the Gid is being set.

MarkSpencerTan avatar Sep 03 '22 01:09 MarkSpencerTan

Unfortunately none of the approvers for this project appear to be active on GitHub - the most recent was 18th August, so I can't find anyone to review it.

kyanar avatar Sep 05 '22 01:09 kyanar

/cc @Ashley-wenyizha

kyanar avatar Sep 26 '22 06:09 kyanar

@Ashley-wenyizha this is a one line fix for an issue affecting quite a few users, can this be looked at please?

kyanar avatar Oct 22 '22 07:10 kyanar

Pull #850 corrects this behaviour.

kyanar avatar Dec 14 '22 23:12 kyanar