
efs-plugin crash loops when a storage class is created with a fixed uid and gid, and access point creation fails

Open kyanar opened this issue 2 years ago • 2 comments

/kind bug

What happened? I created a storage class attached to an EFS file system. As soon as the EFS CSI driver became aware of it, the efs-plugin container crashed and began crash looping with the following panic:

github.com/kubernetes-sigs/aws-efs-csi-driver/pkg/driver.(*IntHeap).Push(...)
	/go/src/github.com/kubernetes-sigs/aws-efs-csi-driver/pkg/driver/gid_allocator.go:25
github.com/kubernetes-sigs/aws-efs-csi-driver/pkg/driver.(*GidAllocator).releaseGid(0xc000075f80, {0xc0000b01e0, 0xc0004184e0}, 0x21)
	/go/src/github.com/kubernetes-sigs/aws-efs-csi-driver/pkg/driver/gid_allocator.go:74 +0xcf
github.com/kubernetes-sigs/aws-efs-csi-driver/pkg/driver.(*Driver).CreateVolume(0xc00013c180, {0xd6fea0, 0xc00009e960}, 0xc000504000)  
	/go/src/github.com/kubernetes-sigs/aws-efs-csi-driver/pkg/driver/controller.go:244 +0xedb  
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler.func1({0xd6fea0, 0xc00009e960}, {0xba6320, 0xc000504000})  
	/go/src/github.com/kubernetes-sigs/aws-efs-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:5676 +0x78  
github.com/kubernetes-sigs/aws-efs-csi-driver/pkg/driver.(*Driver).Run.func1({0xd6fea0, 0xc00009e960}, {0xba6320, 0xc000504000}, 0xc000103bb8, 0xb0cf20)  
	/go/src/github.com/kubernetes-sigs/aws-efs-csi-driver/pkg/driver/driver.go:101 +0x3d  
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler({0xbb62a0, 0xc00013c180}, {0xd6fea0, 0xc00009e960}, 0xc000094d20, 0xc0d6d8)  
	/go/src/github.com/kubernetes-sigs/aws-efs-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:5678 +0x138  
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00014dc00, {0xd7b060, 0xc0000fe000}, 0xc0000c0240, 0xc0003efb30, 0x122d1a0, 0x0)  
	/go/src/github.com/kubernetes-sigs/aws-efs-csi-driver/vendor/google.golang.org/grpc/server.go:1286 +0xc8f  
google.golang.org/grpc.(*Server).handleStream(0xc00014dc00, {0xd7b060, 0xc0000fe000}, 0xc0000c0240, 0x0)
	/go/src/github.com/kubernetes-sigs/aws-efs-csi-driver/vendor/google.golang.org/grpc/server.go:1609 +0xa2a  
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	/go/src/github.com/kubernetes-sigs/aws-efs-csi-driver/vendor/google.golang.org/grpc/server.go:934 +0x98  
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/src/github.com/kubernetes-sigs/aws-efs-csi-driver/vendor/google.golang.org/grpc/server.go:932 +0x294

CloudTrail indicated that the CreateAccessPoint call failed because the identity-based policy did not grant permission to perform the action on the specified file system: the file system's resource tags did not match the policy's condition.

What you expected to happen? The storage class should work, the persistent volumes should become available, and pods should be able to claim them. If access point creation fails, an appropriate error message should be reported and the driver should not crash.

How to reproduce it (as minimally and precisely as possible)? The following manifest created the storage class that caused the crash:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: redacted-k8s-storage
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-045a5356b0acxxxxx
  basePath: "/kubernetes_dynamic_pvc"
  directoryPerms: "700"
  uid: "33"
  gid: "33"

The matching file system was missing the tag checked by the "aws:ResourceTag/efs.csi.aws.com/cluster" condition in the IAM policy.
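For context, a policy that scopes CreateAccessPoint by cluster tag looks roughly like the fragment below. This is an illustrative sketch, not the reporter's actual policy; the condition operator and the "true" value are assumptions, and only the "aws:ResourceTag/efs.csi.aws.com/cluster" key comes from the issue text.

```json
{
  "Effect": "Allow",
  "Action": "elasticfilesystem:CreateAccessPoint",
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/efs.csi.aws.com/cluster": "true"
    }
  }
}
```

With a condition like this, a file system lacking the tag makes CreateAccessPoint fail with Access Denied, which is what CloudTrail showed here.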

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version): version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.6-eks-7d68063", GitCommit:"f24e667e49fb137336f7b064dba897beed639bad", GitTreeState:"clean", BuildDate:"2022-02-23T19:29:12Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}

  • Driver version: 1.3.8

  • Chart version: 2.2.6

kyanar avatar May 11 '22 05:05 kyanar

Upon review I discovered that the issue was that the access point creation failed with an Access Denied error. This should result in an appropriate message being written to the efs-plugin container's standard output, but before that message is emitted there is a call to d.gidAllocator.releaseGid, which panics: when a fixed uid and gid are provided, GidAllocator.initFsId is never called, so GidAllocator.fsIdGidMap has no entry for the file system, and releaseGid blindly calls Push on a nil heap. Will submit a PR to fix this.
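The failure mode above can be reproduced in isolation. This is a minimal standalone sketch, not the driver's actual code; the IntHeap type and fsIdGidMap name mirror gid_allocator.go, and the 0x21 argument in the stack trace is GID 33, matching the storage class.

```go
package main

import (
	"container/heap"
	"fmt"
)

// IntHeap is a min-heap of ints, analogous to the one in gid_allocator.go.
type IntHeap []int

func (h IntHeap) Len() int            { return len(h) }
func (h IntHeap) Less(i, j int) bool  { return h[i] < h[j] }
func (h IntHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *IntHeap) Push(x interface{}) { *h = append(*h, x.(int)) } // nil receiver panics here
func (h *IntHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// releaseGid mimics the buggy error path: it looks up the filesystem's heap
// and pushes without checking whether the map entry exists.
func releaseGid(fsIdGidMap map[string]*IntHeap, fsId string, gid int) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r) // runtime error: nil pointer dereference
		}
	}()
	// With a fixed uid/gid, initFsId never ran, so this lookup yields a nil *IntHeap.
	heap.Push(fsIdGidMap[fsId], gid)
	return "released"
}

func main() {
	fmt.Println(releaseGid(map[string]*IntHeap{}, "fs-12345", 33))
}
```

Running this prints the recovered nil pointer dereference instead of crash looping, but in the driver the panic is unrecovered and kills the efs-plugin container.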

kyanar avatar May 12 '22 00:05 kyanar

PR #733 will solve this if/when it gets merged or superseded.

kyanar avatar Jul 24 '22 06:07 kyanar

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 22 '22 07:10 k8s-triage-robot

/remove-lifecycle stale

kyanar avatar Oct 22 '22 07:10 kyanar

The same issue occurs if the access point creation fails because the AP limit has been hit. If a static GID/UID is used, the nil pointer dereference happens; if a dynamic GID/UID is used, a proper error message is shown.

steromano87 avatar Nov 17 '22 22:11 steromano87

It appears that PR #850 fixes this. (AWS Support also tells me that the 1.4.7 release scheduled for 16 December includes this fix.)

kyanar avatar Dec 14 '22 23:12 kyanar

@kyanar Hi, your analysis is indeed correct. I was not aware of your issue and was fixing another bug with the GID allocator, but yes #850 should fix this issue as well.

If Amazon ever increases the AP limit, the driver could run into performance issues, since my patch relies on there being at most 120 APs per EFS filesystem and safely ignores any GID range above that limit. For reasons unknown to me, users can set the range to very large numbers via SC parameters, but anything above this internal limit couldn't be used with the old code anyway. With the patch, the driver checks all possible GIDs (capped at 120) from the given range each time a volume is created, which is O(n) and would not scale well if the limit ever increases significantly.
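The capped linear scan described above can be sketched as follows. This is an assumed shape, not the actual #850 patch; the function name and the way the cap is applied are illustrative, and only the 120-AP limit and the O(n) scan come from the comment.

```go
package main

import "fmt"

// accessPointPerFsLimit reflects the EFS limit of access points per
// filesystem that the patch assumes.
const accessPointPerFsLimit = 120

// nextFreeGid linearly scans the user-configured GID range, capped at the
// internal limit, and returns the first unused GID: an O(n) scan that runs
// on every CreateVolume call.
func nextFreeGid(rangeStart, rangeEnd int, used map[int]bool) (int, error) {
	end := rangeEnd
	if end-rangeStart+1 > accessPointPerFsLimit {
		// A huge user-supplied range is safely ignored beyond the cap,
		// since GIDs past the AP limit could never be allocated anyway.
		end = rangeStart + accessPointPerFsLimit - 1
	}
	for gid := rangeStart; gid <= end; gid++ {
		if !used[gid] {
			return gid, nil
		}
	}
	return 0, fmt.Errorf("no free GID in range %d-%d", rangeStart, end)
}

func main() {
	used := map[int]bool{1000: true, 1001: true}
	gid, err := nextFreeGid(1000, 6000000, used)
	fmt.Println(gid, err) // 1002 <nil>
}
```

The cap keeps the scan bounded even for a huge configured range, which is why the approach only becomes a scaling concern if the per-filesystem AP limit itself grows.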

RomanBednar avatar Dec 19 '22 08:12 RomanBednar

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 19 '23 09:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Apr 18 '23 09:04 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar May 18 '23 09:05 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar May 18 '23 09:05 k8s-ci-robot