aws-efs-csi-driver icon indicating copy to clipboard operation
aws-efs-csi-driver copied to clipboard

Adds capability to provision directories on the EFS dynamically

Open jonathanrainer opened this issue 3 years ago • 54 comments

This PR was the result of a hackathon at my current company as well as some work outside to sure up the unit tests and E2E test the final product.

Is this a bug fix or adding new feature? New feature, requested in #538 and #517 and possibly other issues, this comes up a lot as a feature people would like.

What is this PR about? / Why do we need it? Since its creation the driver has supported Access Point Provisioning as its main means of dynamic provisioning. However this is problematic for a few reasons, it can causes issues with user permissions around deleting provisioned directories and there is a hard limit of 120 AccessPoints per EFS which for some use cases is very quickly depleted. Further it requires various AWS IAM Permissions that can be complicated to sort out and manage.

This PR allows a new provisioning mode called efs-dir so that instead of creating EFS access points, directories are created instead. This new provisioning mode supports all of the previous parameters (UID, GID, GID Ranges etc.) and so is very easy to transition to, or run alongside the access point provisioning style if preferred. At present the directories are named after the PersistentVolume that is created to ensure a unique directory name and not run into security problems but once #640 merges there's no reason why the same method wouldn't apply to this provisioning method as well with a little bit of extra work. Either this PR can be updated or #640 can be extended if this merges first.

This is achieved by creating a new Interface called Provisioner that is implemented by an AccessPointProvisioner (the original method) and a DirectoryProvisioner (the new method) this also allows in future for different kinds of provisioning to occur, maybe FileSystemProvisioning for example.

What testing is done? I've moved around some of the Unit Tests that used to relate to controller.go into new files that reference each of the provisioners. These all run 🟢 and I'm fairly happy I've covered most of the important cases. Overall I'm not 100% sure about the placement of the tests and the coverage but it definitely covers all the new code and the code that existed and replicates existing behaviour.

I've also run this through on my own EKS cluster with the deleteProvisionedDir flag on and off and it performs exactly as I anticipated it would. Directories get provisioned, and then when the PVC gets deleted they either are deleted or remain. I've also written E2E tests to run in CI for creation and deletion but will need the PR to be approved before I can road test those.

fixes #538

jonathanrainer avatar Jul 01 '22 09:07 jonathanrainer

Hi @jonathanrainer. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jul 01 '22 09:07 k8s-ci-robot

@Ashley-wenyizha Hi, could I get an approved to test on here so I can start sorting out the E2E tests?

jonathanrainer avatar Jul 01 '22 10:07 jonathanrainer

@Ashley-wenyizha @wongma7 @jsafrane I truly believe a feature like this PR is critical for the EFS CSI driver to remain useful, and I would greatly appreciate your thoughts on this PR.

With the current approach, people will often get themselves into a situation where they are suddenly unable to provision new PVCs (once they hit 120 EFS "Access Points" in an AWS Account) and then have no idea what is wrong, let alone be able to fix it without moving to another CSI driver.

The kubernetes-csi/csi-driver-nfs driver already uses a "sub-folder" approach that works on arbitrary NFS servers (including EFS), so if the EFS driver is unwilling to implement this feature, we should consider deprecating this EFS driver and point people towards the official NFS one, in the interest of not leaving users with a ticking timebomb that breaks their cluster once 120 PVCs exist.

thesuperzapper avatar Jul 04 '22 01:07 thesuperzapper

/ok-to-test

jsafrane avatar Jul 14 '22 12:07 jsafrane

HI @jonathanrainer

Great work.

We are using the #640 for a while, but the 120 access points are a real blocker for us, so this looks like a better solution for our problem in the near future.

We only have two problems, and one of them, as you said in the description are solved by the #640 and that is the subPathPattern and ensureUniqueDirectory , since we want to write to the root folder (and folders paths inside), and we not want the creation of one folder per PVC. As an example:

  • we have controllers writing to the folder A using the subPath on the deployment
  • and we have some pods, writing to the folder A/xxx-xxx also using the subPath on the deployment

Using the #640 we are able to do this.

The other problem we see, is that, since you are not using access points, all the files in the EFS are written as root user.

Why not use a single access point per storage class to do all the work?

Thanks for the hard work and hope this two PR's are merged soon.

andre-lx avatar Aug 05 '22 15:08 andre-lx

@jonathanrainer Thanks for this PR. It solves a big problem for us where we can't use APs because we don't use Amazon DNS. I took this and merged with https://github.com/kubernetes-sigs/aws-efs-csi-driver/pull/687 (fixing some merge conflicts along the way) and they both work great together. One thing, which may or may not be related to the combined PR on my end, but for directory provisioning, when the SC reclaimPolicy: delete the created directory does not get deleted. If I set basePath: /foo/bar when the PVC and PV are deleted, /foo/bar/pvc-unique-id directory still remains.

oe-hbk avatar Aug 23 '22 20:08 oe-hbk

Thanks for this @jonathanrainer! I've done a build of this and deployed to our cluster where I've force-updated the EFS storage class to use directory provisioning. Existing PVs all still mount correctly via access points and new PVs are being provisioned correctly using directories.

FleetAdmiralButter avatar Aug 31 '22 02:08 FleetAdmiralButter

Hello, any news on this? I think it's critical to merge it for this driver's future.

ChamMach avatar Sep 24 '22 07:09 ChamMach

@jonathanrainer can you please rebase the PR? @wongma7 @torredil can you please review it? IMO it's quite solid.

jsafrane avatar Sep 26 '22 13:09 jsafrane

@jsafrane Sorry I've kind of let this one get away from me a bit, I've just changed job but should be able to dedicate some time to this later in the week. Think it needs a rebase and the E2Es need looking at but it's nearly there.

jonathanrainer avatar Sep 26 '22 14:09 jonathanrainer

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jonathanrainer Once this PR has been reviewed and has the lgtm label, please assign wongma7 for approval by writing /assign @wongma7 in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Sep 30 '22 21:09 k8s-ci-robot

Right @jsafrane this is now fixed, realised that I'd made a mistake in the implementation of DeleteVolume so the node was trying to delete the folder while it was still mounted and thus got stuck, which caused a knock on effect to other tests in the suite. Now that's fixed and we've got 🟢 E2Es, this is ready to be reviewed properly.

@wongma7 and @torredil if you could have a look I'd be really appreciative so we can push this towards getting merged whenever is convenient.

jonathanrainer avatar Oct 02 '22 11:10 jonathanrainer

Hello @jonathanrainer any news on this? 🤔

ChamMach avatar Nov 03 '22 06:11 ChamMach

This seems SO close be being done, and is SO desperately needed! 😄 Think this might make it over the finish line in the near term? 🙏

metasim avatar Nov 30 '22 21:11 metasim

@jsafrane Hi, sorry it's taken so long but I've addressed your comments and rebased. Sadly I can't seem to work out a solution to the end-to-end tests, as every time I try and create the new folder the permissions set get ignored, hence the failure. Any ideas? I've been hitting my head against this for hours and can't find a solution :(

jonathanrainer avatar Feb 04 '23 15:02 jonathanrainer

@jonathanrainer I tried to run the tests, it seems to me that the error is caused by umask in the driver container. cat /proc/1/status shows Umask: 0022 there and MkDirAll says:

The permission bits perm (before umask) are used for all directories that MkdirAll creates.

So w bits are cleared from the group + other permissions. IMO it's not a bad idea to add os.Chmod after os.MkDirAll.

jsafrane avatar Feb 08 '23 13:02 jsafrane

@jsafrane Thanks so much, that got it sorted

@Ashley-wenyizha Realise the process of this merging might be a bit weird, so how can we progress this forward?

jonathanrainer avatar Feb 08 '23 16:02 jonathanrainer

I poked AWS people more than enough times. /lgtm /approve

@jonathanrainer thanks a lot for the PR and your patience!

jsafrane avatar Apr 26 '23 14:04 jsafrane

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jonathanrainer, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Apr 26 '23 14:04 k8s-ci-robot

Hi @jonathanrainer

Sorry for the delayed response, we will review this PR and see if we can release the feature in the coming releases.

Note : there are issues with our test cluster and we are doing a hot fix so pull-aws-efs-csi-driver-external-test-eks might fail on the PR.

Meanwhile, I see we have 19 commits in the PR, could we rebase and merge these commits to few key commits?

Thanks for your contribution and sorry about the delayed response

Ashley-wenyizha avatar Apr 26 '23 14:04 Ashley-wenyizha

/retest

jsafrane avatar Apr 26 '23 15:04 jsafrane

New changes are detected. LGTM label has been removed.

k8s-ci-robot avatar Apr 26 '23 17:04 k8s-ci-robot

Hi @jonathanrainer

Sorry for the delayed response, we will review this PR and see if we can release the feature in the coming releases.

Note : there are issues with our test cluster and we are doing a hot fix so pull-aws-efs-csi-driver-external-test-eks might fail on the PR.

Meanwhile, I see we have 19 commits in the PR, could we rebase and merge these commits to few key commits?

Thanks for your contribution and sorry about the delayed response

Hi, no worries just glad we can get this merged eventually :)

I've rebased and pulled it into 7 commits that cover the development process properly. Let me know if you want it more squashed or anything like that.

jonathanrainer avatar Apr 26 '23 17:04 jonathanrainer

/retest

jonathanrainer avatar Apr 27 '23 07:04 jonathanrainer

@Ashley-wenyizha so, are you going to review this? I am asking for review from AWS side for almost a year and always got a reply that "someone will look at it".

jsafrane avatar May 04 '23 16:05 jsafrane

@jsafrane Yes my team will review this, sorry about the delay, there has been internal team changes upon efs-csi-driver this year so we can allocate better resources working on our open source projects and support our customers.

On top of the code review itself, AWS require security team's review on new features as well so that will also cause some delay.

Sorry again about the delay, we now have attention upon this PR and https://github.com/kubernetes-sigs/aws-efs-csi-driver/pull/640

I will give update on this PR and PR 640 by the end of the week (5/12/2023) to see if we will need security review and timeline of this feature's release

These are very useful features and thanks for the effort and support @jonathanrainer!

Ashley-wenyizha avatar May 05 '23 18:05 Ashley-wenyizha

@Ashley-wenyizha is there any update on this in terms of timelines etc? Very happy to assist in this however I can

jonathanrainer avatar May 15 '23 06:05 jonathanrainer

hi @jonathanrainer

We are waiting for security team's reply, meanwhile we will start the code review and better understand of the new feature as we will need to write up the security survey as well for the new feature (and thanks for putting in the information in the feature overview as well, that is very helpful).

Thanks for your patience and follow up!

Ashley-wenyizha avatar May 15 '23 13:05 Ashley-wenyizha

@thesuperzapper, @andre-lx

Apologize for the super delayed response. We have started to support since Jan 17 2023

https://aws.amazon.com/about-aws/whats-new/2023/01/amazon-efs-increases-access-points-file-system/

Please feel free to ignore if you already aware of this feature enhancement.

Ashley-wenyizha avatar May 19 '23 18:05 Ashley-wenyizha

Hi @Ashley-wenyizha again, has there been any further progress or would you like any assistance in documenting the new feature? More than happy to help in any way I can :)

jonathanrainer avatar May 29 '23 06:05 jonathanrainer

@Ashley-wenyizha Hi, it's been over a month, do we have a timeline on a release plan for this and #640 ? Again, happy to help in any way I can :)

jonathanrainer avatar Jul 07 '23 15:07 jonathanrainer