
driver matches private storage accounts without networkEndpointType set


When an existing storage account in Azure is configured to use private endpoints and disallow public network access, the driver can match this storage account even with the networkEndpointType SC parameter unset. However, it will not configure a private endpoint for the worker subnet because networkEndpointType is not set to privateEndpoint - this causes volume mounts to fail on worker nodes.

This seems like a bug: the driver probably should not match "private" storage accounts if the networkEndpointType parameter is omitted or set to "". Or is there some other parameter for this scenario that we missed?

Driver version: v1.30.3

Reproducer:

  • a prerequisite is an existing storage account with private endpoints and public network access disabled (in the example below it's imageregistryrbednah6mbs)

Storage Class has networkEndpointType: "" (or omitted):

$ oc get sc/azurefile-csi-pub -o yaml
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  creationTimestamp: "2024-08-30T11:35:13Z"
  name: azurefile-csi-pub
  resourceVersion: "55979"
  uid: 0e152336-614b-41fc-aafc-3c0ffefd787a
mountOptions:
- mfsymlinks
- cache=strict
- nosharesock
- actimeo=30
parameters:
  networkEndpointType: ""
  skuName: Standard_LRS
provisioner: file.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: Immediate

Driver matches an existing storage account:

I0830 11:43:02.734943       1 utils.go:101] GRPC call: /csi.v1.Controller/CreateVolume
I0830 11:43:02.734967       1 utils.go:102] GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-f3da48fa-7952-4918-aedd-67c1c814b6e2","parameters":{"csi.storage.k8s.io/pv/name":"pvc-f3da48fa-7952-4918-aedd-67c1c814b6e2","csi.storage.k8s.io/pvc/name":"pvc-1","csi.storage.k8s.io/pvc/namespace":"default","networkEndpointType":"","skuName":"Standard_LRS"},"volume_capabilities":[{"AccessType":{"Mount":{"mount_flags":["mfsymlinks","cache=strict","nosharesock","actimeo=30"]}},"access_mode":{"mode":7}}]}
I0830 11:43:03.097163       1 controllerserver.go:561] begin to create file share(pvc-f3da48fa-7952-4918-aedd-67c1c814b6e2) on account(imageregistryrbednah6mbs) type(Standard_LRS) subID() rg(rbednar-az-01-8z785-rg) location() size(1) protocol(SMB)
I0830 11:43:03.383519       1 controllerserver.go:593] create file share pvc-f3da48fa-7952-4918-aedd-67c1c814b6e2 on storage account imageregistryrbednah6mbs successfully
I0830 11:43:03.399781       1 controllerserver.go:639] store account key to k8s secret(azure-storage-account-imageregistryrbednah6mbs-secret) in default namespace
I0830 11:43:03.399821       1 utils.go:108] GRPC response: {"volume":{"capacity_bytes":1073741824,"volume_context":{"csi.storage.k8s.io/pv/name":"pvc-f3da48fa-7952-4918-aedd-67c1c814b6e2","csi.storage.k8s.io/pvc/name":"pvc-1","csi.storage.k8s.io/pvc/namespace":"default","networkEndpointType":"","secretnamespace":"default","skuName":"Standard_LRS"},"volume_id":"rbednar-az-01-8z785-rg#imageregistryrbednah6mbs#pvc-f3da48fa-7952-4918-aedd-67c1c814b6e2###default"}}

The storage account matched is private:

$ az storage account show --name imageregistryrbednah6mbs --resource-group rbednar-az-01-8z785-rg --query "publicNetworkAccess" --output json
"Disabled"

With this configuration, worker nodes cannot mount any newly provisioned volumes because there is no private endpoint configured for the worker subnet:

I0830 11:51:25.588693       1 nodeserver.go:326] cifsMountPath(/var/lib/kubelet/plugins/kubernetes.io/csi/file.csi.azure.com/ff63a5ad356eed4e6c55d99793032daf932997f18546669ac219a9df4571a26f/globalmount) fstype() volumeID(rbednar-az-01-8z785-rg#imageregistryrbednah6mbs#pvc-f3da48fa-7952-4918-aedd-67c1c814b6e2###default) context(map[csi.storage.k8s.io/pv/name:pvc-f3da48fa-7952-4918-aedd-67c1c814b6e2 csi.storage.k8s.io/pvc/name:pvc-1 csi.storage.k8s.io/pvc/namespace:default networkEndpointType: secretnamespace:default skuName:Standard_LRS storage.kubernetes.io/csiProvisionerIdentity:1725012839930-7153-file.csi.azure.com]) mountflags([mfsymlinks cache=strict nosharesock actimeo=30]) mountOptions([mfsymlinks cache=strict nosharesock actimeo=30 file_mode=0777 dir_mode=0777]) volumeMountGroup()
E0830 11:51:25.606841       1 mount_linux.go:230] Mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t cifs -o mfsymlinks,cache=strict,nosharesock,actimeo=30,file_mode=0777,dir_mode=0777,<masked> //imageregistryrbednah6mbs.file.core.windows.net/pvc-f3da48fa-7952-4918-aedd-67c1c814b6e2 /var/lib/kubelet/plugins/kubernetes.io/csi/file.csi.azure.com/ff63a5ad356eed4e6c55d99793032daf932997f18546669ac219a9df4571a26f/globalmount
Output: mount error(13): Permission denied
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs) and kernel log messages (dmesg)

E0830 11:51:25.606924       1 utils.go:106] GRPC error: rpc error: code = Internal desc = volume(rbednar-az-01-8z785-rg#imageregistryrbednah6mbs#pvc-f3da48fa-7952-4918-aedd-67c1c814b6e2###default) mount //imageregistryrbednah6mbs.file.core.windows.net/pvc-f3da48fa-7952-4918-aedd-67c1c814b6e2 on /var/lib/kubelet/plugins/kubernetes.io/csi/file.csi.azure.com/ff63a5ad356eed4e6c55d99793032daf932997f18546669ac219a9df4571a26f/globalmount failed with mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t cifs -o mfsymlinks,cache=strict,nosharesock,actimeo=30,file_mode=0777,dir_mode=0777,<masked> //imageregistryrbednah6mbs.file.core.windows.net/pvc-f3da48fa-7952-4918-aedd-67c1c814b6e2 /var/lib/kubelet/plugins/kubernetes.io/csi/file.csi.azure.com/ff63a5ad356eed4e6c55d99793032daf932997f18546669ac219a9df4571a26f/globalmount
Output: mount error(13): Permission denied
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs) and kernel log messages (dmesg)

Please refer to http://aka.ms/filemounterror for possible causes and solutions for mount errors.

NOTE: If I create an SC with networkEndpointType=privateEndpoint and provision a volume with it, the driver will configure the private endpoint correctly on the existing storage account, which fixes the mount issue for any subsequent mounts that reuse this storage account. This can serve as a temporary workaround.
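For reference, a minimal sketch of such a workaround StorageClass (the SC name is hypothetical; mount options and SKU mirror the SC above):

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-private   # hypothetical name
mountOptions:
- mfsymlinks
- cache=strict
- nosharesock
- actimeo=30
parameters:
  # explicitly request private endpoint handling, so the driver
  # configures a private endpoint for the matched/created account
  networkEndpointType: privateEndpoint
  skuName: Standard_LRS
provisioner: file.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: Immediate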

Followup question: We have the above-mentioned configuration in OpenShift private clusters, where the registry operator creates its own storage resources, including a (private) storage account, and CSI provisioning happens after that. Even if this issue is resolved, the driver can still reuse "foreign" storage accounts in similar scenarios, and we're not sure that's a good practice. We would like to introduce a fix for this as well later. Is there a recommended way to ensure the driver will never reuse a storage account it did not create? Could the storageaccount or createaccount parameter be a sufficient and safe solution?
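For illustration, a hedged sketch of pinning provisioning to one dedicated account via the driver's storageAccount parameter (the account name is hypothetical, and whether this fully rules out reusing foreign accounts is exactly the open question above):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-pinned        # hypothetical name
parameters:
  skuName: Standard_LRS
  storageAccount: csidedicated01    # hypothetical dedicated account; all shares land in this one account
provisioner: file.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: Immediate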

RomanBednar avatar Aug 30 '24 12:08 RomanBednar

If networkEndpointType is unset, that means the CSI driver would match any account, with or without networkEndpointType; if you only want to match networkEndpointType: privateEndpoint accounts, then you need to specify this parameter in the storage class. @RomanBednar

andyzhangx avatar Sep 01 '24 09:09 andyzhangx

if networkEndpointType is unset, that means the CSI driver would match any account, with or without networkEndpointType

@andyzhangx This is the actual issue, simplified. With this behavior we cannot tell the driver to "avoid" private endpoint accounts. Either we request a private one specifically (predictable) or leave the parameter unset, in which case the driver can match either private or public accounts (unpredictable). What we need is a way to reliably not match private endpoint accounts, making the behavior predictable.

RomanBednar avatar Sep 02 '24 09:09 RomanBednar

@andyzhangx Do you think this could be considered a bug? We're looking for a fix that we could ship soon, and it depends on the decision made for this issue.

I've found a fix that might work for now: if we set the parameters to include a tag and require matching that tag, the driver will always create its own storage account when the first volume is provisioned and then reuse it (unless there's an existing account with a matching tag, which is unlikely yet still possible). With this we should be almost certain we don't match "foreign" storage accounts, right? Or is it a bad idea?
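A minimal sketch of that tag-based StorageClass, assuming the driver's documented tags and matchTags parameters (the tag key and value are hypothetical):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-tagged          # hypothetical name
parameters:
  skuName: Standard_LRS
  tags: created-by=azurefile-csi      # hypothetical tag applied to accounts the driver creates
  matchTags: "true"                   # only match accounts whose tags match the ones above
provisioner: file.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: Immediate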

RomanBednar avatar Sep 09 '24 11:09 RomanBednar

if networkEndpointType is unset, that means the CSI driver would match any account, with or without networkEndpointType

I think that's a bug. It should match a usable account, not just any account. A private account + a StorageClass with networkEndpointType: "" leads to a PV that cannot be mounted today. Maybe there is another storage account that does not require private endpoints and would fit that CreateVolume call better.

jsafrane avatar Sep 09 '24 15:09 jsafrane

if networkEndpointType is unset, that means the CSI driver would match any account, with or without networkEndpointType

I think that's a bug. It should match a usable account, not just any account. A private account + a StorageClass with networkEndpointType: "" leads to a PV that cannot be mounted today. Maybe there is another storage account that does not require private endpoints and would fit that CreateVolume call better.

@jsafrane The CSI driver does not know whether the account is usable or not; that depends on the application pods, i.e. whether they run on a node in an allowed vnet. There is a parameter mapping rule: in general, if a value is not specified, the driver simply skips checking the corresponding property.

@RomanBednar I think the best way to solve this issue is specifying networkEndpointType: privateEndpoint in the SC, though the tag approach also works.

andyzhangx avatar Sep 10 '24 08:09 andyzhangx

@andyzhangx I've tested that, and it rather masks the problem while posing a few others: if we set networkEndpointType: privateEndpoint, the driver will always configure a private endpoint on a new storage account. And if there already is one that matched, the driver will update it with a private endpoint - but we don't know which account gets updated.

Moreover, "public" storage accounts used by the CSI driver should still be able to exist alongside arbitrary "private" ones (a supported setup), and with this solution that would not be possible.

RomanBednar avatar Sep 11 '24 09:09 RomanBednar

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 10 '24 10:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 09 '25 10:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Feb 08 '25 10:02 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Feb 08 '25 10:02 k8s-ci-robot