
Don't use storage account key for object storage if SP or MI is provided

Open yvespp opened this issue 3 years ago • 31 comments

Implements https://github.com/vmware-tanzu/velero/issues/4267

By default, the plugin no longer uses the storage account access key to access blobs. The key is only used if it is explicitly configured via AZURE_STORAGE_ACCOUNT_ACCESS_KEY. Therefore, "Allow storage account key access" can be disabled on the storage account when using MI or SP.
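A minimal sketch of that decision (illustrative only, not the plugin's actual code; the return values are placeholder labels):

```go
package main

import (
	"fmt"
	"os"
)

// chooseAuthMode sketches the behaviour described above: shared-key auth is
// used only when the key is explicitly configured; otherwise the SP/MI
// credential is used together with a user delegation SAS.
func chooseAuthMode(accountKey string) string {
	if accountKey != "" {
		return "shared-key" // key explicitly configured, e.g. via a secret
	}
	return "aad-delegation-sas" // authenticate with the SP/MI credential
}

func main() {
	// In the plugin the key would come from AZURE_STORAGE_ACCOUNT_ACCESS_KEY.
	fmt.Println(chooseAuthMode(os.Getenv("AZURE_STORAGE_ACCOUNT_ACCESS_KEY")))
}
```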

Changes:

  • Switched object_store to the newer azblob client
  • When the storage account access key is not provided, a user delegation SAS is used instead of a service SAS.
    • Delegation SAS isn't implemented in the azblob client yet, so I implemented a simple version in delegationsas.go. This code can be deleted once it's implemented in azblob.
    • resourceGroupConfigKey and subscriptionIDConfigKey are no longer needed for the object store
  • Added an integration test for the object_storage
    • Only runs when the tag integration is specified e.g. go test ./... -tags=integration -v
    • The storage account for the test has to be created separately.
  • Updated README
  • Updated deps
    • go 1.17
    • go dependencies
    • busybox image

yvespp avatar Nov 18 '21 12:11 yvespp

I downgraded it to 1.16 and also added a Go setup step to pr.yaml; it seems it was using the preinstalled version before.

yvespp avatar Nov 19 '21 06:11 yvespp

Hi!

Does someone have time to review this? BTW: is there a way to run integration tests against Azure in this repo?

Thanks, Yves

yvespp avatar Nov 26 '21 12:11 yvespp

Rebased to master. Any chance this can be merged?

yvespp avatar Feb 16 '22 12:02 yvespp

This is exactly what we need as well and makes a lot more sense than dealing with access keys when they aren't necessary.

nxf5025 avatar Feb 17 '22 22:02 nxf5025

Codecov Report

Merging #111 (e7efb9e) into main (73d1530) will increase coverage by 1.29%. The diff coverage is 7.00%.

@@            Coverage Diff             @@
##             main     #111      +/-   ##
==========================================
+ Coverage   12.77%   14.07%   +1.29%     
==========================================
  Files           4        4              
  Lines         626      668      +42     
==========================================
+ Hits           80       94      +14     
- Misses        542      571      +29     
+ Partials        4        3       -1     
Impacted Files                                         Coverage Δ
velero-plugin-for-microsoft-azure/object_store.go       2.38% <0.54%>  (-0.27%) ↓
...o-plugin-for-microsoft-azure/volume_snapshotter.go  23.68% <4.44%>  (+0.28%) ↑
velero-plugin-for-microsoft-azure/common.go            24.59% <55.55%> (+24.59%) ↑


codecov-commenter avatar Jun 29 '22 11:06 codecov-commenter

Does this PR also address this issue? https://github.com/vmware-tanzu/velero/issues/4930

vzabawski avatar Jul 14 '22 11:07 vzabawski

Does this PR also address this issue? vmware-tanzu/velero#4930

Yes, ListKeys will be removed with this PR.

yvespp avatar Jul 14 '22 14:07 yvespp

Awesome! Thank you for your contribution.

vzabawski avatar Jul 15 '22 07:07 vzabawski

Reliance on Azure storage account shared keys is preventing my org from using Velero; it would be great to see this merged :heart: @ywk253100 @eleanor-millman

chrisob avatar Jul 28 '22 12:07 chrisob

Hi @yvespp, I was trying to test out this PR and there was one basic flow I was not able to validate: I created a BSL with incorrect parameters (the storage account didn't exist), but the BSL shows as available. Possibly some underlying layers are not returning errors? The logs also did not help much with debugging.

When I created a backup, the following error came in the Backup Status. The BSL was still showing as available

completionTimestamp: "2022-09-08T05:12:03Z"
expiration: "2022-10-08T05:12:03Z"
failureReason: |
  error checking if backup already exists in object storage: rpc error: code = Unknown desc = ===== RESPONSE ERROR (ErrorCode=AuthorizationPermissionMismatch) =====
  Description=, Details: (none)
formatVersion: 1.1.0
phase: Failed
startTimestamp: "2022-09-08T05:12:03Z"
version: 1

anshulahuja98 avatar Sep 08 '22 05:09 anshulahuja98

@yvespp Sorry for the late review.

I have added some comments but haven't reviewed the delegationsas.go yet and I will continue to add comments.

ywk253100 avatar Sep 09 '22 09:09 ywk253100

Hi! Would be really great if this could be merged sometime soon :) @sseago @ywk253100 @dsu-igeek

mimmi-gjems avatar Sep 22 '22 09:09 mimmi-gjems

I second that. This PR will allow me to start using Velero.

vzabawski avatar Sep 22 '22 11:09 vzabawski

Hi @yvespp, I was trying to test out this PR and there was one basic flow I was not able to validate: I created a BSL with incorrect parameters (the storage account didn't exist), but the BSL shows as available. Possibly some underlying layers are not returning errors? The logs also did not help much with debugging.

When I created a backup, the following error came in the Backup Status. The BSL was still showing as available

completionTimestamp: "2022-09-08T05:12:03Z"
expiration: "2022-10-08T05:12:03Z"
failureReason: |
  error checking if backup already exists in object storage: rpc error: code = Unknown desc = ===== RESPONSE ERROR (ErrorCode=AuthorizationPermissionMismatch) =====
  Description=, Details: (none)
formatVersion: 1.1.0
phase: Failed
startTimestamp: "2022-09-08T05:12:03Z"
version: 1

Hi @anshulahuja98, do you know how the backup location is verified? I was looking through the code but couldn't find it.

yvespp avatar Sep 26 '22 07:09 yvespp

@anshulahuja98 the BSL verification should work correctly now. ListCommonPrefixes and ListObjects were returning the wrong error. However, the error is a bit confusing, as a 401 (RESPONSE 401: 401 Server failed to authenticate the request) is returned instead of a 404.

yvespp avatar Sep 26 '22 09:09 yvespp

@anshulahuja98 the BSL verification should work correctly now. ListCommonPrefixes and ListObjects were returning the wrong error. However, the error is a bit confusing, as a 401 (RESPONSE 401: 401 Server failed to authenticate the request) is returned instead of a 404.

Great that BSL validation would be fixed.

With regards to the error: yes, it has become much more confusing now. Earlier it was easier to extract information from the error message: for permission errors, the client ID and the target principal ID were explicitly mentioned, which made fixing permissions straightforward. I am assuming that with this route we might not be able to get such verbose error messages.

Would be great if we could still find some way to make it verbose.

anshulahuja98 avatar Sep 28 '22 04:09 anshulahuja98

@yvespp The object store uses the new version of Azure SDK to do the auth and operations in this PR, but the snapshotter still uses the old one. Is it possible to update the snapshotter to use the new SDK as well?

ywk253100 avatar Sep 30 '22 08:09 ywk253100

@yvespp The object store uses the new version of Azure SDK to do the auth and operations in this PR, but the snapshotter still uses the old one. Is it possible to update the snapshotter to use the new SDK as well?

I have now moved the snapshotter to the new SDK, and go-autorest was completely removed. I couldn't test this, so it would be good if someone else could verify that it works.

yvespp avatar Oct 03 '22 16:10 yvespp

@yvespp I have finished reviewing, thanks very much for the contribution. I'll trigger end-to-end testing of the new plugin and will let you know the result.

ywk253100 avatar Oct 10 '22 09:10 ywk253100

The BSL validation still doesn't work with the error:

RESPONSE 403: 403 This request is not authorized to perform this operation using this permission.
ERROR CODE: AuthorizationPermissionMismatch

ywk253100 avatar Oct 11 '22 12:10 ywk253100

And I found another issue: since we change the permissions needed for the customized role, this introduces a breaking change for users who use a customized role.

ywk253100 avatar Oct 11 '22 12:10 ywk253100

And I found another issue: since we change the permissions needed for the customized role, this introduces a breaking change for users who use a customized role.

It will be a breaking change even if a custom role isn't used. To write to blob storage using MI or SP, the Storage Blob Data Contributor role is needed; Contributor isn't enough. I added a changelog to document this. In the last commit I also updated the docs with the required roles.

yvespp avatar Oct 12 '22 11:10 yvespp

So this is the reason that I got the AuthorizationPermissionMismatch error, as I still use the old credential with only the Contributor role.

I'm afraid this breaking change isn't acceptable, as it impacts all upgrading users.

We should reconsider the changes, and as we're very close to the 1.10 FC date, this PR may be out of scope for 1.10 if we have no better solution.

ywk253100 avatar Oct 13 '22 01:10 ywk253100

So this is the reason that I got the AuthorizationPermissionMismatch error, as I still use the old credential with only the Contributor role.

I'm afraid this breaking change isn't acceptable, as it impacts all upgrading users.

We should reconsider the changes, and as we're very close to the 1.10 FC date, this PR may be out of scope for 1.10 if we have no better solution.

One option would be to fall back to fetching the shared key if accessing the blobs with the default credential fails. But because the plugin doesn't know the container at init time to test access, the client creation has to be delayed until the object store is used. I'll see if I can implement that without making the code too complicated.

I think it would still make sense to clean up the permissions at some point and remove this compatibility code. My question is: when would such a breaking change be allowed?

yvespp avatar Oct 13 '22 06:10 yvespp

So this is the reason that I got the AuthorizationPermissionMismatch error, as I still use the old credential with only the Contributor role. I'm afraid this breaking change isn't acceptable, as it impacts all upgrading users. We should reconsider the changes, and as we're very close to the 1.10 FC date, this PR may be out of scope for 1.10 if we have no better solution.

One option would be to fall back to fetching the shared key if accessing the blobs with the default credential fails. But because the plugin doesn't know the container at init time to test access, the client creation has to be delayed until the object store is used. I'll see if I can implement that without making the code too complicated.

I think it would still make sense to clean up the permissions at some point and remove this compatibility code. My question is: when would such a breaking change be allowed?

I went a different route. I added an optional config parameter disableStorageAccountKeyFetch, which is set to "false" by default. If it is "false", the object store works as before: the SP or MI is only used to fetch the storage account key, which is then used to access the storage account. This requires no additional permissions and won't break existing users. If it is set to "true", the storage account key won't be used, but new permissions are required.

I documented the change with the recommendation to set disableStorageAccountKeyFetch to "true", so new users will not use the old behaviour. In a future release, the old behaviour and the option can be removed.

yvespp avatar Oct 13 '22 18:10 yvespp

@yvespp Thanks for the quick update. I think we need a new major version (v2.x) to introduce such kinds of breaking changes, so it may be a long time before we remove the backward-compatibility code.

I'll take a look at the new changes and test them in my local env and run the E2E test, will let you know the result.

ywk253100 avatar Oct 14 '22 01:10 ywk253100

@yvespp I'm a little confused about the Azure roles. The Contributor role has almost all the permissions, so why do we still need the Storage Blob Data Contributor role? Is there any Azure doc for this part?

ywk253100 avatar Oct 14 '22 01:10 ywk253100

@yvespp I'm a little confused about the Azure roles. The Contributor role has almost all the permissions, so why do we still need the Storage Blob Data Contributor role? Is there any Azure doc for this part?

The clearest doc about this I found is here: Authorize access to blobs using Azure Active Directory

Only roles explicitly defined for data access permit a security principal to access blob data. Built-in roles such as Owner, Contributor, and Storage Account Contributor permit a security principal to manage a storage account, but do not provide access to the blob data within that account via Azure AD. However, if a role includes Microsoft.Storage/storageAccounts/listKeys/action, then a user to whom that role is assigned can access data in the storage account via Shared Key authorization with the account access keys. For more information, see Choose how to authorize access to blob data in the Azure portal.

I think about it this way: Contributor is for the control plane and Storage Blob Data Contributor for the data plane.

yvespp avatar Oct 14 '22 06:10 yvespp

I removed disableStorageAccountKeyFetch again; the storage account access key is now only used if subscriptionId and resourceGroup are set. I think this is easier to configure, as users don't need to add the disableStorageAccountKeyFetch parameter now and then remove it again in a future release.
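A sketch of that final behaviour (illustrative only, not the actual plugin code):

```go
package main

import "fmt"

// useStorageAccountKey mirrors the behaviour described above: the storage
// account key is fetched and used only when both subscriptionId and
// resourceGroup are configured in the BSL; otherwise the SP/MI credential
// is used directly with a user delegation SAS.
func useStorageAccountKey(subscriptionID, resourceGroup string) bool {
	return subscriptionID != "" && resourceGroup != ""
}

func main() {
	fmt.Println(useStorageAccountKey("00000000-0000-0000-0000-000000000000", "velero-rg")) // true: old key-based behaviour
	fmt.Println(useStorageAccountKey("", ""))                                              // false: SP/MI used directly
}
```

This keeps existing BSL configurations (which set both fields) on the old, key-based path, while configurations that omit them get the new key-less behaviour.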

yvespp avatar Oct 15 '22 14:10 yvespp

@yvespp Thanks for the quick update.

I'm working on some other priorities and have no bandwidth to review the new changes in the next few days, so it cannot meet the FC date (early next week). I'll come back to review it again before the RC, but as the changes introduced by this PR are very fundamental, it's better to run enough tests for it. If we cannot do enough review and testing before the RC, I'd like to move it to the 1.11 release and have it merged early in the 1.11 cycle.

ywk253100 avatar Oct 19 '22 08:10 ywk253100