azure-cli icon indicating copy to clipboard operation
azure-cli copied to clipboard

CONNECTIVITY_REFRESH_TOKEN_ERROR when using `az acr` with AzureCli@2

Open nosalan opened this issue 2 years ago • 11 comments

Related command

az acr repository update --name $TargetAcr `
    --image "$ArtifactToPromote" `
    --write-enabled false `
    --delete-enabled false

Describe the bug

We have a script that contains a bunch of az acr import and az acr repository update commands. It is a powershell core script, which itself is running in the Azure DevOps pipeline, inside AzureCli@2 task. The script is basically simple script to import images to ACR and lock them. It works fine until following error appears sometimes(!) near the end of the script:

WARNING: Unable to get AAD authorization tokens with message: 2022-12-12 16:05:11.564324 An error occurred: CONNECTIVITY_REFRESH_TOKEN_ERROR
Access to registry '<redacted>.azurecr.io' was denied. Response code: 429. Please try running 'az login' again to refresh permissions.

After this error, basically nothing can be done but running the build again. Retry does not help because az has lost the service principal context, which was given by AzureCli@2 task.

To Reproduce Execute a script that contains a bunch (about 20) of az acr import and az acr repository update commands ran sequentially.

Expected behavior Assuming that this error comes from ACR throttling (deducted from 429 status code) a retry of az acr repository update within the script should work but it doesn't. The 'az cli' demands to run az login again, but that is not possible within AzureCli@2 task. Running of az login is not an option because the script is invoked inside of AzureCli@2 task which has a service principal context assigned through a service connection.

Environment summary Latest Azure Devops ubuntu agent AzureCli@2

Additional context

nosalan avatar Dec 12 '22 16:12 nosalan

route to CXP team

yonzhan avatar Dec 12 '22 23:12 yonzhan

Assuming that this error comes from ACR throttling (deducted from 429 status code) a retry of az acr repository update within the script should work but it doesn't.

Hi @nosalan, the 429 error indicates that the registry is being throttled. az login is not needed.

Throttling could occur temporarily when you generate a burst of image pull or push operations in a very short period, even when the average rate of read and write operations is within registry limits. Instead of having the script to retry immediately, you may need to implement retry logic with some backoff in your script or reduce the maximum rate of requests to the registry.

The limits for different registry tiers can be found at Service tier features and limits.

Wwwsylvia avatar Dec 16 '22 07:12 Wwwsylvia

Thank you @Wwwsylvia but the error indicate something with authentication/permissions. Seems like receiving 429 from ACR triggers some authentication/permissions issues in az cli. Full error

Access to registry '<redacted>.azurecr.io' was denied. Response code: 429. Please try running 'az login' again to refresh permissions.
WARNING: Unable to get admin user credentials with message: Admin user is disabled.
ERROR: Unable to authenticate using AAD or admin login credentials. Please specify both username and password in non-interactive mode.

nosalan avatar Dec 21 '22 13:12 nosalan

@nosalan Apologies for the late reply. Thanks for reaching out to us and reporting this issue. I agree with the @Wwwsylvia comments. Adding a few more pointers here if that helps:

If there was an issue with access token or permissions the response code would have been 403. However, in your case it was 429 error which is for the Throttling. In such cases immediate retries will not help. Doing aggressive retries usually make the problem worse. For this reason, you should use an exponential back off. For example, your application may retry after 2 seconds, then 4 seconds, then 10 seconds, then 30 seconds. This behavior results in your application significantly reducing its load on the Azure service rather than exacerbating any problems.

Troubleshooting the CONNECTIVITY_REFRESH_TOKEN_ERROR issue is explained here. Troubleshooting the HTTP 429 error is explained here.

navba-MSFT avatar Dec 23 '22 03:12 navba-MSFT

@nosalan I wanted to do quick follow-up to check if you had a chance to look at my above comment. Please let us know if you had any updates on this. Awaiting your reply.

navba-MSFT avatar Dec 27 '22 04:12 navba-MSFT

@navba-MSFT Yes and thank you. I'm ok with adding the back-off based retries. The status code 429 explains the problem. I would only suggest to update the error message to something like Too many requests, please retry after XX seconds, instead of this one Please try running 'az login' again to refresh permissions. WARNING: Unable to get admin user credentials with message: Admin user is disabled. ERROR: Unable to authenticate using AAD or admin login credentials. Please specify both username and password in non-interactive mode.

nosalan avatar Dec 27 '22 15:12 nosalan

@jiasli Could you please let us know if the above ask to update the error message is feasible ? Awaiting your reply.

navba-MSFT avatar Dec 28 '22 04:12 navba-MSFT

We also received an internal email regarding a similar issue: Azure CLI + Azure Container Registry error

The error message in that email is:

This operation will delete the manifest 'sha256:xxx' and all the following images: 'xxx'
Unable to get AAD authorization tokens with message: 2022-12-25 16:54:47.327933 An error occurred: CONNECTIVITY_REFRESH_TOKEN_ERROR
Access to registry 'xxx.azurecr.io' was denied. Response code: 401. Please try running 'az login' again to refresh permissions.

but the response code is 401, instead of 429.

The email also mentions:

But the thing is that it was actually a transient error, my credentials were ok and when I retried the command (without running az login) it worked correctly.

so this seems to be some permission/propagation/throttling issue with ACR.

The error message Please try running 'az login' again to refresh permissions. is thrown by ACR, not Azure CLI core: https://github.com/Azure/azure-cli/blob/69e79fd69b81b28f077c365ed9502d27342ae7f7/src/azure-cli/azure/cli/command_modules/acr/_errors.py#L115-L118

Since retrying the command works without running az login, the solution would be

  • ACR command retries the HTTP request
  • The script containing az acr command retries the command

az login will only update the refresh token and access token for ARM, not ACR’s access token, so I am not sure running az login will help here. ACR command module may need to update the error message to make it more accurate and less misleading.

jiasli avatar Jan 04 '23 08:01 jiasli

The error message Please try running 'az login' again to refresh permissions. is thrown by ACR, not Azure CLI core:

https://github.com/Azure/azure-cli/blob/69e79fd69b81b28f077c365ed9502d27342ae7f7/src/azure-cli/azure/cli/command_modules/acr/_errors.py#L115-L118

Since retrying the command works without running az login, the solution would be

  • ACR command retries the HTTP request
  • The script containing az acr command retries the command

az login will only update the refresh token and access token for ARM, not ACR’s access token, so I am not sure running az login will help here. ACR command module may need to update the error message to make it more accurate and less misleading.

Yes we may need to review the error CONNECTIVITY_REFRESH_TOKEN_ERROR and fix the error message to avoid confusion. /cc @northtyphoon @luisdlp for awareness

Wwwsylvia avatar Jan 05 '23 07:01 Wwwsylvia

@Wwwsylvia @northtyphoon @luisdlp Any updates on this ?

navba-MSFT avatar Jan 10 '23 09:01 navba-MSFT

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost avatar Jan 17 '23 11:01 ghost

It has a bug label, @msftbot should be closing this.

nosalan avatar Jan 31 '23 17:01 nosalan

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @toddysm, @luisdlp, @northtyphoon.

Issue Details

Related command

az acr repository update --name $TargetAcr `
    --image "$ArtifactToPromote" `
    --write-enabled false `
    --delete-enabled false

Describe the bug

We have a script that contains a bunch of az acr import and az acr repository update commands. It is a powershell core script, which itself is running in the Azure DevOps pipeline, inside AzureCli@2 task. The script is basically simple script to import images to ACR and lock them. It works fine until following error appears sometimes(!) near the end of the script:

WARNING: Unable to get AAD authorization tokens with message: 2022-12-12 16:05:11.564324 An error occurred: CONNECTIVITY_REFRESH_TOKEN_ERROR
Access to registry '<redacted>.azurecr.io' was denied. Response code: 429. Please try running 'az login' again to refresh permissions.

After this error, basically nothing can be done but running the build again. Retry does not help because az has lost the service principal context, which was given by AzureCli@2 task.

To Reproduce Execute a script that contains a bunch (about 20) of az acr import and az acr repository update commands ran sequentially.

Expected behavior Assuming that this error comes from ACR throttling (deducted from 429 status code) a retry of az acr repository update within the script should work but it doesn't. The 'az cli' demands to run az login again, but that is not possible within AzureCli@2 task. Running of az login is not an option because the script is invoked inside of AzureCli@2 task which has a service principal context assigned through a service connection.

Environment summary Latest Azure Devops ubuntu agent AzureCli@2

Additional context

Author: nosalan
Assignees: navba-MSFT
Labels:

bug, Service Attention, Container Registry, customer-reported

Milestone: Backlog

ghost avatar Feb 01 '23 04:02 ghost

Thanks @nosalan . I have reopened this issue.

@Wwwsylvia @northtyphoon @luisdlp Any updates on this ?

navba-MSFT avatar Feb 01 '23 04:02 navba-MSFT

Hi @navba-MSFT and @nosalan , one of our engineers is actively working on this. We will keep you updated.

Wwwsylvia avatar Feb 08 '23 11:02 Wwwsylvia

See: https://github.com/Azure/azure-cli/issues/11891.

clemlesne avatar Jan 02 '24 16:01 clemlesne

Context

Note, after digging into the source code of azure-cli, azure-sdk-for-python and msrest-for-python, it seems that:

  • Some Azure services are implemented in azure-sdk-for-python or their own extension (= dedicated Git project)
  • Implementations either manage their HTTP backend with azure-sdk-for-python (new) or msrest-for-python (deprecated but still used)

azure-sdk-for-python

In azure-sdk-for-python, most of the implementations seems to catch 429 Too Many Requests errors.

msrest-for-python

In msrest-for-python, a default retry policy is well-used, implemented like:

  • Retry mechanism it managed by urllib3 lib, with the Retry object
  • Configured to happen:
    • 3 retires times maximum
    • for calls of type ['HEAD', 'TRACE', 'GET', 'PUT', 'OPTIONS', 'DELETE', 'POST', 'PATCH']
    • only for status codes [408, 500, ..., 998]

Default urllib3 retry status codes are [413, 429, 503]. Thus, we can see that retries on 413 Content Too Large and 429 Too Many Requests status codes have been deliberately removed by the implementation and [501, 502, 504, ..., 998] added.

Thus, I have pushed a PR fixing this behavior: https://github.com/Azure/msrest-for-python/pull/260.

clemlesne avatar Jan 02 '24 17:01 clemlesne