DistributedLock
DistributedLock copied to clipboard
Transient error leads to unhandled 409 when releasing an Azure lease
Describe the bug
We encounter an unhandled exception thrown by Release
or ReleaseAsync
method when transient errors happen. The exception is thrown from the Azure SDK API Release
inside ReleaseAsync
with an error code 409 (LeaseIdMismatchWithLeaseOperation):
...
this._blobLeaseClient.Release();
return default;
...
When releasing the lease, transient errors (e.g., timeout or package dropped) may happen during the transmission of the response. Release SDK API may retry releasing the lease multiple times until the lease is successfully released. This retry mechanism can lead to a 409 error even if the lease has already been released. One concrete scenario could be:
- BlobLeaseClientWrapper.ReleaseAsync sends an HTTP request to release the lease first;
- The lease is released by this first try, but a timeout happens, so the HTTP response never reaches the DistributedLock;
- The SDK API
Release
will automatically retry the release operation due to the timeout; - The second try fails and encounters a 409 LeaseIdMismatchWithLeaseOperation error since the lease has already been released by the first request;
- The Azure SDK throws this 409 exception to DistributedLock, and
ReleaseAsync
in DistributedLock would not catch it (because there is no try-catch block inReleaseAsync
).
To reproduce
When the first release request sent by Release
in ReleaseAsync
succeeds in remote but its response never reaches the client, LeaseIdMismatchWithLeaseOperation will appear in the response of the second release request, which results in an unhandled exception.
Expected behavior & Possible Solution
Add a catch block of ReleaseAsync that catches the RequestFailedException and check whether it is a 409 response. If it is, the ReleaseAsync may need to handle the exception e.g., ignore it and print some logs.
Additional Context
We are willing to submit a PR to fix it.
@rmt2021 thanks for filing and for the detailed report!
If I understand correctly, the retry behavior happens within the Azure SDK and results in these spurious 409 errors, Your proposal is to swallow the 409s.
One concern I have here is that if we make this change we lose the ability to detect the difference between this case and misuse of the API. Perhaps that's just a pill we have to swallow, but I wonder if this issue could be fixed internally within the Azure SDK's retry functionality? It seems kind of silly for them to do a retry on failure and then throw when the retry "succeeds".
- Do I understand the issue and proposed solution correctly?
- Have you filed anything with Azure SDK for this? What did they say?
- Is there something that should mitigate my concern about swallowing real errors with this change?
@rmt2021 I've attempted to file this as a bug report in Azure SDK as a starting point. If you could add any additional details there that would be helpful. What they say and whether they are willing to fix will determine what DistributedLock can/should do about this.
Sorry for the late reply @madelson.
Do I understand the issue and proposed solution correctly?
Yes!
Have you filed anything with Azure SDK for this? What did they say?
Actually I have ask them here: https://github.com/Azure/azure-sdk-for-net/issues/29673, but no one has replied yet.
Is there something that should mitigate my concern about swallowing real errors with this change?
Your concern indeed makes a lot of sense. Another way is to check whether the lease exists before releasing it, but I don't know how to achieve this in an atomic manner, and there is no API like ReleaseIfExists
.
Thanks for creating a bug report in Azure SDK!