azure-sdk-for-net
[Feature Request] Consider allowing the ability to opt out of distributed tracing for expected HTTP error responses
Summary
In some situations, service calls return an HTTP response code in the range normally interpreted as an exception though that response is an expected and normal scenario to the consumer of the associated client library. For example, when using an If-Match constraint with the Storage client library when updating metadata for a blob, a mismatch in the ETag results in an HTTP 412 (Precondition Failed) response.
While this status code is semantically correct, it is also considered an expected and normal scenario to some callers. For example, the Event Processor client library will attempt to claim ownership of a partition by making a conditional request. Since multiple processor instances are competing for ownership, it is expected that the conditional request will not succeed for the majority of them. In this case, the Event Hubs client handles the resulting storage exception and considers it as part of the normal application path.
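To make the scenario concrete, here is a minimal sketch of such a conditional update (not taken from the Event Processor source; the helper name, parameters, and metadata key are assumptions for illustration):

using System.Collections.Generic;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

// Hypothetical helper: attempt to claim ownership of a partition by updating
// blob metadata only if the blob's ETag still matches the last observed value.
static async Task<bool> TryClaimOwnershipAsync(BlobClient blobClient, ETag lastKnownETag, string processorId)
{
    try
    {
        var metadata = new Dictionary<string, string> { ["ownerid"] = processorId };
        var conditions = new BlobRequestConditions { IfMatch = lastKnownETag };

        // Succeeds only if no other processor has modified the blob in the meantime.
        await blobClient.SetMetadataAsync(metadata, conditions);
        return true;
    }
    catch (RequestFailedException ex) when (ex.Status == 412)
    {
        // Another processor won the race; the application treats this as normal,
        // but the Storage client's diagnostic scope still records a failure.
        return false;
    }
}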
However, within the Storage client, the diagnostics scope considers the request a failure and emits the associated exception data. When customers using the Event Hubs processor integrate with Application Insights, they see diagnostics information that indicates errors in their application, even though these entries do not represent actual problems and are just noise.
It may be worth considering a feature that allows consumers of a client library to declare expected response failures as such, so that error diagnostics are not emitted for them.
Related Items
- [BUG] Event Hub Processor Host - creating errors when checkpointing to blob (#9908)
//cc: @pakrym, @tg-msft
cc @lmolkova
The goal of distributed tracing is to provide observability. We trace both successful and unsuccessful requests, i.e. you can tell that checkpointing failed or took longer than expected because something was happening underneath. For example, if there is a storage outage or a configuration issue, you can see that it is not the Event Hubs SDK misbehaving or a problem in your app, but an issue with an external dependency, namely your storage call.
If you exclude it (do not log the call to storage), customers could be confused and wonder why the operation took so long and whether that time was spent in their process.
So I don't believe that simply not logging a call will help, but instead will bring other issues.
It seems the problem is that we should not consider 412 a failure (but should still log the call). It's complicated here because the component above (checkpointing) considers it a valid status code, while the component below (storage) considers it a failure.
There are very few controversial cases like that (401, 403, 404, 409, 412, 429, and perhaps that's it) where it's hard to tell whether it's an error or a success.
So the proposal I see is: let's tell storage that we expect 412 here and it's ok, so let's not mark it as failure.
Here is why I think it will not solve the problem:
Marking them as failures is actually Azure Monitor behavior (even without the Event Hubs SDK instrumentation, it will log HTTP calls to storage as failures). We have the same logic for any other HTTP request (incoming and outgoing), and even if Event Hubs and Storage change it, it will be inconsistent with the telemetry reported for the customer's own service.
Azure Monitor provides filters on the Application Map to filter out 4xx responses, and we certainly need to do more to cover such cases.
I created a draft PR to patch up Storage as suggested at #9944.
We need guidance from Azure Monitor here.
I believe this is the right issue even though the title could be updated for more general use cases.
In our situation, we make calls to get a blob that may or may not exist; if it does, we do something, and if it doesn't, we do something else. In this case, getting a 404 is a legitimate outcome. Dealing with this using an exception (sketched after the list below) means:
- We get error logs in Application Insights and in our log files; we can filter out Error and Warning logs, but then we also lose legitimate Azure.Core errors.
- While it's not major, we still "pay the price" of having exceptions thrown, even though it's a best practice not to throw exceptions for such cases (mostly for performance reasons). We could call ExistsAsync beforehand, but then we pay for a full round trip, and it still doesn't guarantee that the file isn't created or deleted in the meantime.
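For reference, a minimal sketch of the exception-based pattern described above (blobClient and the two handlers are placeholders):

using Azure;
using Azure.Storage.Blobs.Models;

// Download the blob if it exists; treat 404 as a normal outcome.
try
{
    BlobDownloadResult result = await blobClient.DownloadContentAsync();
    ProcessExistingBlob(result.Content);   // placeholder for "do something"
}
catch (RequestFailedException ex) when (ex.Status == 404)
{
    HandleMissingBlob();                   // placeholder for "do something else"
}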
409 Conflict is also an expected case; for example, with caching we might check whether an entry exists and, if not, generate the data and save it. If another instance saved it in the meantime, that's fine.
As @lmolkova said, it depends on the specific case, so I also think what's needed here is a way to specify expected status codes when doing the request, for example:
await blobClient.DownloadToAsync(
    stream,
    expectation: new BlobRequestExpectations
    {
        SuccessStatusCodes = new[]
        {
            StatusCodes.Status200OK,
            StatusCodes.Status404NotFound
        }
    }
);
In other words, I think this issue could be renamed to something more general like "Allowing the ability to specify successful status codes instead of throwing RequestFailedException"
So I don't believe that simply not logging a call will help, but instead will bring other issues.
I agree with this. It still needs to be logged, but ideally as a warning by default and not an error.
For example, when calling Azure.Storage.Files.Shares.ShareClient.CreateIfNotExistsAsync, I can see that the client is logging this as an error:
at Azure.Storage.Files.Shares.ShareRestClient.CreateAsync(Nullable`1 timeout, IDictionary`2 metadata, Nullable`1 quota, Nullable`1 accessTier, String enabledProtocols, Nullable`1 rootSquash, CancellationToken cancellationToken)
at Azure.Storage.Files.Shares.ShareClient.CreateInternal(IDictionary`2 metadata, Nullable`1 quotaInGB, Nullable`1 accessTier, Nullable`1 enabledProtocols, Nullable`1 rootSquash, Boolean async, CancellationToken cancellationToken, String operationName)
If I'm not mistaken, this is an explicit choice? It seems it's calling ClientConfiguration.Pipeline.LogException(ex); to log that exception.
If it was logged as a Warning, at least it wouldn't clog my search for real issues in Application Insights.
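For completeness, a minimal repro of the call pattern described above (the connection string and share name are placeholders):

using Azure.Storage.Files.Shares;

var shareClient = new ShareClient(connectionString, "my-share");

// The first call creates the share; the second hits the "already exists" path.
// The method returns normally, but per the behavior described above, the
// underlying 409 response is still logged as an error by the pipeline.
await shareClient.CreateIfNotExistsAsync();
await shareClient.CreateIfNotExistsAsync();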
Is there any progress on this? Seeing that this issue has existed for 2 years makes me kinda sad.
I hook into the ILogger for some custom log targets, and I get this noise as well which is incredibly annoying.
The required APIs should be available now in Azure.Core to be able to implement this feature. In order to suppress distributed tracing for a given status code, AddClassifier() should be called on RequestContext prior to passing it to the method in question. An example can be found in the TableServiceClient here.
Please reach out with any remaining questions on implementation.
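For illustration, a rough sketch of the RequestContext usage being described (the operation at the end is a placeholder; the exact surface depends on what each client library exposes):

using Azure;

// Classify 404 as a non-error so the diagnostic scope and logging
// do not report the call as failed.
var context = new RequestContext();
context.AddClassifier(404, isError: false);

// Pass the context to a method that accepts RequestContext
// (placeholder call; availability varies per client library).
Response response = await client.SomeOperationAsync(resourceName, context);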
@annelo-msft What about the Azure.Storage.Blobs client? As far as I can see, this is only implemented in the Azure.Data.Tables client, and only specifically for 404 on DeleteIfExists, 409 on CreateTableIfNotExists, and 404 on GetEntityIfExists.
Or is there a way that I, as a user of these libraries, can modify the RequestContext instead of having to rely on the (relatively slow) implementation on the Azure SDK team's part?
Hi @Archomeda - you're correct that this is only implemented in Azure.Data.Tables currently. As I understand it, this issue tracks the work to incorporate the feature in the Storage libraries. @seanmcc-msft is driving that work. My understanding of the process is that the client feature crew will work with architects to determine how to expose this functionality on those libraries. They do have the option to expose the RequestContext directly to give callers full control of the pipeline on a per-request basis.
Related (duplicate): #33690
Hi @jsquire, we deeply appreciate your input into this project. Regrettably, this issue has remained inactive for over 2 years, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.