opentelemetry-dotnet
OTLP Exporter Retry and Persist Feature
Feature Request
Is your feature request related to a problem?
When exporting data, a transient server issue that prevents the request from being processed correctly causes the data to be lost. Similarly, when a program is shut down, any data not yet exported before the process terminates is lost.
Describe the solution you'd like:
When an error is considered retryable, the export should be attempted again. On program shutdown, data not yet exported should first be saved to disk and then an export attempted, in case the transmission fails or does not have enough time to complete. On the next program execution, the telemetry saved to disk is exported. This reduces the amount of lost telemetry.
Additional Context
Add the ability for the OTLP exporters to retry exports that fail in a well-defined way, including across program shutdowns by persisting the data to disk upon failure. This will help improve the reliability of OTel from the client's end.
Original GitHub Issue: https://github.com/open-telemetry/opentelemetry-dotnet/issues/1278
The first set of PRs will focus on a single, to-be-decided cell in the following matrix; follow-up PRs will enable the others, reusing as much code as is reasonable.
| | Metrics | Logs | Traces |
|---|---|---|---|
| HTTP | x | x | x |
| gRPC | x | x | x |
Target project: `src/OpenTelemetry.Exporter.OpenTelemetryProtocol`
PR Roadmap
- OTLP Exporter Options: Opt-in.
  - `FileBlobProvider`:
    - Path: the folder location defined in the "Storage folder" section of https://github.com/open-telemetry/opentelemetry-dotnet/issues/1278
    - Max size (is new or old data dropped?)
    - Maintenance period (the 2 minute default is likely too long to release a lease)
    - Retention period
    - Write timeout
  - RetryThreadPeriodMilliseconds (default of 1000? or combine into one field together with the maintenance period?)
- Write (see the sketch after this list)
  - Each `ExportClient` will be modified.
  - If a retryable error is returned:
    - Log the error (info level).
    - Create a blob: `persistentBlobProvider.TryCreateBlob(data, RetryAfter.Milliseconds, out var blob);`
  - If a non-retryable error is returned:
    - Drop the data.
    - Log the error (warning level).
    - Increment a dropped item counter.
- Retry (see the sketch after this list)
  - Implemented as a new thread that wakes up every XX milliseconds and performs a retry on all blobs it can get a lease for:
    - `foreach (var blobItem in persistentBlobProvider.GetBlobs()) { ... }`
    - `blob.TryLease(1000); blob.TryRead(out var data);`
  - If the export succeeds, or the response says to no longer retry this blob:
    - Delete the blob: `blob.TryDelete();`
    - Log the error (warning level).
    - Increment a dropped item counter.
  - If the export is not successful, but another retryable error is returned:
    - Log the error (info level).
- Cleanup
  - This part may already be taken care of by the `FileBlobProvider` and the above Retry scenario.
- Shutdown
  - Write data to disk, then attempt to send that data.
  - Exit grace period for console-like apps. Point #3 in the original issue.
- Add `Guard`s to the `FileBlobProvider` interface. Ref
- Optimisations
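A rough sketch of how the Write and Retry steps above could hang together, assuming the `FileBlobProvider`/`PersistentBlob` API from the PersistentStorage packages referenced below; the `tryExport` callback and `RetryExportResult` enum are hypothetical placeholders for the real `ExportClient` plumbing, not existing types:

```csharp
// Sketch only, not the actual implementation. Assumes the
// OpenTelemetry.PersistentStorage.FileSystem / .Abstractions packages;
// the tryExport callback and RetryExportResult enum are hypothetical.
using System;
using System.Threading;
using OpenTelemetry.PersistentStorage.Abstractions;
using OpenTelemetry.PersistentStorage.FileSystem;

internal enum RetryExportResult { Success, Retryable, NonRetryable } // hypothetical

internal sealed class OtlpRetryWorker : IDisposable
{
    private readonly PersistentBlobProvider blobProvider;
    private readonly Func<byte[], RetryExportResult> tryExport; // hypothetical export callback
    private readonly Timer retryTimer;

    public OtlpRetryWorker(string storageFolder, Func<byte[], RetryExportResult> tryExport, int retryPeriodMilliseconds = 1000)
    {
        this.blobProvider = new FileBlobProvider(storageFolder);
        this.tryExport = tryExport;

        // The "RetryThreadPeriodMilliseconds" option from the roadmap above.
        this.retryTimer = new Timer(_ => this.RetryPendingBlobs(), null, retryPeriodMilliseconds, retryPeriodMilliseconds);
    }

    // "Write": called by the export client when a retryable error is returned.
    public void PersistForRetry(byte[] data, int retryAfterMilliseconds)
    {
        // Lease the new blob for the Retry-After period so the retry thread
        // does not pick it up too early.
        this.blobProvider.TryCreateBlob(data, retryAfterMilliseconds, out _);
    }

    // "Retry": wake up periodically and retry every blob we can get a lease for.
    private void RetryPendingBlobs()
    {
        foreach (var blob in this.blobProvider.GetBlobs())
        {
            if (!blob.TryLease(1000) || !blob.TryRead(out var data))
            {
                continue; // leased elsewhere, expired, or unreadable
            }

            var result = this.tryExport(data);
            if (result != RetryExportResult.Retryable)
            {
                // Success, or a non-retryable response: stop retrying this blob.
                blob.TryDelete();
            }
        }
    }

    public void Dispose() => this.retryTimer.Dispose();
}
```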
Retryable Errors
- gRPC
- UNAVAILABLE
- CANCELLED
- DEADLINE_EXCEEDED
- ABORTED
- OUT_OF_RANGE
- DATA_LOSS
- RESOURCE_EXHAUSTED - Nuanced, see the spec.
- HTTP
- 408: Request Timeout
- 503: Server Too Busy. Do not retry too frequently; use exponential backoff and respect the 'Retry-After' header.
- 5xx: Server Errors
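For illustration, a minimal sketch of how the classification above could be encoded; `StatusCode` is assumed to come from `Grpc.Core`, and the HTTP side only covers the codes listed here (the RESOURCE_EXHAUSTED nuance is left to the spec):

```csharp
// Sketch only: helpers mirroring the retryable-error lists above.
using System.Net;
using Grpc.Core;

internal static class RetryableErrors
{
    public static bool IsRetryable(StatusCode code) => code switch
    {
        StatusCode.Unavailable => true,
        StatusCode.Cancelled => true,
        StatusCode.DeadlineExceeded => true,
        StatusCode.Aborted => true,
        StatusCode.OutOfRange => true,
        StatusCode.DataLoss => true,
        StatusCode.ResourceExhausted => true, // nuanced, see the spec
        _ => false,
    };

    public static bool IsRetryable(HttpStatusCode code) =>
        code == HttpStatusCode.RequestTimeout // 408
        || (int)code >= 500;                  // 5xx, including 503 (back off and respect Retry-After)
}
```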
Testing Strategy
Make use of the `test/OpenTelemetry.Exporter.OpenTelemetryProtocol.Tests/MockCollectorIntegrationTests.cs` class. Some examples can be seen in this closed PR; the changes made to that file in that PR should be reusable.
References
- PersistentStorage API
- OTLP Specification
- OTLP Exporter (specifically the Retry section)
For retry, we need to consider the following cases:
- HTTP 503 Server Too Busy - we don't want to retry too frequently, probably exponential backoff, plus respecting 'Retry-After' header if there is one.
- When we retry, what is the ordering? E.g. for metrics it might make more sense to always prioritize the newest data, while for logs it might make sense to prioritize the older data.
- We might need to introduce some QoS metrics (e.g. SDK internal queue size, error count, retry count), these metrics might need to be treated as high priority, so if the exporter is too busy to handle all the data, at least we have these QoS data prioritized.
Confirming the end result here:

> - OTLP Exporter Options: Opt-in.

All of the work planned for this issue stays as opt-in, correct?
That is correct. This will be an opt-in feature until we have spec.
Is there related spec work for these options in-progress already?
> If a retryable error is returned. Log the error (info level). Create a blob.
A retry policy is useful independently from persistent storage. That is, the gRPC client (or Polly if using HTTP) could be configured with a retry policy which can handle transient network errors. This handling would be opaque to the OTLP exporter.
Do you plan to implement retry w/o also requiring the use of persistent storage?
We should do this, and it is required by the spec: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/exporter.md#retry
Persistent storage is an optional, opt-in feature.
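To illustrate the "retry without persistent storage" point, here is what a plain client-side retry policy looks like with Grpc.Net.Client. This is a hedged sketch of the general mechanism the comment refers to, not something the OTLP exporter currently exposes (its channel is created internally), and the endpoint address is only an example:

```csharp
// Sketch only: a standalone gRPC retry policy of the kind discussed above,
// independent of persistent storage.
using System;
using Grpc.Core;
using Grpc.Net.Client;
using Grpc.Net.Client.Configuration;

var retryMethodConfig = new MethodConfig
{
    Names = { MethodName.Default },
    RetryPolicy = new RetryPolicy
    {
        MaxAttempts = 5,
        InitialBackoff = TimeSpan.FromSeconds(1),
        MaxBackoff = TimeSpan.FromSeconds(5),
        BackoffMultiplier = 1.5,
        RetryableStatusCodes = { StatusCode.Unavailable },
    },
};

// Transient failures on calls made through this channel are retried
// transparently, opaque to the caller.
using var channel = GrpcChannel.ForAddress(
    "http://localhost:4317",
    new GrpcChannelOptions
    {
        ServiceConfig = new ServiceConfig { MethodConfigs = { retryMethodConfig } },
    });
```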
Just wanted to share a thought. @alanwest kind of scratched at this on the SIG yesterday, I think. What does "persistent storage" mean? In the scope of this work, it seems to me we persist the data for retries. I'm working on an exporter for some client teams, some (maybe all) of which seem to want "persistent storage" of the always & up-front variety, meaning the exporter just writes to disk and then some other thread tries to ship off the disk data on its own cadence.
Only sharing so we can be clear what kind of "persistent storage" we aim to support in OTLP and make sure the documentation is also clear 😄
Are there any plans still in motion regarding some kind of persistence support? Once `OpenTelemetry.PersistentStorage` is released, is the intention to integrate it into the OTLP exporter or to create a separate persisted OTLP exporter?
https://opentelemetry.io/docs/specs/otel/library-guidelines/#protocol-exporters
> if an application’s telemetry data must be delivered to a remote backend that has no guaranteed availability the end user may choose to use a persistent local queue and an Exporter to retry sending on failures.

This is more or less the scenario I would like to cater for, similar to what @CodeBlanch mentioned, and similar to how the collector supports it with its persistent queue: https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md#persistent-queue
Right now the issue that I experience is: if I create my own `BatchExportProcessor` using a persisted local queue and try to re-use the `OtlpTraceExporter.Export` function, I then need to serialize/deserialize `System.Diagnostics.Activity` to store it in some persistent queue somewhere, and serializing the `Activity` object does not really seem like a viable option. The only other option I am left with, unless I am missing something, is to implement the entire OTLP exporter myself, do the mapping to the proto types, and store the serialization of that format in my local file queue.
Am I correct that there is no retry by default? Would we need to inject a retry policy ourselves into the HttpClientFactory (if using HTTP) with e.g. Polly? If so, would we need to catch the error codes described in the spec? Thanks.
True. See https://github.com/open-telemetry/opentelemetry-dotnet/issues/1779. Some PRs are in flight now to make this happen automatically, so you don't have to deal with it manually.
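Until that lands, one possible stopgap is to plug a retry handler in through `OtlpExporterOptions.HttpClientFactory` for the HTTP protocol, roughly as the question suggests. This is only a sketch, assuming the Polly and Microsoft.Extensions.Http.Polly packages; the status codes and backoff are examples based on the retryable-error list above:

```csharp
// Sketch only: injecting a Polly retry handler via HttpClientFactory
// while waiting for built-in retry support.
using System;
using System.Net.Http;
using Microsoft.Extensions.Http;
using OpenTelemetry;
using OpenTelemetry.Exporter;
using OpenTelemetry.Trace;
using Polly;

// Retry 408 and 5xx responses, plus transient network errors, with exponential backoff.
var retryPolicy = Policy
    .HandleResult<HttpResponseMessage>(r => (int)r.StatusCode == 408 || (int)r.StatusCode >= 500)
    .Or<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddOtlpExporter(options =>
    {
        options.Protocol = OtlpExportProtocol.HttpProtobuf;

        // Wrap the exporter's HttpClient with the Polly handler.
        options.HttpClientFactory = () => new HttpClient(
            new PolicyHttpMessageHandler(retryPolicy)
            {
                InnerHandler = new HttpClientHandler(),
            });
    })
    .Build();
```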