Send RetryInfo on OTel Timeouts
Description
Data Prepper sends RESOURCE_EXHAUSTED gRPC responses whenever a buffer is full or a circuit breaker is active. These statuses do not contain a RetryInfo. In the OpenTelemetry protocol, this implies a non-retryable error, which leads to dropped messages, e.g. in the OTel Collector. To apply proper back pressure in these scenarios, this PR adds a RetryInfo to the status.
Issues Resolved
Resolves #4119
Check List
- [x] New functionality includes testing.
- [ ] New functionality has a documentation issue. Please link to it in this PR.
- [ ] New functionality has javadoc added
- [x] Commits are signed with a real name per the DCO
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
@dlvenable this PR shows how to add a RetryInfo. It is lacking a proper determination of a retry delay, which is currently hard-coded to 100ms. I can progress with my proposal from https://github.com/opensearch-project/data-prepper/issues/4119#issuecomment-1956608213. It would be nice to get configuration data for an initial delay into GrpcRequestExceptionHandler. Is there an example to go by?
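To make the open question about the retry delay more concrete, here is a minimal sketch of one possible approach: an exponential backoff that starts at an initial delay and is capped at a maximum delay. This is an illustration only, not the proposal from the linked issue comment and not the code in this PR. The class name `BackoffDelayCalculator` and the method `delayFor` are hypothetical, and doubling per consecutive rejection is an assumption.

```java
import java.time.Duration;

// Hypothetical helper for illustration; not part of Data Prepper.
final class BackoffDelayCalculator {
    private final Duration initialDelay;
    private final Duration maxDelay;

    BackoffDelayCalculator(Duration initialDelay, Duration maxDelay) {
        if (initialDelay.isZero() || initialDelay.isNegative()) {
            throw new IllegalArgumentException("initialDelay must be positive");
        }
        if (maxDelay.compareTo(initialDelay) < 0) {
            throw new IllegalArgumentException("maxDelay must not be smaller than initialDelay");
        }
        this.initialDelay = initialDelay;
        this.maxDelay = maxDelay;
    }

    // Returns initialDelay * 2^consecutiveRejections, capped at maxDelay.
    // The doubling strategy is an assumption made for this sketch.
    Duration delayFor(int consecutiveRejections) {
        if (consecutiveRejections < 0) {
            throw new IllegalArgumentException("consecutiveRejections must not be negative");
        }
        Duration delay = initialDelay;
        for (int i = 0; i < consecutiveRejections; i++) {
            if (delay.compareTo(maxDelay) >= 0) {
                break; // already at the cap, stop doubling to avoid overflow
            }
            delay = delay.multipliedBy(2);
        }
        return delay.compareTo(maxDelay) > 0 ? maxDelay : delay;
    }
}
```

With an initial delay of 100ms and a maximum of 2s, the first rejection would suggest the current hard-coded value of 100ms, and repeated rejections would back off until they reach the cap. The resulting duration could then be used as the retry delay in the RetryInfo.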
@KarstenSchnitter, what do you think about this configuration?
source:
  - otel_traces_source:
      retry_info:
        initial_delay: 100ms
        max_delay: 2s
Here is a code example of a nested configuration:
https://github.com/opensearch-project/data-prepper/blob/6a30c6f4823cc45c3d6c63651871179cdf1e19dc/data-prepper-plugins/s3-source/src/main/java/org/opensearch/dataprepper/plugins/source/s3/S3SourceConfig.java#L45-L47
The actual implementation is SqsOptions which is another simple POJO class.
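For the `retry_info` block suggested above, a comparable POJO might look roughly like the following sketch. The class name `RetryInfoOptions` and the helper `parseSimpleDuration` are hypothetical. In the actual plugin the fields would presumably be bound by the configuration framework (for example through Jackson annotations), which is left out here so that the example has no external dependencies.

```java
import java.time.Duration;

// Hypothetical options class for illustration; names are assumptions.
final class RetryInfoOptions {
    private final Duration initialDelay;
    private final Duration maxDelay;

    RetryInfoOptions(Duration initialDelay, Duration maxDelay) {
        if (maxDelay.compareTo(initialDelay) < 0) {
            throw new IllegalArgumentException("max_delay must not be smaller than initial_delay");
        }
        this.initialDelay = initialDelay;
        this.maxDelay = maxDelay;
    }

    Duration getInitialDelay() {
        return initialDelay;
    }

    Duration getMaxDelay() {
        return maxDelay;
    }

    // Parses simple values such as "100ms" or "2s".
    // Stand-in for whatever duration parsing the framework provides.
    static Duration parseSimpleDuration(String text) {
        String trimmed = text.trim();
        if (trimmed.endsWith("ms")) {
            return Duration.ofMillis(Long.parseLong(trimmed.substring(0, trimmed.length() - 2)));
        }
        if (trimmed.endsWith("s")) {
            return Duration.ofSeconds(Long.parseLong(trimmed.substring(0, trimmed.length() - 1)));
        }
        throw new IllegalArgumentException("Unsupported duration: " + text);
    }
}
```

The source configuration class would then hold a field of this type, in the same way that S3SourceConfig holds its SqsOptions.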
@KarstenSchnitter, what do you have remaining to make this PR ready for review? We did discuss making it configurable, but is there anything else to add?
I am mostly lacking time to make the required changes :wink::
- merge last changes on main
- configuration of the minimal and maximum delay
- integration test
- manual test with OTel Collector
I got help from Tomas Longo, who provided the missing configuration and tests. We also tested that the RetryInfo is correctly picked up by the OpenTelemetry Collector. With this change, Data Prepper exerts back pressure when the circuit breakers are active.
There was a slight issue with the initialisation of the RetryCalculator that showed up in the tests. This is now fixed. @dlvenable: Can you take another look at this change?
@dlvenable I renamed the tests. Can you have another look? I think that the build failures are caused by different components, not by this changeset.