opentelemetry-python
OTLPMetricExporter fails to send more than 4MB of data
Describe your environment
Python 3.6.8
OpenTelemetry Python 1.12.0
OpenTelemetry Collector 0.38.0
Steps to reproduce
Metrics SDK with quite a few observables generating more than 4MB of data (data points)
What is the expected behavior?
The data points are sent to the collector without any problem.
What is the actual behavior?
The export to the collector fails with StatusCode.RESOURCE_EXHAUSTED. The exporter keeps on sending the same batch over and over until the data gets dropped.
Additional context
One solution would be to have a configurable "max batch size" in the OTLPMetricExporter, like there is today in the BatchLogProcessor for logs. Another solution would be for the OTLP gRPC exporter to automatically retry with a smaller batch if it receives a StatusCode.RESOURCE_EXHAUSTED?
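For illustration, a rough sketch of the kind of chunking a configurable max batch size could do (split_into_batches and the flat data_points list are hypothetical, not an existing exporter API):

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def split_into_batches(items: Iterable[T], max_batch_size: int) -> Iterator[List[T]]:
    """Yield successive chunks of at most max_batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, max_batch_size))
        if not batch:
            return
        yield batch

# Hypothetical usage: export 5000 data points in chunks of 1000
# for batch in split_into_batches(data_points, max_batch_size=1000):
#     exporter.export(batch)
```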
Thanks for trying out the metrics SDK 🙂
The difference between metrics and traces/logs here is that all the metrics come in at once and there is no batching in the SDK. It simply evaluates all of the observable instruments, and that is what causes the issue. Do folks think we should add this batching mechanism to the PeriodicExportingMetricReader so it can buffer metrics into the exporter?
Another solution would be for the OTLP gRPC exporter to automatically retry with a smaller batch if it receives a StatusCode.RESOURCE_EXHAUSTED ?
+1, there is this proposal https://github.com/open-telemetry/opentelemetry-proto/pull/390, which I believe would make this work?
Do folks think we should add this batching mechanism to the PeriodicExportingMetricReader so it can buffer metrics into the exporter?
How does this help? The volume of data that reaches the exporter would still be the same size for each export cycle, right?
It would serve as a place to configure the batch size, and then the PeriodicExportingMetricReader would call the exporter once per batch. @srikanthccv would you prefer to just have the individual exporters handle batching on their own?
Yes, I think we already do this in some exporters which take care of protocol/encoding-specific limits. I would like to hear more about the batching in metrics. I am trying to understand: does each collect only limit the number of points collected and then call the exporter, or does the collection get all the data at once and then call the exporter multiple times?
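To make the two readings concrete, a rough sketch, with collect and export as hypothetical callables standing in for the SDK internals:

```python
def limit_per_collect(collect, export, max_batch_size):
    # Reading 1: each collect() call is itself limited to max_batch_size
    # points, and the reader loops until nothing is left to collect.
    while True:
        points = collect(limit=max_batch_size)
        if not points:
            break
        export(points)

def collect_then_chunk(collect, export, max_batch_size):
    # Reading 2: one collect() gathers everything, then the exporter is
    # called once per chunk of at most max_batch_size points.
    points = collect()
    for start in range(0, len(points), max_batch_size):
        export(points[start:start + max_batch_size])
```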
We briefly discussed this today. We should look at the spec for the correct status code for errors related to payload size, and see whether the response from the collector includes the acceptable size, so the batch can be divided into chunks before exporting.
It would be great to have this kind of behavior, but I think we should also be able to configure a "max batch size" in the OTLPMetricExporter. If at each interval I generate 5000 data points for a size of 6MB, I don't want the first export request to fail every time. The automatic downsizing of the batch when receiving StatusCode.RESOURCE_EXHAUSTED would be great for sporadic errors.
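A rough sketch of that automatic downsizing, assuming a hypothetical export_batch callable that raises grpc.RpcError on failure:

```python
import grpc

def export_with_downsizing(export_batch, batch, min_batch_size=1):
    """Halve the batch and retry both halves when the server reports
    RESOURCE_EXHAUSTED; re-raise any other error."""
    try:
        export_batch(batch)
    except grpc.RpcError as err:
        if err.code() != grpc.StatusCode.RESOURCE_EXHAUSTED or len(batch) <= min_batch_size:
            raise
        mid = len(batch) // 2
        export_with_downsizing(export_batch, batch[:mid], min_batch_size)
        export_with_downsizing(export_batch, batch[mid:], min_batch_size)
```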
It would be great to have this kind of behavior, but I think we should also be able to configure a "max batch size" in the OTLPMetricExporter.
@overmeulen any chance you'd be willing to send a PR for this?
Sure. So the idea would be to do the fix directly in the gRPC exporter? https://github.com/open-telemetry/opentelemetry-python/blob/main/exporter/opentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/metric_exporter/__init__.py
Thanks, I'll assign you the issue. We haven't implemented the HTTP exporter yet, so that seems reasonable to me.
Is there some env var or something similar spec'd to configure this max batch size?
I was thinking of doing something similar to BatchLogProcessor: https://github.com/open-telemetry/opentelemetry-python/blob/main/opentelemetry-sdk/src/opentelemetry/sdk/_logs/export/__init__.py#L146
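A simplified sketch of that kind of split for metrics. The real MetricsData is nested (resource -> scope -> metric -> data points), so the actual change is more involved; the Metric class below is a hypothetical flat stand-in:

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class Metric:
    # Hypothetical flat stand-in for the SDK's nested metrics model.
    name: str
    data_points: List[dict]

def split_metrics(metrics: List[Metric], max_export_batch_size: int) -> Iterator[List[Metric]]:
    """Yield batches whose total data-point count stays under the limit,
    splitting an individual metric's points across batches if needed."""
    batch: List[Metric] = []
    count = 0
    for metric in metrics:
        points = metric.data_points
        while points:
            room = max_export_batch_size - count
            take, points = points[:room], points[room:]
            batch.append(Metric(metric.name, take))
            count += len(take)
            if count >= max_export_batch_size:
                yield batch
                batch, count = [], 0
    if batch:
        yield batch
```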
PR created and ready to be reviewed
Just adding the discussion from PR #2809, which adds a max_export_batch_size. @overmeulen said:
I don't really have a recommendation for the max_export_batch_size; it highly depends on the type of metrics and the number of attributes... Batching on the byte size would indeed be better but much more complex. The idea here was to add a first level of protection against this 4MB limit, but as you said it won't completely prevent you from reaching the limit from time to time.
We are going ahead with this for now to keep it simple. Two alternatives would be:
- Calling ByteSize() on the protobufs to check the request doesn't exceed 4MB (or a configurable limit) before sending. This could be computationally expensive if not done carefully, since byte-size calculation is recursive (I believe the protobuf lib does cache this, though), but it would keep the batches as large as possible.
- Splitting the original request into chunks after receiving a RESOURCE_EXHAUSTED response, as mentioned in https://github.com/open-telemetry/opentelemetry-python/issues/2710#issuecomment-1139342126
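For reference, ByteSize() is an existing method on protobuf messages; a pre-send check along those lines might look like this (the request itself is assumed to be built elsewhere):

```python
# Default gRPC message size limit enforced by the collector.
MAX_BYTES = 4 * 1024 * 1024

def within_limit(request) -> bool:
    # ByteSize() returns the serialized size in bytes; it is reportedly
    # cached by the protobuf runtime, but the first call walks the message.
    return request.ByteSize() <= MAX_BYTES
```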
There's also the option of sending requests in parallel, whereas #2809 sends each chunk serially.
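A sketch of that parallel variant, with export_chunk standing in for a hypothetical single-chunk export call:

```python
from concurrent.futures import ThreadPoolExecutor

def export_chunks_in_parallel(export_chunk, chunks, max_workers=4):
    # Send each chunk on its own worker thread instead of one after another.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(export_chunk, chunks))
```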
See also https://github.com/open-telemetry/opentelemetry-specification/issues/2772