
OTLPMetricExporter fails to send more than 4MB of data

Open · overmeulen opened this issue on May 24 '22 · 12 comments

Describe your environment
Python 3.6.8
Opentelemetry Python 1.12.0
Opentelemetry Collector 0.38.0

Steps to reproduce
Metrics SDK with quite a few observables generating more than 4MB of data (data points)

What is the expected behavior?
The data points are sent to the collector without any problem.

What is the actual behavior?
The export to the collector fails with StatusCode.RESOURCE_EXHAUSTED. The exporter keeps sending the same batch over and over until the data gets dropped.

Additional context
One solution would be to have a configurable "max batch size" in the OTLPMetricExporter, like there is today in the BatchLogProcessor for logs. Another solution would be for the OTLP gRPC exporter to automatically retry with a smaller batch if it receives a StatusCode.RESOURCE_EXHAUSTED?
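
For illustration, a minimal sketch of what the first idea could look like: fixed-size chunking of the collected data points before export. Both `MAX_EXPORT_BATCH_SIZE` and `split_into_chunks` are hypothetical names, not part of the SDK:

```python
# Hypothetical sketch only: split collected data points into fixed-size
# chunks before export. Neither name below exists in the SDK.
from itertools import islice

MAX_EXPORT_BATCH_SIZE = 1000  # illustrative default, not a spec'd value

def split_into_chunks(data_points, size=MAX_EXPORT_BATCH_SIZE):
    it = iter(data_points)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk  # the exporter would then send one request per chunk
```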

overmeulen avatar May 24 '22 07:05 overmeulen

Thanks for trying out the metrics SDK 🙂

The difference between metrics and traces/logs here is that all the metrics come in at once and there is no batching in the SDK. It simply evaluates all of the observable instruments, and that's what causes the issue. Do folks think we should add this batching mechanism to the PeriodicExportingMetricReader so it can buffer metrics into the exporter?
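
For context, a minimal sketch of the setup being discussed, using the public SDK API: the reader collects all observable instruments at once on each tick and hands the entire batch to the exporter in a single call.

```python
# Minimal sketch of the setup under discussion, using the public SDK API.
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(),           # one export() call per collection cycle
    export_interval_millis=60_000,  # collect and export once per minute
)
provider = MeterProvider(metric_readers=[reader])
```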

Another solution would be for the OTLP gRPC exporter to automatically retry with a smaller batch if it receives a StatusCode.RESOURCE_EXHAUSTED?

+1, there is this proposal https://github.com/open-telemetry/opentelemetry-proto/pull/390 which I believe would make this work?

aabmass avatar May 24 '22 16:05 aabmass

Do folks think we should add this batching mechanism to the PeriodicExportingMetricReader so it can buffer spans into the exporter?

How does this help? The volume of data that reaches the exporter would still be the same for each export cycle, right?

srikanthccv avatar May 24 '22 17:05 srikanthccv

It would be a place to configure the batch size; the PeriodicExportingMetricReader would then call the exporter once per batch. @srikanthccv would you prefer to just have the individual exporters handle batching on their own?

aabmass avatar May 24 '22 18:05 aabmass

Yes, I think we already do this in some exporters that take care of protocol/encoding-specific limits. I would like to hear more about the batching in metrics. I am trying to understand whether each collect would only limit the number of points collected and then call the exporter, or whether the collection would get all the data at once and then call the exporter multiple times.

srikanthccv avatar May 24 '22 18:05 srikanthccv

We briefly discussed this today. We should look at the spec for the correct status code for errors related to payload size, and see whether the response from the collector includes the acceptable size, so the batch can be divided into chunks before exporting.
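
As a hedged sketch of the retry-with-smaller-batch idea (not the SDK's implementation; `do_export` stands in for the actual gRPC stub call): on RESOURCE_EXHAUSTED, halve the batch and retry each half.

```python
# Hedged sketch: on a RESOURCE_EXHAUSTED gRPC error, halve the batch and
# retry each half. do_export is a stand-in for the real export call.
import grpc

def export_with_split(data_points, do_export):
    try:
        do_export(data_points)
    except grpc.RpcError as err:
        if err.code() != grpc.StatusCode.RESOURCE_EXHAUSTED or len(data_points) <= 1:
            raise  # not a size problem, or nothing left to split
        mid = len(data_points) // 2
        export_with_split(data_points[:mid], do_export)
        export_with_split(data_points[mid:], do_export)
```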

srikanthccv avatar May 26 '22 18:05 srikanthccv

It would be great to have this kind of behavior, but I think we should also be able to configure a "max batch size" in the OTLPMetricExporter. If I generate 5000 data points (about 6MB) at each interval, I don't want the first export request to fail every time. Automatic downsizing of the batch when receiving StatusCode.RESOURCE_EXHAUSTED would be great for sporadic errors.

overmeulen avatar May 27 '22 07:05 overmeulen

It would be great to have this kind of behavior, but I think we should also be able to configure a "max batch size" in the OTLPMetricExporter.

@overmeulen any chance you'd be willing to send a PR for this?

aabmass avatar Jun 23 '22 16:06 aabmass

Sure. So the idea would be to do the fix directly in the gRPC exporter? https://github.com/open-telemetry/opentelemetry-python/blob/main/exporter/opentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/metric_exporter/__init__.py

overmeulen avatar Jun 23 '22 16:06 overmeulen

Thanks, I'll assign you the issue. We haven't implemented the HTTP exporter yet, so that seems reasonable to me.

aabmass avatar Jun 23 '22 17:06 aabmass

Is there some env var or something similar spec'd to configure this max batch size?

srikanthccv avatar Jun 23 '22 19:06 srikanthccv

I was thinking of doing something similar to BatchLogProcessor https://github.com/open-telemetry/opentelemetry-python/blob/main/opentelemetry-sdk/src/opentelemetry/sdk/_logs/export/__init__.py#L146
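
For reference, a sketch of the BatchLogProcessor knob being mirrored; class and parameter names are from the 1.12-era SDK, where the logs API was still private (these classes have since been renamed, e.g. to BatchLogRecordProcessor):

```python
# Names as of the 1.12-era SDK; the logs API was still private then and
# these classes have since been renamed.
from opentelemetry.sdk._logs.export import BatchLogProcessor, ConsoleLogExporter

processor = BatchLogProcessor(
    ConsoleLogExporter(),
    max_export_batch_size=512,  # flush at most 512 records per export call
)
```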

overmeulen avatar Jun 23 '22 19:06 overmeulen

PR created and ready to be reviewed

overmeulen avatar Jul 06 '22 13:07 overmeulen

Just adding the discussion from PR #2809, which adds a max_export_batch_size. @overmeulen said:

I don't really have a recommendation for the max_export_batch_size, it highly depends on the type of metrics and the number of attributes... Batching on the byte size would indeed be better but much more complex. The idea here was to add a first level of protection against this 4MB limit but as you said it won't completely prevent you from reaching the limit from time to time.

We are going ahead with this for now to keep it simple. Two alternatives would be:

  • Calling ByteSize() on the protobufs to check that the request doesn't exceed 4MB (or a configurable limit) before sending; see the sketch after this list. This could be computationally expensive if not done carefully, since byte-size calculation is recursive (I believe the protobuf lib does cache it, though), but it would keep the batches as large as possible.
  • Split the original requests into chunks after receiving a RESOURCE_EXHAUSTED response as mentioned in https://github.com/open-telemetry/opentelemetry-python/issues/2710#issuecomment-1139342126
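
A rough sketch of the first alternative, gating on the serialized size of the real OTLP request proto; `fits_in_one_request` and `MAX_REQUEST_BYTES` are illustrative names only, not SDK code:

```python
# Rough sketch only: check the serialized size of the OTLP request proto
# before sending. The request type is the real OTLP proto; the helper and
# constant below are illustrative.
from opentelemetry.proto.collector.metrics.v1.metrics_service_pb2 import (
    ExportMetricsServiceRequest,
)

MAX_REQUEST_BYTES = 4 * 1024 * 1024  # gRPC's common 4MB default limit

def fits_in_one_request(request: ExportMetricsServiceRequest) -> bool:
    # ByteSize() walks the message recursively, but the protobuf runtime
    # caches the result, so repeated calls are cheap.
    return request.ByteSize() <= MAX_REQUEST_BYTES
```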

There's also the option of sending requests in parallel, whereas #2809 sends each chunk serially.
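
A sketch of that parallel option, purely illustrative and not part of #2809; `export_chunk` stands in for the per-chunk gRPC call:

```python
# Illustrative only: fan the chunks out to a thread pool instead of
# exporting them serially.
from concurrent.futures import ThreadPoolExecutor

def export_chunks_in_parallel(chunks, export_chunk, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Iterating the results re-raises any per-chunk exception here.
        for _ in pool.map(export_chunk, chunks):
            pass
```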

aabmass avatar Sep 07 '22 15:09 aabmass

See also https://github.com/open-telemetry/opentelemetry-specification/issues/2772

aabmass avatar Sep 07 '22 15:09 aabmass