
Clarify whether BatchSpanProcessor scheduledDelayMillis is between batches or entire processing of queue

Open anuraaga opened this issue 4 years ago • 6 comments

Currently we define scheduledDelayMillis, maxQueueSize, and maxExportBatchSize.

https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/sdk.md#batching-processor

It's not clear to me whether scheduledDelayMillis is supposed to be the time between exports of chunks of size maxExportBatchSize, or the delay between passes over the queue, where each pass exports as many batches as it can and then waits for the delay before the next pass. The wording in the spec seems closer to the former, but Go and Java at least seem to implement the latter.

anuraaga avatar Aug 21 '20 04:08 anuraaga

Hey - this question came up. Does anyone happen to have any suggestions on how to handle schedule delay?

anuraaga avatar Mar 15 '21 23:03 anuraaga

It's not clear to me whether scheduledDelayMillis is supposed to be time in between exports of chunks of size maxExportBatchSize, or if it's the delay between processing the queue

IMHO it has to be the latter, otherwise you would be severely limiting throughput.

EDIT: The first time you see more than one batch worth of spans in the queue, it means that in the last scheduledDelayMillis, more than one batch worth of spans was added. Assuming this was not a temporary spike, the queue will only keep growing, as waiting between every batch does little to make it shrink.

Oberon00 avatar Mar 19 '21 08:03 Oberon00
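Oberon00's point can be illustrated with a toy back-of-the-envelope simulation (hypothetical numbers and function, not any real SDK behavior): if the processor always waits the full delay between batches, any arrival rate above one batch per delay makes the queue grow without bound.

```python
def queue_after(seconds, rate, batch_size, delay):
    """Toy model: spans arrive at `rate` per second, and the processor
    exports at most one batch of `batch_size` spans every `delay`
    seconds, always waiting between batches."""
    queue = 0
    for t in range(seconds):
        queue += rate                      # spans arriving this second
        if t % delay == 0:                 # one batch per delay window
            queue -= min(queue, batch_size)
    return queue

# 200 spans/s against one 512-span batch every 5 s: each window adds
# 5 * 200 - 512 = 488 spans that are never drained, so the backlog
# keeps growing the longer the load is sustained.
```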

@Oberon00 Yeah, I think that makes sense - it actually was a silly question. I think what I meant to ask was this other one: should we eagerly export batches? Currently Java exports as soon as the queue exceeds maxExportBatchSize, regardless of the interval - doing so could make the BSP more active than one might want at low throughput. At high throughput there isn't a big difference, since you'd be continually processing the queue anyway. But there does seem to be a middle ground where performance could degrade; not sure though.

Edit: I guess the main concern is that with eager exporting, the user loses some control. If they want faster exports, they can just reduce the schedule delay, but there's no way to get slower exports if we default to eager behavior.

anuraaga avatar Mar 19 '21 08:03 anuraaga

+1 in favor of eager exporting

iNikem avatar Mar 21 '21 09:03 iNikem

I think the wording should be something akin to:

scheduledDelayMillis - the maximum delay interval in milliseconds between two consecutive exports. The default value is 5000.

The current version is a bit ambiguous (it could even be read as a loop where exports happen only after each scheduledDelayMillis). Most SDKs implement eager batch exporting: Java, Ruby, Swift, Python

Stumbled upon this in the JS SDK, where exporting happens on exactly the scheduledDelayMillis interval. This means that with the default configuration (512 spans every 5 s) span throughput is limited to ~100 spans per second under constant load, which is a very low number. All other spans are silently dropped.

Agree with eager exporting 😄

seemk avatar Jul 14 '22 12:07 seemk
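The throughput ceiling seemk mentions falls directly out of the spec defaults; the numbers below are those defaults, and the calculation is just arithmetic:

```python
max_export_batch_size = 512   # spans per export (spec default)
scheduled_delay = 5.0         # seconds between exports (spec default, 5000 ms)

# If exactly one batch is exported per delay interval, sustained
# throughput can never exceed:
max_throughput = max_export_batch_size / scheduled_delay
print(max_throughput)  # 102.4 spans per second
```

Any sustained load above that rate fills maxQueueSize and every further span is dropped.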

I am also in support of eager exporting. I see 2 obvious ways to do it:

  1. When the scheduledDelayMillis timer is triggered, export as many batches as required to empty the queue
  2. When there are maxExportBatchSize spans in the queue, trigger an export even if it means exporting early

dyladan avatar Jul 14 '22 13:07 dyladan
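A minimal sketch of dyladan's option 2 (the class name `EagerBatchProcessor` and the `export` callback are hypothetical, not the actual SDK API): when `on_end` fills a batch, it is exported immediately rather than waiting for the timer. The periodic timer of option 1 is omitted here for brevity.

```python
class EagerBatchProcessor:
    """Toy sketch of option 2: export as soon as a full batch
    accumulates, instead of waiting for scheduledDelayMillis."""

    def __init__(self, export, max_export_batch_size=512):
        self._export = export          # callable taking a list of spans
        self._batch_size = max_export_batch_size
        self._queue = []

    def on_end(self, span):
        self._queue.append(span)
        # Don't wait for the timer once a full batch is available.
        if len(self._queue) >= self._batch_size:
            self._flush_one_batch()

    def _flush_one_batch(self):
        batch = self._queue[:self._batch_size]
        self._queue = self._queue[self._batch_size:]
        self._export(batch)
```

For example, with a batch size of 4, ending 9 spans would trigger two immediate exports of 4 spans each and leave one span queued for the next timer tick.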

@open-telemetry/technical-committee I think this is quite an important issue that likely affects many SDKs. As @morigs pointed out in the related JS issue https://github.com/open-telemetry/opentelemetry-js/issues/3094, the current behavior effectively works as a rate limiter, which I believe is not the intended effect. JS would like to implement a fix for this, but we don't want to do something against the spec.

dyladan avatar Nov 07 '22 14:11 dyladan

imo the export should happen once the desired batch size is reached or the delay timer ticks, whichever is earlier. The former protects against unnecessarily large batches, the latter protects against sitting on the data that trickles in slowly.

yurishkuro avatar Nov 07 '22 15:11 yurishkuro

imo the export should happen once the desired batch size is reached or the delay timer ticks, whichever is earlier. The former protects against unnecessarily large batches, the latter protects against sitting on the data that trickles in slowly.

This. This is also exactly what we do in the Collector if I remember correctly.

If the spec does not describe this behavior I would argue it is a spec bug.

tigrannajaryan avatar Nov 08 '22 23:11 tigrannajaryan
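The "whichever is earlier" behavior can be sketched with a condition variable whose wait has a timeout: the worker wakes up either when a batch fills or when the delay elapses. This is an illustrative sketch, not the Collector's or any SDK's real implementation; all names here are made up.

```python
import threading

class TimerOrBatchWorker:
    """Export when a full batch is queued OR the delay ticks,
    whichever comes first."""

    def __init__(self, export, max_export_batch_size=512, scheduled_delay=5.0):
        self._export = export          # callable taking a list of spans
        self._batch_size = max_export_batch_size
        self._delay = scheduled_delay  # seconds
        self._queue = []
        self._cond = threading.Condition()
        self._stopped = False

    def on_end(self, span):
        with self._cond:
            self._queue.append(span)
            if len(self._queue) >= self._batch_size:
                self._cond.notify()    # wake the worker early

    def _worker(self):
        while True:
            with self._cond:
                if not self._stopped and len(self._queue) < self._batch_size:
                    # Sleep until a batch fills or the delay elapses.
                    self._cond.wait(timeout=self._delay)
                if self._stopped and not self._queue:
                    return
                batch = self._queue[:self._batch_size]
                self._queue = self._queue[self._batch_size:]
            if batch:
                self._export(batch)    # export outside the lock

    def start(self):
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def shutdown(self):
        with self._cond:
            self._stopped = True
            self._cond.notify()
        self._thread.join()
```

Exporting outside the lock keeps `on_end` cheap on the hot path; slow data that never fills a batch still goes out at most `scheduled_delay` seconds after it arrives.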

There is also ForceFlush, which we might want to cover. In metrics, this is what we tried: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#periodic-exporting-metricreader.

reyang avatar Nov 09 '22 00:11 reyang