Instrument sink batching

Open spencergilbert opened this issue 3 years ago • 9 comments

Following on from our buffer instrumentation, we should also look to instrument batches. Given that batches are flushed on a number of conditions, additional insight would help operators optimize their pipelines.

Batch sizing, number in-flight, etc.

spencergilbert avatar Oct 20 '21 16:10 spencergilbert

This would be a very welcome feature for our team, especially coupled with the recent instrumentation for buffers. We make heavy use of the HTTP sink, and our Vector pipelines are often scaled to multiple replicas. For tuning, it is therefore very important to see how large the buffers/batches used by the HTTP sink are, especially in real time.

For Batches, probably these metrics should at least be considered:

  • batch_events: number of events in the batch.
  • batch_byte_size: number of bytes in the batch.
  • batch_age: amount of time the batch has been under construction.

Due to concurrency, and because each HTTP request has its own batch to send, these could perhaps be histograms instead of gauges, but I'm not sure which you think is better. I don't think batches have a notion of dropped events (that applies to buffers), so I guess there is no need to instrument that metric.

In addition, it would be very useful to also export the configured max_events, max_bytes and timeout_secs values (similar to the buffer instrumentation). This would also make it possible to calculate percentages, i.e. how full are the batches on average?
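The fill-percentage idea above can be sketched as follows. This is an illustrative calculation only: `batch_events` and `max_events` are the metric/setting names proposed in this comment, not metrics Vector actually exports.

```python
# Sketch: derive batch "fullness" from the proposed batch_events metric
# and the configured max_events limit. Names are illustrative only.

def batch_fill_percent(batch_events: int, max_events: int) -> float:
    """How full a batch was at flush time, as a percentage of its limit."""
    if max_events <= 0:
        raise ValueError("max_events must be positive")
    return 100.0 * batch_events / max_events

# e.g. a batch flushed by timeout with 250 of 1000 allowed events
print(batch_fill_percent(250, 1000))  # 25.0
```

A low average fill percentage combined with many timeout flushes would suggest the batch timeout, not the size limits, is driving flushes.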

hhromic avatar Oct 20 '21 16:10 hhromic

Porting over some of the details from a duplicate ticket that I listed, these are the metrics I would want to see come out of any work to add metrics to the batching process:

  • (gauge) total number of pending batches
  • (gauge) total number of events in pending batches
  • (gauge) total size of pending batches
  • (histogram) batch TTL (how long a batch lives before being flushed, either due to max limits or timeout)
  • (counter) total batches created
  • (counter) total batches flushed, by status (did it hit max bytes? max events? timeout?)

(Some of these overlap with @hhromic's comment, obviously.)
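To make the proposed metric set concrete, here is a minimal sketch of a batcher instrumented with the counters, gauge inputs, and histogram samples listed above. This is not Vector's implementation; all names, limits, and the flush-reason labels are illustrative.

```python
import time
from collections import Counter

class InstrumentedBatcher:
    """Illustrative sketch of the proposed batch metrics (not Vector code).
    Flushes on max_events, max_bytes, or timeout, and records why."""

    def __init__(self, max_events: int, max_bytes: int, timeout_secs: float):
        self.max_events = max_events
        self.max_bytes = max_bytes
        self.timeout_secs = timeout_secs
        self.events: list[bytes] = []
        self.bytes = 0
        self.created_at = 0.0
        # Proposed instrumentation:
        self.batches_created = 0            # counter: total batches created
        self.flushed_by_reason = Counter()  # counter, labeled by flush reason
        self.batch_ttl_samples = []         # histogram: batch lifetimes (secs)

    def push(self, event: bytes):
        """Add an event; returns the flushed batch if a limit was hit."""
        if not self.events:
            self.created_at = time.monotonic()
            self.batches_created += 1
        self.events.append(event)
        self.bytes += len(event)
        if len(self.events) >= self.max_events:
            return self._flush("max_events")
        if self.bytes >= self.max_bytes:
            return self._flush("max_bytes")
        return None

    def tick(self):
        """Call periodically; flushes the pending batch once it is too old."""
        if self.events and time.monotonic() - self.created_at >= self.timeout_secs:
            return self._flush("timeout")
        return None

    def _flush(self, reason: str):
        self.flushed_by_reason[reason] += 1
        self.batch_ttl_samples.append(time.monotonic() - self.created_at)
        batch, self.events, self.bytes = self.events, [], 0
        return batch
```

The pending gauges from the list above (pending batches, events, bytes) would simply read `len(self.events)` and `self.bytes` across all in-flight batchers.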

tobz avatar Dec 08 '21 16:12 tobz

hi! any updates here? this type of metric would be super-useful to us.

csjiang avatar Nov 21 '22 23:11 csjiang

No, we have not yet prioritized this work.

bruceg avatar Nov 21 '22 23:11 bruceg

Suggestions from a user here: https://github.com/vectordotdev/vector/issues/20284

jszwedko avatar Apr 11 '24 13:04 jszwedko

I think the docs for buffering and batching should also be extended, as it is very difficult to understand the relation between buffer and batch limits.

If a user configures the max batch size to be larger than the buffer size, what will happen? If there is a direct relation, then Vector should throw a warning when the buffer size is lower than the batch limit, and should recommend that the buffer size be at least the same as the batch size, or a multiple of it (e.g. batch size * 2 to have some read-ahead from sources).
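The warning suggested here could look something like the following sketch. The function and field names (`buffer_max_events`, `batch_max_events`) are hypothetical, not Vector's actual config keys, and whether such a check is even appropriate depends on the buffer/batch relationship discussed below.

```python
import warnings

def check_buffer_vs_batch(buffer_max_events: int, batch_max_events: int,
                          concurrency: int = 1) -> None:
    """Hypothetical config check: warn when the buffer cannot hold even one
    full batch per in-flight request. Names are illustrative only."""
    needed = batch_max_events * concurrency
    if buffer_max_events < needed:
        warnings.warn(
            f"buffer ({buffer_max_events} events) is smaller than "
            f"batch size x concurrency ({needed}); consider raising it"
        )
```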

Also, how does this work when ARC or concurrency > 1 is used (then it should presumably be batch size * expected max concurrency)?

And one last thing: if a user has one source (e.g. Kafka) and multiple sinks (ES, S3) with different buffer and batch sizes, will the sink with the smaller buffer throttle the other one?

fpytloun avatar Apr 12 '24 09:04 fpytloun

I think the docs for buffering and batching should also be extended, as it is very difficult to understand the relation between buffer and batch limits.

Agreed, the docs could be expanded. Putting some responses here in the meanwhile.

If a user configures the max batch size to be larger than the buffer size, what will happen? If there is a direct relation, then Vector should throw a warning when the buffer size is lower than the batch limit, and should recommend that the buffer size be at least the same as the batch size, or a multiple of it (e.g. batch size * 2 to have some read-ahead from sources).

Sink buffers are decoupled from batching. That is, the buffer simply feeds events into the sink as it receives them and as the sink fetches them. The sink then batches those events in memory.

There is this diagram that might help: https://vector.dev/docs/reference/configuration/sinks/vector/#buffers-and-batches
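The decoupling described above can be sketched as a bounded queue (the buffer) that the sink drains into independent in-memory batches. All sizes and names here are illustrative, not Vector defaults.

```python
from collections import deque

# Sketch: the buffer is a bounded queue the sink drains; batching happens
# afterwards, inside the sink. Buffer and batch sizes are independent.
buffer = deque(maxlen=500)   # hypothetical sink buffer: 500 events
BATCH_MAX_EVENTS = 200       # hypothetical batch limit

def drain_one_batch(buf: deque) -> list:
    """The sink pulls whatever the buffer has, up to one batch's worth."""
    batch = []
    while buf and len(batch) < BATCH_MAX_EVENTS:
        batch.append(buf.popleft())
    return batch

buffer.extend(range(350))
first = drain_one_batch(buffer)   # 200 events: hit the batch limit
second = drain_one_batch(buffer)  # 150 events: buffer ran dry first
```

The second batch shows why the two limits don't have to match: a batch can flush smaller than its limit when the buffer empties (or, in Vector, when the batch timeout fires).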

Also, how does this work when ARC or concurrency > 1 is used (then it should presumably be batch size * expected max concurrency)?

Again, the buffers are decoupled from the in-memory batching, so the buffer size doesn't need to be related to the batch size. You can expect one batch to be created per unit of concurrency, though.

And one last thing: if a user has one source (e.g. Kafka) and multiple sinks (ES, S3) with different buffer and batch sizes, will the sink with the smaller buffer throttle the other one?

If the smaller buffer is full, yes, it will apply back-pressure before the larger buffer does. Again, though, batching is done in memory and is decoupled from buffering.

jszwedko avatar Apr 12 '24 20:04 jszwedko

@jszwedko thank you, that is what I vaguely remember from a Discord discussion some time ago. Still, I think it might be beneficial for the buffer to be larger than the batch size to have some read-ahead, depending on the source, correct? Is there some metric or log message that indicates when Vector applied back-pressure due to a slow sink? I don't remember seeing any log message for such an event 🤔

fpytloun avatar Apr 15 '24 07:04 fpytloun

@jszwedko thank you, that is what I vaguely remember from a Discord discussion some time ago. Still, I think it might be beneficial for the buffer to be larger than the batch size to have some read-ahead, depending on the source, correct? Is there some metric or log message that indicates when Vector applied back-pressure due to a slow sink? I don't remember seeing any log message for such an event 🤔

The utilization metric is the best one that currently exists for identifying back pressure.

Thinking about it a bit more: for in-memory buffers, I could see it being beneficial to have the buffer be at least as big as the batch size multiplied by the concurrency, so that the next set of requests can be buffered in memory while the current set is in flight.

For disk buffers, I think having the buffer be 2x that would be beneficial, since data isn't "deleted" from disk buffers until the sink delivers it.
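The sizing rule of thumb above reduces to simple arithmetic. This is a sketch of that heuristic only, not an official Vector recommendation, and the function name is made up for illustration.

```python
def recommended_buffer_events(batch_max_events: int, concurrency: int,
                              disk: bool = False) -> int:
    """Heuristic from the discussion above: an in-memory buffer should hold
    one batch per in-flight request; a disk buffer roughly twice that,
    since events stay in it until the sink acknowledges delivery."""
    base = batch_max_events * concurrency
    return 2 * base if disk else base

# e.g. batches of 1000 events with concurrency 4:
print(recommended_buffer_events(1000, 4))             # 4000
print(recommended_buffer_events(1000, 4, disk=True))  # 8000
```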

jszwedko avatar Apr 16 '24 18:04 jszwedko