Track size of emitted batches in a monitoring metric
Right now there's no visibility into how the pipeline.batch.size and pipeline.batch.interval parameters influence the performance of the pipeline.
It would be useful to understand if, for example, with the batch size set to 1000, the inputs are generating 1000 events before the interval triggers and pushes an incomplete batch into filters+outputs, or if most batches end up being 200 events or less.
This means it would be nice to have a sense of how often batches are fully filled and how often they are not. Also, if batches aren't getting maxed out, it would be interesting to know what their size is.
So maybe Logstash should expose metrics such as, for each time interval (like the one used for metric snapshots):
- the average or mean batch size
- the minimum and maximum sizes of a batch observed in that period
- the standard deviation of the batch size
Most of these metrics would require keeping the batch size count for all batches in that time period, which may be too heavy. Any suggestions on other ways of measuring this?
@jsvd I think this should be pretty cheap to measure, you only need one or two longs tops for each right?
> the average or mean batch size
Here you only need one long to count the number of batches, the number of events we already track I think and just divide right?
> the minimum and maximum sizes of a batch observed in that period
2 ints (one for min and one for max) that get atomically updated should do it here; this should also be cheap?
> the standard deviation of the batch size
This would require 2 longs I think. If we already have the overall event count (S1), we can simply track the sum of the squares of all batch sizes (S2) and the number of batches processed (N), and calculate the standard deviation from that via sqrt(N * S2 - S1 ^ 2) / N, right?
Hope this helps :)
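The running-sums approach above can be sketched as follows. The batch sizes are made-up sample values; this only demonstrates that the two accumulators (plus a batch counter) reproduce a direct population standard deviation:

```ruby
# Two running sums plus a counter are enough for the standard deviation.
sizes = [125, 200, 1000, 180, 950]   # made-up batch sizes
n  = sizes.size                      # N:  number of batches
s1 = sizes.sum                       # S1: overall event count
s2 = sizes.sum { |s| s * s }         # S2: sum of squared batch sizes
stddev = Math.sqrt(n * s2 - s1**2) / n

# Direct population standard deviation, for comparison.
mean   = s1.to_f / n
direct = Math.sqrt(sizes.sum { |s| (s - mean)**2 } / n)
```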
> Here you only need one long to count the number of batches, the number of events we already track I think and just divide right?
Yep, that solves the average, but not the median; though maybe the median isn't that necessary/meaningful here.
> This would require 12long I think
s/12/2, right?
Another way we could track this is using percentage full instead of batch sizes: tracking (event count) / (number of batches * batch.size), so that:

```ruby
jruby-1.7.26 :032 > arr
 => [10, 10, 10, 10, 9, 10]
jruby-1.7.26 :033 > batch_size = 10
 => 10
jruby-1.7.26 :034 > arr.inject(0, &:+) / (batch_size.to_f * arr.size)
 => 0.9833333333333333
```
A good thing about percentages is that they aren't influenced by the parametrization of the batch size.
@jsvd
> s/12/2, right?
Yup, fixed :)
> Good thing about percentages is not being influenced by parametrization of the batch size
This is a neat point that also applies to the question of whether or not mean gets us anything here :)
Why don't we just track buckets of fullness:
100% full, <75%, <50% and <25%. That should give us all the knowledge we need, shouldn't it? (Obviously we can change the actual numbers; maybe 90% or so will turn out to be more meaningful.)
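As a rough sketch of that bucket idea (the thresholds, bucket names, and method name are placeholders, not an agreed design):

```ruby
# Count how many batches fall into each fullness bucket.
# Buckets mirror the suggestion above: full, >=75%, >=50%, >=25%, <25%.
def bucket_fill_ratios(fill_ratios)
  counts = Hash.new(0)
  fill_ratios.each do |ratio|
    bucket =
      if    ratio >= 1.00 then :full
      elsif ratio >= 0.75 then :ge_75
      elsif ratio >= 0.50 then :ge_50
      elsif ratio >= 0.25 then :ge_25
      else  :lt_25
      end
    counts[bucket] += 1
  end
  counts
end

# Fill ratios computed as (events in batch) / batch.size:
bucket_fill_ratios([1.0, 0.9, 0.55, 0.2])
# => {:full=>1, :ge_75=>1, :ge_50=>1, :lt_25=>1}
```

Four counters like these cost a handful of longs in total, in line with the cheap-to-measure goal discussed earlier.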
@original-brownbear Sorry for jumping in, but I'm slightly worried about situations where actual batch size might be completely different from what you expect it to be due to misconfig or misbehaviour. I think I've noticed something like that in https://github.com/elastic/logstash/issues/7243
So using only a percentage representation will not necessarily give full/proper information here.
@shoggeh no worries :)
I think my suggestion around percentages was just to get a picture of the distribution of batch sizes.
Are you worried they will exceed the configured maximum size and we'll miss that? It seems like any other case would become very visible if we count which percentile buckets the lengths fall into?
This should totally be a thing. Measuring the size of the batches, in both document count and size in bytes, is also crucial for tuning throughput between the output plugin and the target. In many cases, you want to cram as big a batch as you can into the timeout window your output plugin might be constrained to (the Elasticsearch bulk API, for example). Without a metric for the size of the batches, it's a guessing game whether each batch you send will be over or under your output's timeout window.
Key points of the problem
Given that we want to collect the size of the batch, in terms of document count and byte size of all documents, in a statistical way, some problems have to be defined:
- when measurements have to be taken
- how to store the statistical data
- how to measure the byte size of a batch of events
1. When measurements have to be taken
Logstash pipelines can, inside the filter section, modify the number and size of events: for example, cloning events in a batch increases the total size of the batch, as does enriching events with additional fields. Since the goal of these metrics is to understand how the batch size and poll interval influence how full batches are, the meaningful place to take the measurement is at batch creation. While the event count can easily be taken at batch creation, computing the byte size would require iterating over each event and computing its size. To avoid spending time on the read side of the queue (i.e., filters and outputs CPU time), the byte size can be pre-calculated when the event is inserted into the queue.
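A minimal sketch of the write-side pre-calculation idea; the `Event` struct, the `to_s`-based size estimate, and the helper names are illustrative, not Logstash's actual classes:

```ruby
# Illustrative only: cache a byte-size estimate on each event at enqueue time,
# so the batch reader just sums cached values instead of traversing events.
Event = Struct.new(:data, :byte_size)

def enqueue(queue, data)
  # Size is estimated once, on the input (write) side of the queue.
  queue << Event.new(data, data.to_s.bytesize)
end

def batch_byte_size(batch)
  # Cheap on the read side: no per-event traversal steals filter/output CPU.
  batch.sum(&:byte_size)
end

queue = []
enqueue(queue, "a" * 100)
enqueue(queue, "b" * 50)
batch_byte_size(queue)  # => 150
```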
2. How to store the statistical data
Given that this is a statistical analysis of series of data (document count per batch and byte size of the batch itself), where we want to know the mean and some percentiles, HdrHistogram could come to the rescue. HdrHistogram needs two pieces of data: the maximum value to store and the precision. While for document count the maximum is the batch size configured in the settings, the size in bytes is unbounded, but we can cap it at the max heap size.
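HdrHistogram is a Java library; as a language-neutral illustration of the interface it offers (record raw values, then query statistics), here is a naive stand-in that keeps every sample. The real library answers the same queries in constant, configurable memory, which is the whole point of using it:

```ruby
# Naive illustration of the record/query interface HdrHistogram provides.
# Unlike HdrHistogram, this keeps every sample, so memory grows with input.
class NaiveHistogram
  def initialize
    @samples = []
  end

  def record(value)
    @samples << value
  end

  def mean
    @samples.sum.to_f / @samples.size
  end

  def value_at_percentile(pct)
    sorted = @samples.sort
    index = [(pct / 100.0 * sorted.size).ceil - 1, 0].max
    sorted[index]
  end
end

histogram = NaiveHistogram.new
[125, 200, 1000, 180, 950].each { |size| histogram.record(size) }
histogram.value_at_percentile(50)  # => 200 (median batch size)
histogram.value_at_percentile(95)  # => 1000
```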
3. How to measure the byte size of a batch of events
From an abstract point of view, calculating the size of an event, given that it's a map which could nest other maps, could be done by recursive iteration, computing key and value sizes. However, the Logstash Event has two fields which contain data, data and metadata; both are instances of ConvertedMap, which means that the first 1000 keys (strings) are interned, so a key is just a reference to a sort of static string and doesn't effectively add to the total size. This is true only for the first layer of data, not for maps nested in event fields.
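A sketch of that recursive traversal, ignoring key interning and JVM object headers; the per-scalar byte counts are rough assumptions, not JVM-accurate:

```ruby
# Recursively estimate the byte size of a nested event-like map.
# Numeric scalars are counted as 8 bytes; object headers are ignored.
def estimate_size(value)
  case value
  when Hash
    value.sum { |k, v| k.to_s.bytesize + estimate_size(v) }
  when Array
    value.sum { |v| estimate_size(v) }
  when String
    value.bytesize
  when Numeric
    8
  else
    0
  end
end

event = { "message" => "hello", "meta" => { "port" => 514 } }
estimate_size(event)  # => 28 ("message" 7 + "hello" 5 + "meta" 4 + "port" 4 + 8)
```

Accounting for interned first-level keys would mean skipping the `k.to_s.bytesize` term for the outer map only.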
Another item to keep in mind when calculating the size of a Java object is the JVM object header. If we want to account for that, we have to know more details about the JVM that's running and how it's configured. That would provide a more accurate result, but maybe isn't worth the effort and complexity it carries.
A final thought on size calculation: it can be done by iterating over a batch and computing the size lazily for each event, but that steals CPU time from the filters and outputs sections. Alternatively, the size can be computed upfront, when an event is inserted into the queue (both in-memory and persisted).
A few items we should scope:
- the set of metrics we'd like to grab and their description, something like this:
| Metric Name | Unit | Description |
|---|---|---|
| batch.event_count.current | events | Number of events in the most recently processed batch |
| batch.event_count.lifetime | events | Average events per batch since Logstash start |
| batch.event_count.average.last_1_minute | events | Average events per batch over the last 1 minute |
| batch.event_count.average.last_5_minutes | events | Average events per batch over the last 5 minutes |
| batch.event_count.average.last_15_minutes | events | Average events per batch over the last 15 minutes |
| batch.event_count.p50.last_1_minute | events | 50th percentile (median) batch size over the last 1 minute |
| batch.event_count.p50.last_5_minutes | events | 50th percentile batch size over the last 5 minutes |
| batch.event_count.p50.last_15_minutes | events | 50th percentile batch size over the last 15 minutes |
| batch.event_count.p95.last_1_minute | events | 95th percentile batch size over the last 1 minute |
| batch.event_count.p95.last_5_minutes | events | 95th percentile batch size over the last 5 minutes |
| batch.event_count.p95.last_15_minutes | events | 95th percentile batch size over the last 15 minutes |
| batch.byte_size.current | bytes | Estimated byte size of the most recent batch |
| batch.byte_size.lifetime | bytes | Average batch byte size since Logstash start |
| batch.byte_size.average.last_1_minute | bytes | Average batch size over the last 1 minute |
| batch.byte_size.average.last_5_minutes | bytes | Average batch size over the last 5 minutes |
| batch.byte_size.average.last_15_minutes | bytes | Average batch size over the last 15 minutes |
| batch.byte_size.p50.last_1_minute | bytes | 50th percentile of batch size over the last 1 minute |
| batch.byte_size.p50.last_5_minutes | bytes | 50th percentile of batch size over the last 5 minutes |
| batch.byte_size.p50.last_15_minutes | bytes | 50th percentile of batch size over the last 15 minutes |
| batch.byte_size.p95.last_1_minute | bytes | 95th percentile of batch size over the last 1 minute |
| batch.byte_size.p95.last_5_minutes | bytes | 95th percentile of batch size over the last 5 minutes |
| batch.byte_size.p95.last_15_minutes | bytes | 95th percentile of batch size over the last 15 minutes |
- dig a bit more into the storing strategy: the HdrHistogram mention without context or alternative begs more details. For example: could or should we measure per pipeline worker and aggregate when emitting the metrics to the metric store?
- the gradual size estimation vs one-shot estimation tradeoff isn't clear, especially when considering the size and event-complexity axes. Anything we could do to get a better sense of which direction we should take?
> dig a bit more into the storing strategy: the HdrHistogram mention without context or alternative begs more details. For example: could or should we measure per pipeline worker and aggregate when emitting the metrics to the metric store?
HdrHistogram is just a tool to store those measures and grab statistical data without needing us to implement some smart way to store a huge number of samples; HdrHistogram does that well out of the box. In particular, it's able to keep constant and foreseeable memory consumption, independently of the number of measures we submit. Regarding the aggregation, every worker independently and serially (as in the current implementation) extracts a batch from the queue, so that's the point where the metrics are computed and inserted into the store.
> the gradual size estimation vs 1shot
It's not clear to me what you mean by gradual and one-shot estimation. Each event has to be individually analysed to compute its size, which could be done in the queue-insertion phase, burning CPU on the inputs side, or on the queue-reader side, which burns CPU on the filters side. How much CPU % is consumed is not yet clear and depends on how we do the estimation.
> the gradual size estimation vs 1shot
One way would be to add an event metadata field to each event and keep updating it as the event is created/modified by the inputs. Another is to compute the size when the metrics are taken for the batch (requiring full event iteration). Also, we could compute things differently between the memory and persistent queues. For example, in the persistent queue we have access to the byte-array representation of the events, so we could use this size as an approximation of the event size (how precise would it be, can we estimate it?).
> Regarding the aggregation, every worker independently and serially (as the current implementations) extracts a batch from the queue, so it's that the point where the metrics are computed and inserted into the store.
I was just considering if we could decouple the metric computation (done per thread) from inserting into the store, given we will be computing about 20 or so more metrics: {bytes, event_count} x {avg, p50, p95, p99} x {current, lifetime, 1, 5, 15 min}. If the performance impact is not significant, this is not necessary.
> I was just considering if we could decouple the metric computation...
My fault, I took too much for granted. When using HdrHistogram, we just pass the raw measurements to it, and the percentiles are calculated by the storage itself. Ideally, for the full list of metrics you listed, we need 10 HdrHistogram instances, |{bytes, event_count}| x |{current, lifetime, 1, 5, 15 min}|, each providing the avg, p50, p95 and p99 values. The library I proposed has a recording operation time in the order of nanoseconds, so I don't think it's necessary to have the splitting and aggregation. The percentiles computation happens when the metrics are exposed, through the usual metrics infrastructure we have in LS.
However, this is just a proposal; we can always use something else or go with a custom-built solution. In that case it may be worth measuring the insertion and computation times, and if they are too high, thinking about splitting the computation per thread with a consequent aggregation.
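To make the instance count concrete, a quick sketch (the labels are just illustrative strings): one histogram per (unit, window) pair, each able to serve all of the avg/p50/p95/p99 values for that pair.

```ruby
# One histogram instance per (unit, window) combination.
units   = ["bytes", "event_count"]
windows = ["current", "lifetime", "1m", "5m", "15m"]

histogram_keys = units.product(windows)
histogram_keys.size   # => 10
histogram_keys.first  # => ["bytes", "current"]
```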
> we could use this size as an approximation of the event size (how precise would it be, can we estimate it?)
That serialisation format is CBOR, so I'll set up a test that measures the difference in various cases, like deep nested maps vs wide ones.
Hi @jsvd, in issue #17736 and study PR #17758 some questions were posed about the accuracy vs performance of 3 different size-computation techniques:
- using CBOR serialization.
- using the JOL library to compute the retained size of the object graph rooted in the event instance.
- using navigation of ConvertedMaps.
The outcomes are that JOL, when used to compute retained size, also considers a big chunk of the JRuby runtime, which biases the measure. CBOR produces values comparable with the custom navigation of maps as long as the event doesn't have a deep structure with small values; this is mainly due to the fact that in ConvertedMaps the keys are interned values. On the performance side, navigation of ConvertedMaps is more performant than CBOR and JOL; ~~but it's 7000 evt/sec, not rocket speed~~ CBOR and ConvertedMaps navigation are in the order of millions of ops per second when processing almost-real-world use cases with messages sized 4KB.