elasticsearch Include document size information in ingest stats

Ingest pipelines can change the size of the ingested documents, sometimes substantially. It'd be awfully useful if the ingest component of nodes stats included information about the total number of bytes received by each ingest pipeline and the total size of the resulting documents.

Mar 15 '24 14:03 DaveCTurner

Pinging @elastic/es-data-management (Team:Data Management)

Mar 15 '24 14:03 elasticsearchmachine

Hi @DaveCTurner, I was looking into trying to implement this, but had one question/concern:

It seems like the size of the documents ingested and produced should be calculated around where the other ingest stats are set (for example, for Pipeline it looks like that's here). The next step would probably be to estimate the size of the source from the ingestDocument. The source there is an IngestCtxMap, but there doesn't seem to be a method for calculating its size in bytes, which means we might need to loop through each entry in the CtxMap and calculate its size that way, and that seems like it might be expensive. Should I go ahead with that implementation, or do you have any thoughts about it?
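For illustration, the per-entry traversal described above could look roughly like the sketch below. This is not Elasticsearch code: `SourceSizeEstimator` and `estimateBytes` are hypothetical names, the document is modeled as plain `Map`/`Iterable`/scalar values rather than an actual IngestCtxMap, and the result only approximates the compact JSON size (string escaping is ignored). It has to visit every value once, which is where the cost concern comes from.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper (not part of Elasticsearch): walks a document tree and
// sums an approximate compact-JSON byte size for every key and value.
final class SourceSizeEstimator {
    private SourceSizeEstimator() {}

    static long estimateBytes(Object value) {
        if (value == null) {
            return 4; // the literal "null"
        }
        if (value instanceof Map) {
            long total = 2; // opening and closing braces
            int entries = 0;
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                total += stringBytes(String.valueOf(entry.getKey())) + 1; // quoted key plus ':'
                total += estimateBytes(entry.getValue());
                entries++;
            }
            return total + Math.max(0, entries - 1); // commas between entries
        }
        if (value instanceof Iterable) {
            long total = 2; // opening and closing brackets
            int items = 0;
            for (Object item : (Iterable<?>) value) {
                total += estimateBytes(item);
                items++;
            }
            return total + Math.max(0, items - 1); // commas between items
        }
        if (value instanceof CharSequence) {
            return stringBytes(value.toString());
        }
        // Numbers, booleans and anything else: length of the string form (rough).
        return String.valueOf(value).length();
    }

    // UTF-8 length plus two quote characters; escaping is deliberately ignored.
    private static long stringBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length + 2L;
    }
}
```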

Mar 28 '24 07:03 limotova

I haven't really thought about how one might implement this, but I would expect that we already know the actual size of the buffers we allocate to hold the incoming document, and the result of the pipeline. I also expect these numbers aren't exposed exactly where they're needed today, but doing a bit of plumbing to fix that seems preferable to trying to estimate them as you suggest.

FWIW these numbers are effectively the same as the ones we need to solve https://github.com/elastic/elasticsearch/issues/97819.

Mar 29 '24 00:03 DaveCTurner

Hi, I think I found one place where this stat is accessible (via IndexRequest), and fetching it before and after running the request is very straightforward (I think it would be easiest in this method). However, the source in the IndexRequest only gets updated after the entire request has finished running, so this wouldn't work for per-pipeline or per-processor stats, only for the entire node. Would this be OK?
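To make the limitation concrete, here is a minimal sketch of that before/after measurement. All names are hypothetical (`IngestSizeProbe`, `Sizes`, `measure`), and the whole ingest run is modeled as an opaque function from source bytes to source bytes, which is an assumption rather than how IndexRequest actually exposes it. Because the probe only sees the bytes before and after the complete run, it cannot say which pipeline or processor caused the change.

```java
import java.util.function.UnaryOperator;

// Hypothetical probe (not Elasticsearch code): compares the raw source size
// before and after the complete ingest run for a single document.
final class IngestSizeProbe {
    private IngestSizeProbe() {}

    static final class Sizes {
        final long bytesIn;
        final long bytesOut;

        Sizes(long bytesIn, long bytesOut) {
            this.bytesIn = bytesIn;
            this.bytesOut = bytesOut;
        }

        // Positive when ingest made the document larger, negative when smaller.
        long delta() {
            return bytesOut - bytesIn;
        }
    }

    // wholeIngestRun stands in for "run every pipeline and processor";
    // intermediate sizes are invisible from here.
    static Sizes measure(byte[] source, UnaryOperator<byte[]> wholeIngestRun) {
        long bytesIn = source.length;
        byte[] result = wholeIngestRun.apply(source);
        return new Sizes(bytesIn, result.length);
    }
}
```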

Mar 29 '24 19:03 limotova

Hmm yeah I see, we're manipulating the doc as a tree of objects rather than as its raw bytes, so we can only really measure the difference in size caused by the overall ingest pipeline process.

I think aggregating this at the node level will be too coarse to be useful. Although pipelines can chain themselves together, there's only one pipeline (plus maybe a final_pipeline) to start with. I think it'd work well to attribute the growth in doc size to that first pipeline name, since it's this pipeline which chooses what other pipelines to run later. Or maybe a string like [pipeline_name][final_pipeline_name]. I'm looking to @elastic/es-data-management for confirmation or alternative suggestions tho.
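As a rough sketch of that attribution scheme: keep two counters per first-pipeline name and add the whole-request byte counts to them. `PipelineByteStats` and its methods are hypothetical names, not an existing Elasticsearch class, and the key could just as well be a combined string such as [pipeline_name][final_pipeline_name].

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical accumulator (not Elasticsearch code): attributes the bytes
// received and produced by the whole ingest run to the first pipeline's name.
final class PipelineByteStats {
    private final Map<String, LongAdder> received = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> produced = new ConcurrentHashMap<>();

    // firstPipeline is the pipeline named by the indexing request (or its default).
    void record(String firstPipeline, long bytesIn, long bytesOut) {
        received.computeIfAbsent(firstPipeline, k -> new LongAdder()).add(bytesIn);
        produced.computeIfAbsent(firstPipeline, k -> new LongAdder()).add(bytesOut);
    }

    long receivedBytes(String pipeline) {
        LongAdder adder = received.get(pipeline);
        return adder == null ? 0L : adder.sum();
    }

    long producedBytes(String pipeline) {
        LongAdder adder = produced.get(pipeline);
        return adder == null ? 0L : adder.sum();
    }
}
```

`LongAdder` is used here on the assumption that many indexing threads update the same pipeline's counters concurrently; a plain `AtomicLong` would also work.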

Mar 29 '24 19:03 DaveCTurner

> I think it'd work well to attribute the growth in doc size to that first pipeline name, since it's this pipeline which chooses what other pipelines to run later. Or maybe a string like [pipeline_name][final_pipeline_name].

I think using the first pipeline name is a reasonable compromise here. It may not be as useful with high reroute usage, but it would still give an entry point for diagnosing things.

Mar 29 '24 22:03 dakrone

Let's say there are N pipelines that process a given document, and then it ends up in some index, like this:

pipeline 1 --> pipeline 2 --> ... --> pipeline N --> index X

It seems unfair to me to allocate all the byte deltas to pipeline 1, especially if it's just some dispatcher pipeline. Alternatively, we could keep this in the stats associated with index X. It's not ideal (we'd love to track the bytes entering and exiting every pipeline), but at least this would be equally unfair to all pipelines (none of them record this information) rather than singling out the first pipeline and calling it special.

Apr 26 '24 19:04 joegallo

Yeah it's not ideal, but note that the first pipeline is the one which is specified in the indexing request (subject to defaults etc) and the one which chooses the subsequent pipelines, so IMO it makes sense to consider it responsible for its effects. From a supportability perspective, we want to be able to help users understand and control the difference between the size of data they're sending to ES and the size of the data it's indexing. If we answered those questions by pointing them at a specific index then they'd have to do some detective work to work out the corresponding pipeline, and it may be that different clients are indexing using different pipelines into the same index. Conversely, if we tell them the pipeline then they skip those steps, and there's no ambiguity in the multiple-pipelines-in-use case.

Apr 30 '24 09:04 DaveCTurner