elasticsearch Include doc size info in ingest stats

This commit adds stats to the pipeline portion of ingest stats to show how many bytes were consumed and produced by a pipeline as the first pipeline in a chain of pipelines.

Closes #106386

Apr 08 '24 20:04 limotova

Pinging @elastic/es-data-management (Team:Data Management)

Apr 08 '24 22:04 elasticsearchmachine

Hey @limotova, not sure if you saw this comment on the attached issue, but it suggests an alternative way to aggregate the ingest byte stats. We are going to discuss the alternatives later this week and I'll get back to you regarding how it affects this PR.

Apr 29 '24 20:04 parkertimmins

We need to figure out how this behaves in the failure and drop cases.

May 02 '24 15:05 joegallo

@joegallo if there's a failure then there's no change in bytesProduced (bytesIngested still increases). But I'm not sure I understand what you mean by drop?

May 02 '24 18:05 limotova

@limotova That seems correct behavior for failures, since we still want to record the ingested bytes even if there are no produced bytes. Though I'd like to get @joegallo 's opinion. I believe Joe was talking about the drop processor. I just tested it, and the produced bytes is the same as the ingested bytes. Meaning the postIngest endpoint is still getting called, despite the document getting dropped. It seems that we should make this behave the way that it currently behaves for failures.

Just to clarify, this is the behavior I see for failures and drop, for a 100 bytes document:

drop
- ingested: 100
- produced: 100
failure
- ingested: 100
- produced: 0

May 02 '24 18:05 parkertimmins

If we do want drops to behave like failures, it look like we can just move the postIngest call into an else statement of the preceding if clause

May 02 '24 18:05 parkertimmins

Hey @limotova, thanks for continuing to work on this! There's been lots of discussion over whether there's a better way overall to emit these stats since allocating bytes to the "first" pipeline is a bit confusing. The consensus is that aggregating by index wouldn't be as useful as using the first pipeline. So we're moving ahead with your design.

I've learned a bunch in these discussions, so have several more suggested changes; after these, we'll be good to go. I'll add some more comments to the code shortly, but a few suggestions don't relate to the already changed files:

We'll need to update docs with the new fields, looks like there are two locations that need to be updated:
- https://github.com/elastic/elasticsearch/blob/main/docs/reference/cluster/nodes-stats.asciidoc?plain=1#L2645
- https://github.com/elastic/elasticsearch/blob/main/docs/reference/cluster/cluster-info.asciidoc?plain=1#L209
- Both fields should be listed as optional
- We'll want to be very clear with the explanation since attributing the first pipeline is confusing. I'll work on some phrasing and get back to you with this.
Add yaml tests
- Can add test here: https://github.com/elastic/elasticsearch/blob/fb84a22a754eca18d23d1375888d1469dfc1d984/modules/ingest-common/src/yamlRestTest/resources/rest-api-spec/test/ingest/15_info_ingest.yml#L67
- On that test can just assert that the new field values are greater than 0
- It would be nice to add some new tests for the following:
  - test that the drop processor only increments ingested and not produced
  - test that a failure only increments ingested and not produced (one easy way to make an failure is to make a pipeline processor to a pipeline that doesn't exist)
  - test that other pipelines do have the added fields. Could use a default and final pipeline, and verify that only the final default pipeline has the new fields
  - I have trouble sometimes running specific yaml test files, so just run all within a modules, which doesn't take too long: ./gradlew :modules:ingest-common:yamlRestTest --tests "org.elasticsearch.ingest.common.IngestCommonClientYamlTestSuiteIT"
- Yaml test are kind of tricky, but once are pretty cool once you are familiar with the pattern. Let me know if anything is tripping you up regarding the yaml tests.

May 07 '24 20:05 parkertimmins

Hi @parkertimmins, just had a few questions:

Both fields should be listed as optional

I'm not quite sure what you mean by this; since the stats are a response and not a request, the fields should show up on all pipelines, but they'll be set to 0 on pipelines that haven't been used as the first pipeline. Do we want them to not show up at all when they're both 0?

test that a failure only increments ingested and not produced (one easy way to make an failure is to make a pipeline processor to a pipeline that doesn't exist)

This is actually already tested here, should I still make this into a yaml test?

test that other pipelines do have the added fields. Could use a default and final pipeline, and verify that only the final default pipeline has the new fields

I'm not sure I fully understand, did you mean "verify that only the ~final~ default pipeline has the new fields"?

Also should I add the other yaml tests as new files? And add the drop processor test as a unit test?

May 08 '24 05:05 limotova

Good questions, some of what I said didn't make a whole lot of sense.

I'm not quite sure what you mean by this; since the stats are a response and not a request, the fields should show up on all pipelines, but they'll be set to 0 on pipelines that haven't been used as the first pipeline. Do we want them to not show up at all when they're both 0?

Yes, we should not show the fields at all if they are both 0, as we want to only include these fields for the "first" pipeline. The concern is that if we include them but leave them as 0 for other pipelines, it will give the impression that the pipeline is not being used. The term "optional" is a bit weird for response parameters, but I see that we use it elsewhere for response parameters that are not always present, so let's do that in the api docs. Eg "(Optional, string)" for response status field on https://www.elastic.co/guide/en/elasticsearch/reference/current/health-api.html

This is actually already tested here, should I still make this into a yaml test?

Good point, in that case, I'd say don't add a yaml test for the failure case. And yes, agree that it's better to make a drop processor unit test rather than yaml test. For the drop processor unit test, just copy the lines for the failure test case and paste them right below the failure test, changing the processor to a drop processor (I think that should work, though I'm not 100% sure that all the stats should behave the same if a document drops vs when it fails)

I'm not sure I fully understand, did you mean "verify that only the final default pipeline has the new fields"?

Sorry, I think I left out some words there! We want to index documents with both a final pipeline and a default pipeline (so we can verify that the default pipeline has the new stats, and the final pipeline does not). These are set in an index settings as default_pipeline and final_pipeline. In the existing yaml test in that file, the index is created implicitly when it documents are inserted with a bulk request. Instead, we can create the index first and set default_pipeline and final_pipeline, similar to here . The existing test only creates one pipeline, but we'll need a second one for the final pipeline.

Also, it's worth mentioning that rather than setting a default pipeline, the existing test case specifies which pipeline to use in the bulk request: '{"create": {"pipeline" : "ingest_info_pipeline"}}'. Since you'll set this with the default_pipeline setting, you'll want to remove the pipeline from the bulk request, for example change it to: '{"create": {}}'

Also should I add the other yaml tests as new files? And add the drop processor test as a unit test?

You can put the yaml test in the same file. Everything from this line down is a single test case. You can copy it and paste it below to modify for our use case.

May 08 '24 16:05 parkertimmins

@elasticmachine test this please

May 10 '24 15:05 parkertimmins

@limotova Thanks for making those changes. I'm still reviewing them, but overall they look great. Also, I still need to get back to you about the wording for the docs. I just learned that I needed to manually run the test suite (see my previous comment). It looks like backwards compatibility tests are failing ... likely just due to the conflict in TransportVersions.java. Could you merge the current main back into your feature branch? (don't worry about rebasing or anything like that, since we'll squash all the commits in the feature branch before merging the feature into main)

May 10 '24 15:05 parkertimmins

Another thing to make the CI build happy: could you add a file called docs/changelog/107240.yaml with the contents:

pr: 107240
summary: Include doc size info in ingest stats
area: Ingest Node
type: enhancement
issues:
 - 106386

May 10 '24 16:05 parkertimmins

@parkertimmins I merged and added the changelog! Had to make one extra adjustment as a result of merging and I also realized I missed a test that needed to be changed so I made those fixes as well

May 10 '24 22:05 limotova

@limotova Sounds good, thanks!

May 10 '24 23:05 parkertimmins

@elasticmachine test this please

May 10 '24 23:05 parkertimmins

@elasticsearchmachine test this please

May 15 '24 14:05 parkertimmins

@elasticsearchmachine test this please

May 16 '24 22:05 parkertimmins

@elasticsearchmachine test this please

May 17 '24 01:05 parkertimmins

elasticsearch elasticsearch copied to clipboard

Include doc size info in ingest stats

elasticsearch
elasticsearch copied to clipboard