elasticsearch icon indicating copy to clipboard operation
elasticsearch copied to clipboard

Adding aggregations support for the `_ignored` field

Open eyalkoren opened this issue 1 year ago β€’ 16 comments

Closes #59946

The actual implementation needs to take into account how we decide to implement #101153. One of the original requirements was to stop making the _ignored field stored while adding doc_values to it. However, the current proposal of #101153 assumes that we can access the original order of the field's content, which means we must keep it stored as well. The eventual approach will determine the implementation of this change.

eyalkoren avatar Oct 26 '23 11:10 eyalkoren

Documentation preview:

github-actions[bot] avatar Oct 26 '23 11:10 github-actions[bot]

Hi @eyalkoren, I've created a changelog YAML for you.

elasticsearchmachine avatar Oct 26 '23 11:10 elasticsearchmachine

@javanna please take a look to see if this is the right direction. Some specific input I'd be happy to get:

  • Only for the sake of #59946, we don't need to keep the _ignored field stored. However, for #101153 we may need it. What do you think we should do as part of this PR?
  • Can you answer @felixbarny's question? Can you provide an advice of how to create a test for that? EDIT: I proposed a test scenario, let me know whether you agree that it covers what we are looking for

eyalkoren avatar Nov 20 '23 15:11 eyalkoren

Only for the sake of https://github.com/elastic/elasticsearch/issues/59946, we don't need to keep the _ignored field stored. However, for https://github.com/elastic/elasticsearch/issues/101153 we may need it. What do you think we should do as part of this PR?

I am leaning towards removing the stored field, but I think this requires further discussion, because it is going to be very difficult to go back to relying on the order if we stop storing the field. We said we won't focus for now on the ignored reason, and I would add that we should try not to have positional logic around it when we do so. Could you make the changes necessary to retrieve the field from doc values and stop storing the field in the meantime?

Can you provide an advice of how to create a test for that? EDIT: I proposed a test scenario, let me know whether you agree that it covers what we are looking for

I am not sure that we need to test for that. We know we are going to have partial failures in that case.

javanna avatar Nov 24 '23 14:11 javanna

Thanks for the feedback @javanna πŸ™

Could you make the changes necessary to retrieve the field from doc values and stop storing the field in the meantime?

I will give it a try πŸ™‚

I am not sure that we need to test for that. We know we are going to have partial failures in that case.

OK, then I guess the next question is - what do we do about that? Only document it as a caveat related to aggregating on existing data streams? Something else?

eyalkoren avatar Nov 26 '23 17:11 eyalkoren

OK, then I guess the next question is - what do we do about that? Only document it as a caveat related to aggregating on existing data streams? Something else?

IMO, it would be enough to document that in the same place where you document that the field is supports aggregations now.

felixbarny avatar Nov 28 '23 08:11 felixbarny

@javanna the current state is non-functional, but I pushed so you can see what I have done so far. I'd be happy for advice on how I can debug the fetch phase and specifically the invocation of valueFetcher(). I didn't find a unit test that invokes it yet...

What I know is that the FetchDocValuesContext that comes from the FetchContext is null and that the doc values list that comes from the SearchSourceBuilder is also null. Not sure if this is any help...

eyalkoren avatar Nov 29 '23 16:11 eyalkoren

Having a look at this before starting to work on it and complete what is (maybe) missing.

My understanding is that in future we would like to enable synthetic source for logs too. Actually we would need to enable it sooner rather than later to avoid that introducing it later causes backward compatibility issues.

If the goal is to move logs to synthetic source we would not be able to keep the original order anyway for values of the _ignored field. Moreover, we would also lose duplicates (in my understanding this is not really an issue).

If keeping the order is not an issue I guess we can start not making the field stored. @javanna agree?

@felixbarny is this an issue? Why do we need to preserve ordering?

salvatore-campagna avatar Feb 05 '24 08:02 salvatore-campagna

I think it’s ok to not preserve the order.

felixbarny avatar Feb 05 '24 09:02 felixbarny

Agreed, let's remove the stored field and rely on doc_values for retrieval instead, with the known limitations.

javanna avatar Feb 05 '24 11:02 javanna

@javanna @felixbarny I am pushing a commit which makes the field 'stored' again because I would like to see if that makes the CI green.

My understanding is that metadata fields like _ignored need to be stored to be correctly returned in the response. If they are not they will be missing and YAML tests will fail. If the CI is green I will investigate further how the fetch phase works when dealing with metadata fields which are not stored.

salvatore-campagna avatar Feb 06 '24 14:02 salvatore-campagna

Almost there...after making the field stored (and including doc values) I see all tests green but the following one:

https://gradle-enterprise.elastic.co/s/aqxgf3itnzdvc/tests/:server:test/org.elasticsearch.index.mapper.IgnoredFieldMapperTests/testFetchIgnoredFieldValue

I need to have a look at the fetch phase and specifically how metadata fields are handled. Probably will need to move loading of the _ignored field from the StoredFieldsPhase to FetchDocValuesPhase.

salvatore-campagna avatar Feb 06 '24 15:02 salvatore-campagna

@salvatore-campagna you need to switch the field to b retrieved from doc_values instead of from stored fields. That is controlled by the valueFetcher method in the mapped field type impl. It currently uses a StoredValueFetcher, which you should be able to replace with a DocValueFetcher. We do need to handle bw comp though, because indices created before this change will need to still retrieve from stored fields.

javanna avatar Feb 06 '24 22:02 javanna

@salvatore-campagna you need to switch the field to b retrieved from doc_values instead of from stored fields. That is controlled by the valueFetcher method in the mapped field type impl. It currently uses a StoredValueFetcher, which you should be able to replace with a DocValueFetcher. We do need to handle bw comp though, because indices created before this change will need to still retrieve from stored fields.

@javanna as far as I understand we have two types in the mapper (to deal with backward compatibility), IgnoredFieldType and LegacyIgnoredFieldType the first one using a ValueFetcher which relies on doc values, the second one using a ValueFetcher relying on the field being stored.

See that I use a DocValueFetcher.

return new DocValueFetcher(docValueFormat(format, null), context.getForField(this, FielddataOperation.SEARCH));

My debugging shows that, for some reason, ValueFetcher is not used for metadata fields...and if that is the case I am not aware of the reason.

salvatore-campagna avatar Feb 07 '24 12:02 salvatore-campagna

Got it, I believe your assessment is correct. That is because metadata fields that are stored like _ignored or _source are fetched via a separate sub fetch phase, see https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/search/fetch/subphase/StoredFieldsPhase.java . The field needs to be removed from there, but under a version conditional. Next, can you check if fetching the field explicitly using the fields parameter of the search API gets you the value through the value fetcher? I suspect we will then need to figure out how to make it appear at the same level as _source in the search response by default, but pulling from doc_values (via value fetcher) and not from stored.

javanna avatar Feb 07 '24 19:02 javanna

Based on my investigation and after talking to @javanna fetching metadata fields is, at the moment expected to happen in StoredFieldPhase as a result of the field being stored, which has been the default so far. This means that making the field non-stored the field is not there when the StoredFieldPhase tries to load stored fields.

Anyway there are other phases loading data from doc values, such as FetchVersionPhase and SeqNoPrimaryTermsPhase, which I plan to "mimic" to be able to load the _ignored field from doc values. From a performance standpoint I do not see possible performance issues as doc ids are iterated on a per-segment basis which means we do not incur the penalty of loading (and decompressing block) data from sparse Lucene segments (we always load data from all documents in a segment before moving to the next segment).

salvatore-campagna avatar Feb 08 '24 15:02 salvatore-campagna

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine avatar Mar 07 '24 14:03 elasticsearchmachine

Pinging @elastic/es-storage-engine (Team:StorageEngine)

elasticsearchmachine avatar Mar 07 '24 14:03 elasticsearchmachine

@elasticsearchmachine run elasticsearch-ci/part-1 please

salvatore-campagna avatar Mar 08 '24 16:03 salvatore-campagna

buildkite test this

salvatore-campagna avatar Mar 11 '24 12:03 salvatore-campagna

I am stuck here...after spending 3 days on this I can't figure out a way to make this work fixing bwc tests...whenever I fix something I break something else.

salvatore-campagna avatar Mar 11 '24 12:03 salvatore-campagna

@salvatore-campagna I can look into the BWC tests.

dnhatn avatar Mar 11 '24 15:03 dnhatn

I think the issue might be with old indices created by versions before 8.14. Those indices still have the _ignored field as stored and we need to keep storing it as stored field to make sure that given an index the _ignored field has only stored fields (before 8.14) or doc value fields (from 8.14 on). If version 8.14 stops writing stored fields to an index that has been created by a version before 8.14 we end up with an index that has the field stored for some documents and not stored for all documents indexed by an 8.14 node.

salvatore-campagna avatar Mar 12 '24 13:03 salvatore-campagna

buildkite test this

salvatore-campagna avatar Mar 13 '24 09:03 salvatore-campagna

Ideally we should stop using skip in yaml tests for things that require versions above 8.14 and move to using cluster features. Anyway I see there are more yaml tests still using skip and I would prefer to handle those in a separate PR so that we apply changes to all yaml tests where that is required and avoid adding more stuff to this PR.

salvatore-campagna avatar Apr 24 '24 12:04 salvatore-campagna