elasticsearch
elasticsearch copied to clipboard
Adding aggregations support for the `_ignored` field
Closes #59946
The actual implementation needs to take into account how we decide to implement #101153.
One of the original requirements was to stop making the _ignored
field stored while adding doc_values
to it. However, the current proposal of #101153 assumes that we can access the original order of the field's content, which means we must keep it stored as well. The eventual approach will determine the implementation of this change.
Hi @eyalkoren, I've created a changelog YAML for you.
@javanna please take a look to see if this is the right direction. Some specific input I'd be happy to get:
- Only for the sake of #59946, we don't need to keep the
_ignored
field stored. However, for #101153 we may need it. What do you think we should do as part of this PR? - Can you answer @felixbarny's question? Can you provide an advice of how to create a test for that? EDIT: I proposed a test scenario, let me know whether you agree that it covers what we are looking for
Only for the sake of https://github.com/elastic/elasticsearch/issues/59946, we don't need to keep the _ignored field stored. However, for https://github.com/elastic/elasticsearch/issues/101153 we may need it. What do you think we should do as part of this PR?
I am leaning towards removing the stored field, but I think this requires further discussion, because it is going to be very difficult to go back to relying on the order if we stop storing the field. We said we won't focus for now on the ignored reason, and I would add that we should try not to have positional logic around it when we do so. Could you make the changes necessary to retrieve the field from doc values and stop storing the field in the meantime?
Can you provide an advice of how to create a test for that? EDIT: I proposed a test scenario, let me know whether you agree that it covers what we are looking for
I am not sure that we need to test for that. We know we are going to have partial failures in that case.
Thanks for the feedback @javanna π
Could you make the changes necessary to retrieve the field from doc values and stop storing the field in the meantime?
I will give it a try π
I am not sure that we need to test for that. We know we are going to have partial failures in that case.
OK, then I guess the next question is - what do we do about that? Only document it as a caveat related to aggregating on existing data streams? Something else?
OK, then I guess the next question is - what do we do about that? Only document it as a caveat related to aggregating on existing data streams? Something else?
IMO, it would be enough to document that in the same place where you document that the field is supports aggregations now.
@javanna the current state is non-functional, but I pushed so you can see what I have done so far.
I'd be happy for advice on how I can debug the fetch phase and specifically the invocation of valueFetcher()
. I didn't find a unit test that invokes it yet...
What I know is that the FetchDocValuesContext
that comes from the FetchContext
is null
and that the doc values list that comes from the SearchSourceBuilder
is also null
. Not sure if this is any help...
Having a look at this before starting to work on it and complete what is (maybe) missing.
My understanding is that in future we would like to enable synthetic source for logs too. Actually we would need to enable it sooner rather than later to avoid that introducing it later causes backward compatibility issues.
If the goal is to move logs to synthetic source we would not be able to keep the original order anyway for values of the _ignored
field. Moreover, we would also lose duplicates (in my understanding this is not really an issue).
If keeping the order is not an issue I guess we can start not making the field stored. @javanna agree?
@felixbarny is this an issue? Why do we need to preserve ordering?
I think itβs ok to not preserve the order.
Agreed, let's remove the stored field and rely on doc_values for retrieval instead, with the known limitations.
@javanna @felixbarny I am pushing a commit which makes the field 'stored' again because I would like to see if that makes the CI green.
My understanding is that metadata fields like _ignored
need to be stored to be correctly returned in the response. If they are not they will be missing and YAML tests will fail. If the CI is green I will investigate further how the fetch phase works when dealing with metadata fields which are not stored.
Almost there...after making the field stored (and including doc values) I see all tests green but the following one:
https://gradle-enterprise.elastic.co/s/aqxgf3itnzdvc/tests/:server:test/org.elasticsearch.index.mapper.IgnoredFieldMapperTests/testFetchIgnoredFieldValue
I need to have a look at the fetch phase and specifically how metadata fields are handled. Probably will need to move loading of the _ignored
field from the StoredFieldsPhase
to FetchDocValuesPhase
.
@salvatore-campagna you need to switch the field to b retrieved from doc_values instead of from stored fields. That is controlled by the valueFetcher
method in the mapped field type impl. It currently uses a StoredValueFetcher,
which you should be able to replace with a DocValueFetcher
. We do need to handle bw comp though, because indices created before this change will need to still retrieve from stored fields.
@salvatore-campagna you need to switch the field to b retrieved from doc_values instead of from stored fields. That is controlled by the
valueFetcher
method in the mapped field type impl. It currently uses aStoredValueFetcher,
which you should be able to replace with aDocValueFetcher
. We do need to handle bw comp though, because indices created before this change will need to still retrieve from stored fields.
@javanna as far as I understand we have two types in the mapper (to deal with backward compatibility), IgnoredFieldType
and LegacyIgnoredFieldType
the first one using a ValueFetcher
which relies on doc values, the second one using a ValueFetcher
relying on the field being stored
.
See that I use a DocValueFetcher
.
return new DocValueFetcher(docValueFormat(format, null), context.getForField(this, FielddataOperation.SEARCH));
My debugging shows that, for some reason, ValueFetcher
is not used for metadata fields...and if that is the case I am not aware of the reason.
Got it, I believe your assessment is correct. That is because metadata fields that are stored like _ignored
or _source
are fetched via a separate sub fetch phase, see https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/search/fetch/subphase/StoredFieldsPhase.java . The field needs to be removed from there, but under a version conditional. Next, can you check if fetching the field explicitly using the fields parameter of the search API gets you the value through the value fetcher? I suspect we will then need to figure out how to make it appear at the same level as _source in the search response by default, but pulling from doc_values (via value fetcher) and not from stored.
Based on my investigation and after talking to @javanna fetching metadata fields is, at the moment expected to happen in StoredFieldPhase
as a result of the field being stored, which has been the default so far. This means that making the field non-stored the field is not there when the StoredFieldPhase
tries to load stored fields.
Anyway there are other phases loading data from doc values, such as FetchVersionPhase
and SeqNoPrimaryTermsPhase
, which I plan to "mimic" to be able to load the _ignored
field from doc values.
From a performance standpoint I do not see possible performance issues as doc ids are iterated on a per-segment basis which means we do not incur the penalty of loading (and decompressing block) data from sparse Lucene segments (we always load data from all documents in a segment before moving to the next segment).
Pinging @elastic/es-search (Team:Search)
Pinging @elastic/es-storage-engine (Team:StorageEngine)
@elasticsearchmachine run elasticsearch-ci/part-1 please
buildkite test this
I am stuck here...after spending 3 days on this I can't figure out a way to make this work fixing bwc tests...whenever I fix something I break something else.
@salvatore-campagna I can look into the BWC tests.
I think the issue might be with old indices created by versions before 8.14. Those indices still have the _ignored field as stored and we need to keep storing it as stored field to make sure that given an index the _ignored field has only stored fields (before 8.14) or doc value fields (from 8.14 on). If version 8.14 stops writing stored fields to an index that has been created by a version before 8.14 we end up with an index that has the field stored for some documents and not stored for all documents indexed by an 8.14 node.
buildkite test this
Ideally we should stop using skip
in yaml tests for things that require versions above 8.14
and move to using cluster features. Anyway I see there are more yaml tests still using skip
and I would prefer to handle those in a separate PR so that we apply changes to all yaml tests where that is required and avoid adding more stuff to this PR.