OpenSearch
[Feature Request] Disable field data on "_id" field
Is your feature request related to a problem? Please describe
Right now, field data on the "_id" field is enabled by default.
Users who are unaware of the implications sort on the "_id" field over a large dataset and then see a sudden increase in heap usage, because field data for "_id" is loaded into the field data cache. That cache can grow up to its limit, which is 20% of the heap by default (or a custom value, if one is configured).
public static final Setting<Boolean> INDICES_ID_FIELD_DATA_ENABLED_SETTING = Setting.boolSetting(
    "indices.id_field_data.enabled",
    true,
    Property.Dynamic,
    Property.NodeScope
);
The cache size limit itself is controlled by the "indices.fielddata.cache.size" setting.
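To make the heap impact concrete, here is a minimal arithmetic sketch. It takes the 20% figure stated above at face value; the 10 GiB heap size and the class and method names are hypothetical, chosen only for illustration.

```java
// Sketch: how large the field data cache may grow for a given heap size.
// The 20% figure is the default stated in this issue; the heap size is hypothetical.
final class FieldDataCacheLimit {

    /** Returns the cache limit in bytes for a heap size and a percentage of that heap. */
    static long limitBytes(long heapBytes, int percent) {
        return heapBytes * percent / 100;
    }

    public static void main(String[] args) {
        long heapBytes = 10L * 1024 * 1024 * 1024; // hypothetical 10 GiB heap
        long limit = limitBytes(heapBytes, 20);
        System.out.println("Field data cache may grow to " + limit / (1024 * 1024) + " MiB");
    }
}
```

On a node of that size, sorting on "_id" across a large dataset could therefore hold a sizeable share of the heap in the cache before eviction starts.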
Describe the solution you'd like
Change the default of this cluster setting to 'false', so that users make a conscious decision to enable field data on the "_id" field.
public static final Setting<Boolean> INDICES_ID_FIELD_DATA_ENABLED_SETTING = Setting.boolSetting(
    "indices.id_field_data.enabled",
    false,
    Property.Dynamic,
    Property.NodeScope
);
Users who need to sort or aggregate on the "_id" field can continue to do so by changing only this cluster setting.
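As a rough sketch of what opting back in could look like, the snippet below builds a cluster settings update for "indices.id_field_data.enabled" using the JDK's built-in HTTP client. The base URL `http://localhost:9200`, the choice of a persistent (rather than transient) setting, and the class and method names are assumptions for illustration; a real cluster would typically also need authentication and TLS.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: re-enable field data on "_id" through the cluster settings API.
final class IdFieldDataSetting {
    static final String SETTING_KEY = "indices.id_field_data.enabled";

    /** JSON body for PUT /_cluster/settings; "persistent" is an assumption of this sketch. */
    static String settingsBody(boolean enabled) {
        return "{\"persistent\":{\"" + SETTING_KEY + "\":" + enabled + "}}";
    }

    /** Builds the request without sending it. */
    static HttpRequest buildRequest(String baseUrl, boolean enabled) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_cluster/settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(settingsBody(enabled)))
            .build();
    }

    /** Sends the request and returns the HTTP status code. */
    static int apply(HttpClient client, String baseUrl, boolean enabled) throws Exception {
        HttpResponse<String> response =
            client.send(buildRequest(baseUrl, enabled), HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) {
        // Prints what would be sent; call apply(...) to contact a cluster.
        HttpRequest request = buildRequest("http://localhost:9200", true); // hypothetical address
        System.out.println(request.method() + " " + request.uri());
        System.out.println(settingsBody(true));
    }
}
```

Because the setting is declared with Property.Dynamic, a change like this should take effect without a node restart.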
Related component
Search:Query Capabilities
Describe alternatives you've considered
No response
Additional context
No response
[Triage - attendees 1 2 3] @bharath-techie Thanks for filing this issue. As a triage team, we think the proposed behavior change is breaking and could be controversial.
@reta @msfroh What are your thoughts on this issue? Is there someone else who should be looped in to look at this proposal?
This would be a breaking change. To start with, we can update the OpenSearch documentation to warn users about the implications of sorting on high-cardinality fields, including _id. We can keep this issue open to gather feedback from community users and then decide whether we should do this in 3.0.
What are your thoughts on this issue? Is there someone else who should be looped in to look at this proposal?
I agree that it would a) be breaking and b) be a good idea. Given the opportunity for users to harm their cluster with _id fielddata, IMO we should definitely disable it by default.
Sorting by _id seems like a good option for a sorting tie-breaker, but users may not anticipate the cost in terms of heap usage. (If you really need to sort by ids as a tie-breaker, I would suggest writing the ids to a field with doc_values:true, and sort on that.)
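A minimal sketch of that workaround follows, written as Java helpers that build the JSON bodies for creating the index and for a search. The field name `doc_id`, the primary sort field `timestamp`, and the class and method names are hypothetical; the application is assumed to copy each document's id into `doc_id` at index time. The field-name arguments are assumed to be plain identifiers, so no JSON escaping is done.

```java
// Sketch: sort on a doc_values-backed copy of the id instead of on _id itself.
final class IdTieBreaker {
    // Hypothetical field that holds a copy of each document's id.
    static final String ID_COPY_FIELD = "doc_id";

    /** Index-creation body: a keyword field with doc_values enabled explicitly. */
    static String mappingBody() {
        return "{\"mappings\":{\"properties\":{\"" + ID_COPY_FIELD
            + "\":{\"type\":\"keyword\",\"doc_values\":true}}}}";
    }

    /** Search body: primary sort first, then the id copy as a tie-breaker. */
    static String searchBody(String primarySortField) {
        return "{\"sort\":[{\"" + primarySortField + "\":\"desc\"},{\""
            + ID_COPY_FIELD + "\":\"asc\"}]}";
    }

    public static void main(String[] args) {
        System.out.println(mappingBody());
        System.out.println(searchBody("timestamp")); // hypothetical primary sort field
    }
}
```

Sorting on such a field reads doc values from disk-backed structures rather than building field data on the heap, which is the point of the suggestion above.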