[Feature Request] Disable field data on "_id" field

Open bharath-techie opened this issue 1 year ago • 2 comments

Is your feature request related to a problem? Please describe

Right now, field data on the "_id" field is enabled by default.

Users who are unaware of the implications sort on the "_id" field over a large dataset and see a sudden increase in heap usage, because the field data for "_id" is loaded into the fielddata cache, which can grow up to its limit: 20% of heap by default, or a custom value if one is configured.

    public static final Setting<Boolean> INDICES_ID_FIELD_DATA_ENABLED_SETTING = Setting.boolSetting(
        "indices.id_field_data.enabled",
        true,
        Property.Dynamic,
        Property.NodeScope
    );

"indices.fielddata.cache.size" is the setting that determines the cache size limit.
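To put the percentage above in concrete terms, here is a minimal sketch of the arithmetic, assuming the limit is simply a fixed percentage of the configured heap. The class and method names are hypothetical and are not part of OpenSearch.

```java
// Hypothetical helper, not OpenSearch code: shows how large a
// percentage-based cache limit is for a given heap size.
class FielddataCacheLimit {

    // Returns the byte limit for a cache capped at `percent` of `heapBytes`.
    static long limitBytes(long heapBytes, int percent) {
        return heapBytes * percent / 100;
    }

    public static void main(String[] args) {
        long heapBytes = 10L * 1024 * 1024 * 1024; // assumed 10 GiB heap
        long limit = limitBytes(heapBytes, 20);    // 20% figure taken from the issue text
        System.out.println("fielddata cache limit: " + limit + " bytes");
    }
}
```

On a 10 GiB heap, a 20% limit means a single sort on "_id" can end up holding about 2 GiB of heap in the cache.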

Describe the solution you'd like

Change the default of the cluster setting to 'false', so that users make a conscious decision to enable field data on the "_id" field.

    public static final Setting<Boolean> INDICES_ID_FIELD_DATA_ENABLED_SETTING = Setting.boolSetting(
        "indices.id_field_data.enabled",
        false,
        Property.Dynamic,
        Property.NodeScope
    );

Users who need it can continue to use the "_id" field for sorting, aggregations, and so on, simply by changing the cluster setting.
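Because the setting is declared as Dynamic with NodeScope, it should be adjustable at runtime through the cluster settings API. The sketch below only builds such a request and does not send it. The class name, the base URL, and the choice of the "persistent" scope are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds (but does not send) a cluster settings
// update that toggles field data on the "_id" field.
class IdFieldDataToggle {

    static final String SETTING = "indices.id_field_data.enabled";

    // JSON body for PUT _cluster/settings; "persistent" scope is an assumption.
    static String settingsBody(boolean enabled) {
        return "{\"persistent\":{\"" + SETTING + "\":" + enabled + "}}";
    }

    // baseUrl is a placeholder such as http://localhost:9200.
    static HttpRequest buildRequest(String baseUrl, boolean enabled) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_cluster/settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(settingsBody(enabled)))
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("http://localhost:9200", true);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(settingsBody(true));
    }
}
```

With the proposed default of 'false', a user who still wants to sort on "_id" would send the request with `enabled` set to true. Authentication and TLS are left out of the sketch.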

Related component

Search:Query Capabilities

Describe alternatives you've considered

No response

Additional context

No response

bharath-techie avatar Feb 06 '24 10:02 bharath-techie

[Triage - attendees 1 2 3] @bharath-techie Thanks for filing this issue. As a triage team, we see the proposed behavior change as breaking, and it could be controversial.

@reta @msfroh What are your thoughts on this issue? Is there someone else who should be looped in to look at this proposal?

peternied avatar Feb 07 '24 16:02 peternied

This would be a breaking change. To start with, we can update the OpenSearch documentation to warn users about the implications of sorting on high-cardinality fields, including _id. We can keep this issue open to gather feedback from community users and then decide whether we should do this in 3.0.

shwetathareja avatar Feb 09 '24 10:02 shwetathareja

What are your thoughts on this issue? Is there someone else who should be looped in to look at this proposal?

I agree that it would a) be breaking and b) be a good idea. Given the opportunity for users to harm their cluster with _id fielddata, IMO we should definitely disable it by default.

Sorting by _id seems like a good option for a sorting tie-breaker, but users may not anticipate the cost in terms of heap usage. (If you really need to sort by ids as a tie-breaker, I would suggest writing the ids to a field with doc_values:true, and sort on that.)
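As a rough illustration of that workaround, the sketch below builds an index mapping with a separate keyword field backed by doc values, plus a search body that uses it as a tie-breaker after the score. The field name `doc_id` and the class name are hypothetical. The application would also have to write each document's id into that field at index time.

```java
// Hypothetical helper: builds JSON bodies for the "copy the id into a
// doc_values field" workaround. Nothing here talks to a cluster.
class IdTieBreakerExample {

    // Hypothetical field that holds a copy of each document's id.
    static final String ID_COPY_FIELD = "doc_id";

    // Index mapping: a keyword field with doc_values enabled explicitly.
    static String mappingBody() {
        return "{\"mappings\":{\"properties\":{\"" + ID_COPY_FIELD
            + "\":{\"type\":\"keyword\",\"doc_values\":true}}}}";
    }

    // Search body: sort by score, then break ties on the copied id.
    static String searchBody() {
        return "{\"query\":{\"match_all\":{}},\"sort\":["
            + "{\"_score\":{\"order\":\"desc\"}},"
            + "{\"" + ID_COPY_FIELD + "\":{\"order\":\"asc\"}}]}";
    }

    public static void main(String[] args) {
        System.out.println(mappingBody());
        System.out.println(searchBody());
    }
}
```

Sorting on the doc values field reads column data from disk-backed structures rather than building fielddata for "_id" on the heap, which is the point of the suggestion.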

msfroh avatar Feb 22 '24 23:02 msfroh