switch to front-coded v1 bucket size 4 by default
Description
I think we should consider switching the IndexSpec default value of stringDictionaryEncoding to {"type":"frontCoded", "bucketSize":4, "formatVersion":1}.
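For reference, the proposed default is equivalent to setting the encoding explicitly in an ingestion spec today. The fragment below is a minimal sketch, assuming a native batch task (`index_parallel` is chosen only for illustration); the rest of the spec is omitted, and other task types would carry the same `indexSpec` in their own `tuningConfig`.

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "indexSpec": {
      "stringDictionaryEncoding": {
        "type": "frontCoded",
        "bucketSize": 4,
        "formatVersion": 1
      }
    }
  }
}
```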
Based on the measurements in #13854, things look pretty good. We have been running version 0 of the format for some time on a number of datasources without any notable performance loss, and version 1 for a shorter period. I think that by the time 27 is released it will be sufficiently baked in for us to feel confident about making it the default.
However, this means that upgrading from versions older than 26 will need special consideration, so it is important to call out in the release notes if we go forward with this.
Release note
Front coding was originally introduced in Druid 25.0, and an improved 'version 1' was introduced in Druid 26.0, with typically faster read speed and smaller storage size. Version 1 has become the default in Druid 27.0. This means that, by default, segments created with Druid 27.0 are backwards compatible with Druid 26.0, but not with Druid versions older than 26.0. If upgrading to Druid 27.0 from a version older than 26.0, set stringDictionaryEncoding to {"type": "utf8"} to keep writing the older format, which enables seamless downgrades to Druid 25.0 and older. Once you have determined that rollback is not necessary, it is recommended to change the setting to the new default.
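As a sketch of the override described in the release note, a spec that needs to stay readable by Druid 25.0 and older could pin the dictionary encoding as follows. The `index_parallel` task type is an assumption made for illustration; only the `stringDictionaryEncoding` value comes from the note itself.

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "indexSpec": {
      "stringDictionaryEncoding": {
        "type": "utf8"
      }
    }
  }
}
```

Removing the `stringDictionaryEncoding` entry later would let the spec pick up the new front-coded default.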
This PR has:
- [x] been self-reviewed.
- [x] added documentation for new or modified features or behaviors.
- [x] a release note entry in the PR description.
- [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- [x] been tested in a test Druid cluster.
People may find it difficult to update all their ingest specs, supervisors, etc, to ensure backwards compatibility. Is it possible to also add a runtime property that controls the default? That way, a cluster admin only has to set it in one place rather than track down all their users.
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.
This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.