Configurable Block Size for Stored Fields
Description
It was observed that, for duplicated or otherwise similar data, reducing the block size from 60 KB to 8 KB results in an over 50% increase in the stored fields size.
This observation is coming from OpenSearch: https://github.com/opensearch-project/OpenSearch/issues/3769.
I was able to replicate the results for duplicated documents: https://github.com/opensearch-project/OpenSearch/issues/3769#issuecomment-1938506593. That comment also includes a comparison on non-similar data, where the effect of block size is mostly insignificant.
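The intuition behind the observation can be illustrated outside of Lucene. Stored fields blocks are compressed independently, so the compressor's dictionary resets at every block boundary; with duplicated documents, larger blocks amortize the first (incompressible) copy over more repeats. A minimal sketch using Python's `zlib` as a stand-in for the block compressor (the document size and block sizes mirror the 8 KB vs 60 KB comparison, but this is an illustration, not the actual Lucene benchmark):

```python
import os
import zlib

def chunked_compressed_size(data: bytes, chunk_size: int) -> int:
    """Compress data in independent chunks (each chunk starts a fresh
    zlib stream, like a stored fields block) and return the total size."""
    total = 0
    for i in range(0, len(data), chunk_size):
        total += len(zlib.compress(data[i:i + chunk_size]))
    return total

# A 2 KB pseudo-document duplicated many times, loosely mimicking the
# duplicated-documents scenario from the linked issue.
doc = os.urandom(2048)
data = doc * 256  # 512 KB of highly redundant data

small_blocks = chunked_compressed_size(data, 8 * 1024)   # 8 KB blocks
large_blocks = chunked_compressed_size(data, 60 * 1024)  # 60 KB blocks

print(f"8 KB blocks:  {small_blocks} bytes")
print(f"60 KB blocks: {large_blocks} bytes")
```

On this kind of data, the 8 KB configuration produces a noticeably larger total because each block must store one raw copy of the document before back-references kick in, which is consistent with the regression reported above.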
Currently, if my understanding is correct, there is no clean way to toggle the block sizes of the codecs without creating a separate Codec and StoredFieldsFormat (I took a stab at this approach over here).
I would like to get the community's feedback on whether we could provide a way to make the block size configurable, allowing users to choose based on their type of workload.
The ability to configure Lucene file formats by adding tuning knobs has come up a few times in the past, and the answer has been to create a custom codec, in order to keep the scope of what Lucene needs to maintain backward compatibility for contained.
The downside is that you're then on your own when it comes to maintaining backward compatibility for the data (which is typically OK for use cases that don't need backward compatibility guarantees, e.g. because they reindex data daily from another data source). But it sounds fair to require users to maintain backward compatibility themselves when Lucene's out-of-the-box file formats are not good for them.
If you want slower but better compression, maybe consider using BEST_COMPRESSION instead of BEST_SPEED? That is the purpose of the option...