OpenSearch
OpenSearch copied to clipboard
Add experimental support for zstd and lz4 (native) compression algorithms.
This PR adds experimental support for lz4 (native) and zstd (with and without dictionary) compression algorithms. The rationale for adding this support is discussed here: https://github.com/opensearch-project/OpenSearch/issues/3354.
Users would be able to set the compression type by setting the index.codec with the following values:
"lz4_native"-- lz4 (native) compression."zstd"-- zstd compression, with dictionary."zstd_no_dict"-- zstd compression, without dictionary.
The core classes are packaged under org.apache.lucene.codecs.experimental.
The compression algorithms have been tested and benchmarked using OpenSearch-Benchmark and StoredFieldsBenchmark . The results show significant gains in performance.
Signed-off-by: Mulugeta Mammo [email protected]
Description
A detailed description is available here: https://github.com/opensearch-project/OpenSearch/issues/3354
Issues Resolved
https://github.com/opensearch-project/OpenSearch/issues/3354
Check List
- [x] New functionality includes testing.
- [x] All tests pass
- [ ] New functionality has been documented.
- [ ] New functionality has javadoc added
- [ ] Commits are signed per the DCO using --signoff
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Small thing so far:
- You need to add signature files for new libs + licenses otherwise build fails :-)
Take a look here: https://github.com/opensearch-project/OpenSearch/pull/2996 it will reduce my changes.
./gradlew precommit/checkwill help
@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide CodecService from the plugin(s), using sandbox plugin for that would make sense. Thoughts?
@willyborankin I have added signature files and licenses for the dependencies.
@reta the PR is labelled "experimental" just because we'd like to make it available for users together with the snapshotting compression that @willyborankin has been doing. The codecs work well and have significant performance gains.
@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide
CodecServicefrom the plugin(s), using sandbox plugin for that would make sense. Thoughts?
I really don't know anything about this stuff, deferring to @nknize, at a high level, yes.
Overall this PR looks good to me.
@mulugetam You should get it to green either way, fix DCO, etc.
:white_check_mark: Â Gradle Check success e95ae663dfd0cda0550d5bb7ffd589590b87312a Log 6019
@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide
CodecServicefrom the plugin(s), using sandbox plugin for that would make sense. Thoughts?
IMHO I think we can add them as a new experimental feature with default behaviour as it works now.
@willyborankin have fixed the issues your raised. Thanks :-)
:white_check_mark: Â Gradle Check success 9c02870d30fb33ae000402d2ace647beba069188 Log 6176
:white_check_mark: Â Gradle Check success 566217cba00fa09825b71c0b8e2dcff414a1c371 Log 6177
:white_check_mark: Â Gradle Check success 11376fadc35b0e83b90bbfe4a5c621963e0c48f9 Log 6191
For the context, it seems like the ZSTD part is largely coming from [1], the compression benefits are also confirmed in [2].
[1] https://github.com/apache/lucene/pull/439/ [2] https://issues.apache.org/jira/browse/LUCENE-8739
@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide
CodecServicefrom the plugin(s), using sandbox plugin for that would make sense. Thoughts?IMHO I think we can add them as a new experimental feature with default behaviour as it works now.
The sandbox modules / plugins [1] were introduced to cover exactly the experimental features, like this one, keeping to core clean and stable.
[1] https://github.com/opensearch-project/OpenSearch/issues/570
For the context, it seems like the ZSTD part is largely coming from [1], the compression benefits are also confirmed in [2].
[1] apache/lucene#439 [2] https://issues.apache.org/jira/browse/LUCENE-8739
@reta my team is behind that Lucene PR.
@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide
CodecServicefrom the plugin(s), using sandbox plugin for that would make sense. Thoughts?IMHO I think we can add them as a new experimental feature with default behaviour as it works now.
The sandbox modules / plugins [1] were introduced to cover exactly the experimental features, like this one, keeping to core clean and stable.
[1] #570
By the way, what is the current status and plan for the ZSTD snapshotting compression @willyborankin has been doing?
By the way, what is the current status and plan for the ZSTD snapshotting compression @willyborankin has been doing?
The plan for ZSTD for snapshots is different in a sense that there are 0 changes in core, only in plugins (that's the suggestion for this pull request as well). The status, afaik it has some pending comments and review request(s).
@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide
CodecServicefrom the plugin(s), using sandbox plugin for that would make sense. Thoughts?
@reta Not sure how I can use the CodecService from inside sandbox/plugins. Can you please clarify? Do we have a sample plugin that demonstrates that?
@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide
CodecServicefrom the plugin(s), using sandbox plugin for that would make sense. Thoughts?@reta Not sure how I can use the
CodecServicefrom inside sandbox/plugins. Can you please clarify? Do we have a sample plugin that demonstrates that?
Sure, we have EnginePlugin [1] which has a hook for CodecService:
default Optional<CodecServiceFactory> getCustomCodecServiceFactory(IndexSettings indexSettings) {
return Optional.empty();
}
Basically, implementing sandbox plugin as an EnginePlugin allows to use a custom CodecService.
[1] https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/plugins/EnginePlugin.java#L51
@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide
CodecServicefrom the plugin(s), using sandbox plugin for that would make sense. Thoughts?@reta Not sure how I can use the
CodecServicefrom inside sandbox/plugins. Can you please clarify? Do we have a sample plugin that demonstrates that?Sure, we have
EnginePlugin[1] which has a hook forCodecService:default Optional<CodecServiceFactory> getCustomCodecServiceFactory(IndexSettings indexSettings) { return Optional.empty(); }Basically, implementing sandbox plugin as an
EnginePluginallows to use a customCodecService.[1] https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/plugins/EnginePlugin.java#L51
@reta Was trying the suggestion and I see that PluginsService picks up the plugins in the plugins directory but not the one(s) in sandbox/plugins directory. Ideas?
@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide
CodecServicefrom the plugin(s), using sandbox plugin for that would make sense. Thoughts?@reta Not sure how I can use the
CodecServicefrom inside sandbox/plugins. Can you please clarify? Do we have a sample plugin that demonstrates that?Sure, we have
EnginePlugin[1] which has a hook forCodecService:default Optional<CodecServiceFactory> getCustomCodecServiceFactory(IndexSettings indexSettings) { return Optional.empty(); }Basically, implementing sandbox plugin as an
EnginePluginallows to use a customCodecService. [1] https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/plugins/EnginePlugin.java#L51@reta Was trying the suggestion and I see that
PluginsServicepicks up the plugins in thepluginsdirectory but not the one(s) insandbox/pluginsdirectory. Ideas?
@mulugetam The plugins come as part of OpenSearch release distribution, sandbox are optional in a sense, but from installation perspective - those are just normal ZIP plugin bundles. Did you run into ``PluginsService` issue when adding tests?
@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide
CodecServicefrom the plugin(s), using sandbox plugin for that would make sense. Thoughts?@reta Not sure how I can use the
CodecServicefrom inside sandbox/plugins. Can you please clarify? Do we have a sample plugin that demonstrates that?Sure, we have
EnginePlugin[1] which has a hook forCodecService:default Optional<CodecServiceFactory> getCustomCodecServiceFactory(IndexSettings indexSettings) { return Optional.empty(); }Basically, implementing sandbox plugin as an
EnginePluginallows to use a customCodecService. [1] https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/plugins/EnginePlugin.java#L51@reta Was trying the suggestion and I see that
PluginsServicepicks up the plugins in thepluginsdirectory but not the one(s) insandbox/pluginsdirectory. Ideas?@mulugetam The
pluginscome as part of OpenSearch release distribution, sandbox are optional in a sense, but from installation perspective - those are just normal ZIP plugin bundles. Did you run into ``PluginsService` issue when adding tests?
@reta No, I did not. But, I'm now able to use the EnginePlugin to create a custom NativeCodecService that extends from org.opensearch.index.codec.CodecService; and it works as expected. All my sources files reside in sandbox/modules. The problem I have right now is getting around the EngineConfig.INDEX_CODEC_SETTINGS which throws an exception for my new index.codec values ("lz4_native", "zstd", and "zstd_no_dict"). Not sure how to get around it without directly adding my new values in the switch statement. Ideas?
Not sure how to get around it without directly adding my new values in the
switchstatement. Ideas?
@mulugetam Hm ... this is weird, the Codec should be picked from META-INF/services/org.apache.lucene.codecs.Codec which you bundle with the plugin. EngineConfig.INDEX_CODEC_SETTINGS does allow new codecs. If you have work-in-progress branch somewhere, I could take a look.