OpenSearch icon indicating copy to clipboard operation
OpenSearch copied to clipboard

Add experimental support for zstd and lz4 (native) compression algorithms.

Open mulugetam opened this issue 3 years ago • 61 comments

This PR adds experimental support for lz4 (native) and zstd (with and without dictionary) compression algorithms. The rationale for adding this support is discussed here: https://github.com/opensearch-project/OpenSearch/issues/3354.

Users would be able to set the compression type by setting the index.codec with the following values:

  • "lz4_native" -- lz4 (native) compression.
  • "zstd" -- zstd compression, with dictionary.
  • "zstd_no_dict" -- zstd compression, without dictionary.

The core classes are packaged under org.apache.lucene.codecs.experimental.

The compression algorithms have been tested and benchmarked using OpenSearch-Benchmark and StoredFieldsBenchmark . The results show significant gains in performance.

Signed-off-by: Mulugeta Mammo [email protected]

Description

A detailed description is available here: https://github.com/opensearch-project/OpenSearch/issues/3354

Issues Resolved

https://github.com/opensearch-project/OpenSearch/issues/3354

Check List

  • [x] New functionality includes testing.
    • [x] All tests pass
  • [ ] New functionality has been documented.
    • [ ] New functionality has javadoc added
  • [ ] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

mulugetam avatar Jun 14 '22 09:06 mulugetam

:x:   Gradle Check failure 9d3cbe0dbd8e41daef93385b6c77aa2ca6c95d6b Log 5973

Reports 5973

opensearch-ci-bot avatar Jun 14 '22 09:06 opensearch-ci-bot

Small thing so far:

  • You need to add signature files for new libs + licenses otherwise build fails :-) Take a look here: https://github.com/opensearch-project/OpenSearch/pull/2996 it will reduce my changes.
    ./gradlew precommit/check will help

willyborankin avatar Jun 14 '22 09:06 willyborankin

@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide CodecService from the plugin(s), using sandbox plugin for that would make sense. Thoughts?

reta avatar Jun 14 '22 12:06 reta

@willyborankin I have added signature files and licenses for the dependencies.

mulugetam avatar Jun 14 '22 17:06 mulugetam

:x:   Gradle Check failure 759642278598f5f17fe12af5a6efa052e06e9972 Log 5987

Reports 5987

opensearch-ci-bot avatar Jun 14 '22 17:06 opensearch-ci-bot

@reta the PR is labelled "experimental" just because we'd like to make it available for users together with the snapshotting compression that @willyborankin has been doing. The codecs work well and have significant performance gains.

mulugetam avatar Jun 14 '22 17:06 mulugetam

@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide CodecService from the plugin(s), using sandbox plugin for that would make sense. Thoughts?

I really don't know anything about this stuff, deferring to @nknize, at a high level, yes.

Overall this PR looks good to me.

dblock avatar Jun 14 '22 18:06 dblock

@mulugetam You should get it to green either way, fix DCO, etc.

dblock avatar Jun 14 '22 18:06 dblock

:x:   Gradle Check failure 3de2f6b7a00dfe4f700f2d45e314f7a1ca53c7fa Log 5997

Reports 5997

opensearch-ci-bot avatar Jun 14 '22 20:06 opensearch-ci-bot

:x:   Gradle Check failure 6999a5496e1955e4c632044cee86cddcdb50654b Log 5998

Reports 5998

opensearch-ci-bot avatar Jun 14 '22 20:06 opensearch-ci-bot

:white_check_mark:   Gradle Check success e95ae663dfd0cda0550d5bb7ffd589590b87312a Log 6019

Reports 6019

opensearch-ci-bot avatar Jun 15 '22 07:06 opensearch-ci-bot

:x:   Gradle Check failure 1a558e751816fb87ddf54b3e529c65ab1a8c3691 Log 6027

Reports 6027

opensearch-ci-bot avatar Jun 15 '22 16:06 opensearch-ci-bot

@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide CodecService from the plugin(s), using sandbox plugin for that would make sense. Thoughts?

IMHO I think we can add them as a new experimental feature with default behaviour as it works now.

willyborankin avatar Jun 19 '22 19:06 willyborankin

@willyborankin have fixed the issues your raised. Thanks :-)

mulugetam avatar Jun 21 '22 08:06 mulugetam

:white_check_mark:   Gradle Check success 9c02870d30fb33ae000402d2ace647beba069188 Log 6176

Reports 6176

opensearch-ci-bot avatar Jun 21 '22 08:06 opensearch-ci-bot

:white_check_mark:   Gradle Check success 566217cba00fa09825b71c0b8e2dcff414a1c371 Log 6177

Reports 6177

opensearch-ci-bot avatar Jun 21 '22 08:06 opensearch-ci-bot

:white_check_mark:   Gradle Check success 11376fadc35b0e83b90bbfe4a5c621963e0c48f9 Log 6191

Reports 6191

opensearch-ci-bot avatar Jun 21 '22 19:06 opensearch-ci-bot

For the context, it seems like the ZSTD part is largely coming from [1], the compression benefits are also confirmed in [2].

[1] https://github.com/apache/lucene/pull/439/ [2] https://issues.apache.org/jira/browse/LUCENE-8739

reta avatar Jun 22 '22 13:06 reta

@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide CodecService from the plugin(s), using sandbox plugin for that would make sense. Thoughts?

IMHO I think we can add them as a new experimental feature with default behaviour as it works now.

The sandbox modules / plugins [1] were introduced to cover exactly the experimental features, like this one, keeping to core clean and stable.

[1] https://github.com/opensearch-project/OpenSearch/issues/570

reta avatar Jun 22 '22 13:06 reta

:x:   Gradle Check failure f3d61a03495ed370396b1a5c479b458f45cda713 Log 6268

Reports 6268

opensearch-ci-bot avatar Jun 23 '22 21:06 opensearch-ci-bot

For the context, it seems like the ZSTD part is largely coming from [1], the compression benefits are also confirmed in [2].

[1] apache/lucene#439 [2] https://issues.apache.org/jira/browse/LUCENE-8739

@reta my team is behind that Lucene PR.

mulugetam avatar Jun 23 '22 22:06 mulugetam

@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide CodecService from the plugin(s), using sandbox plugin for that would make sense. Thoughts?

IMHO I think we can add them as a new experimental feature with default behaviour as it works now.

The sandbox modules / plugins [1] were introduced to cover exactly the experimental features, like this one, keeping to core clean and stable.

[1] #570

By the way, what is the current status and plan for the ZSTD snapshotting compression @willyborankin has been doing?

mulugetam avatar Jun 23 '22 23:06 mulugetam

By the way, what is the current status and plan for the ZSTD snapshotting compression @willyborankin has been doing?

The plan for ZSTD for snapshots is different in a sense that there are 0 changes in core, only in plugins (that's the suggestion for this pull request as well). The status, afaik it has some pending comments and review request(s).

reta avatar Jun 23 '22 23:06 reta

:x:   Gradle Check failure 8e5ae51aca5fb1d87f55d2321cfbc57b5264c55b Log 6286

Reports 6286

opensearch-ci-bot avatar Jun 24 '22 01:06 opensearch-ci-bot

@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide CodecService from the plugin(s), using sandbox plugin for that would make sense. Thoughts?

@reta Not sure how I can use the CodecService from inside sandbox/plugins. Can you please clarify? Do we have a sample plugin that demonstrates that?

mulugetam avatar Jun 24 '22 17:06 mulugetam

@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide CodecService from the plugin(s), using sandbox plugin for that would make sense. Thoughts?

@reta Not sure how I can use the CodecService from inside sandbox/plugins. Can you please clarify? Do we have a sample plugin that demonstrates that?

Sure, we have EnginePlugin [1] which has a hook for CodecService:

    default Optional<CodecServiceFactory> getCustomCodecServiceFactory(IndexSettings indexSettings) {
        return Optional.empty();
    }

Basically, implementing sandbox plugin as an EnginePlugin allows to use a custom CodecService.

[1] https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/plugins/EnginePlugin.java#L51

reta avatar Jun 24 '22 17:06 reta

@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide CodecService from the plugin(s), using sandbox plugin for that would make sense. Thoughts?

@reta Not sure how I can use the CodecService from inside sandbox/plugins. Can you please clarify? Do we have a sample plugin that demonstrates that?

Sure, we have EnginePlugin [1] which has a hook for CodecService:

    default Optional<CodecServiceFactory> getCustomCodecServiceFactory(IndexSettings indexSettings) {
        return Optional.empty();
    }

Basically, implementing sandbox plugin as an EnginePlugin allows to use a custom CodecService.

[1] https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/plugins/EnginePlugin.java#L51

@reta Was trying the suggestion and I see that PluginsService picks up the plugins in the plugins directory but not the one(s) in sandbox/plugins directory. Ideas?

mulugetam avatar Jun 25 '22 01:06 mulugetam

@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide CodecService from the plugin(s), using sandbox plugin for that would make sense. Thoughts?

@reta Not sure how I can use the CodecService from inside sandbox/plugins. Can you please clarify? Do we have a sample plugin that demonstrates that?

Sure, we have EnginePlugin [1] which has a hook for CodecService:

    default Optional<CodecServiceFactory> getCustomCodecServiceFactory(IndexSettings indexSettings) {
        return Optional.empty();
    }

Basically, implementing sandbox plugin as an EnginePlugin allows to use a custom CodecService. [1] https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/plugins/EnginePlugin.java#L51

@reta Was trying the suggestion and I see that PluginsService picks up the plugins in the plugins directory but not the one(s) in sandbox/plugins directory. Ideas?

@mulugetam The plugins come as part of OpenSearch release distribution, sandbox are optional in a sense, but from installation perspective - those are just normal ZIP plugin bundles. Did you run into ``PluginsService` issue when adding tests?

reta avatar Jun 25 '22 13:06 reta

@dblock @nknize I think we should take the sandbox approach for experimental features. We have the ability to provide CodecService from the plugin(s), using sandbox plugin for that would make sense. Thoughts?

@reta Not sure how I can use the CodecService from inside sandbox/plugins. Can you please clarify? Do we have a sample plugin that demonstrates that?

Sure, we have EnginePlugin [1] which has a hook for CodecService:

    default Optional<CodecServiceFactory> getCustomCodecServiceFactory(IndexSettings indexSettings) {
        return Optional.empty();
    }

Basically, implementing sandbox plugin as an EnginePlugin allows to use a custom CodecService. [1] https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/plugins/EnginePlugin.java#L51

@reta Was trying the suggestion and I see that PluginsService picks up the plugins in the plugins directory but not the one(s) in sandbox/plugins directory. Ideas?

@mulugetam The plugins come as part of OpenSearch release distribution, sandbox are optional in a sense, but from installation perspective - those are just normal ZIP plugin bundles. Did you run into ``PluginsService` issue when adding tests?

@reta No, I did not. But, I'm now able to use the EnginePlugin to create a custom NativeCodecService that extends from org.opensearch.index.codec.CodecService; and it works as expected. All my sources files reside in sandbox/modules. The problem I have right now is getting around the EngineConfig.INDEX_CODEC_SETTINGS which throws an exception for my new index.codec values ("lz4_native", "zstd", and "zstd_no_dict"). Not sure how to get around it without directly adding my new values in the switch statement. Ideas?

mulugetam avatar Jun 29 '22 21:06 mulugetam

Not sure how to get around it without directly adding my new values in the switch statement. Ideas?

@mulugetam Hm ... this is weird, the Codec should be picked from META-INF/services/org.apache.lucene.codecs.Codec which you bundle with the plugin. EngineConfig.INDEX_CODEC_SETTINGS does allow new codecs. If you have work-in-progress branch somewhere, I could take a look.

reta avatar Jun 30 '22 16:06 reta