Add SchemaConformingTransformerV2 to enhance text search abilities

Open lnbest0707 opened this issue 1 year ago • 1 comments

tags: feature, refactor, release-notes

This PR adds SchemaConformingTransformerV2, an evolved version of the existing SchemaConformingTransformer, with the following new features:

- Refactored code with better readability and extensibility.
- Support for overlapping schema fields, so that schema columns "a" and "a.b" can coexist. Only primitive-type fields are allowed as values.
- Extraction of flattened key-value pairs into mergedTextIndex for better text search.
- Shingle index tokenization for extremely large text fields.
- Flexibility to map JSON-extracted field names to meaningful user-specified column names.
- Improved serialization logic to handle nested JSON fields.
- Graceful handling of extracted String-type columns: a collection or array is converted to String if the column type is singleField.
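To make the flattening, merged-text, and shingling ideas above concrete, here is a minimal, self-contained sketch. It is not the code in this PR: the class `TransformerSketch`, its method names, the `key:value` document format, and the `maxLength`/`overlap` parameters are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names and formats are assumptions, not Pinot APIs.
final class TransformerSketch {

  // Flattens nested maps into dotted key paths; only non-map leaves become values.
  static Map<String, Object> flatten(Map<String, Object> record) {
    Map<String, Object> out = new LinkedHashMap<>();
    flattenInto("", record, out);
    return out;
  }

  private static void flattenInto(String prefix, Map<String, Object> node, Map<String, Object> out) {
    for (Map.Entry<String, Object> e : node.entrySet()) {
      String path = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
      if (e.getValue() instanceof Map) {
        @SuppressWarnings("unchecked")
        Map<String, Object> child = (Map<String, Object>) e.getValue();
        flattenInto(path, child, out);
      } else {
        out.put(path, e.getValue());
      }
    }
  }

  // Builds one text document per flattened pair (the "key:value" format is assumed).
  static List<String> mergedText(Map<String, Object> flat) {
    List<String> docs = new ArrayList<>();
    for (Map.Entry<String, Object> e : flat.entrySet()) {
      docs.add(e.getKey() + ":" + e.getValue());
    }
    return docs;
  }

  // Splits text into windows of at most maxLength characters;
  // consecutive windows share `overlap` characters.
  static List<String> shingle(String text, int maxLength, int overlap) {
    if (maxLength <= 0 || overlap < 0 || overlap >= maxLength) {
      throw new IllegalArgumentException("need 0 <= overlap < maxLength");
    }
    List<String> out = new ArrayList<>();
    int step = maxLength - overlap;
    for (int start = 0; start < text.length(); start += step) {
      int end = Math.min(start + maxLength, text.length());
      out.add(text.substring(start, end));
      if (end == text.length()) {
        break; // the last window reached the end of the text
      }
    }
    if (out.isEmpty()) {
      out.add(text); // empty input yields a single empty shingle
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Object> a = new LinkedHashMap<>();
    a.put("b", 1);
    Map<String, Object> record = new LinkedHashMap<>();
    record.put("a", a);
    record.put("msg", "hello");

    Map<String, Object> flat = flatten(record);
    System.out.println(flat);                        // {a.b=1, msg=hello}
    System.out.println(mergedText(flat));            // [a.b:1, msg:hello]
    System.out.println(shingle("abcdefghij", 4, 2)); // [abcd, cdef, efgh, ghij]
  }
}
```

The overlap between consecutive shingles is what lets a phrase that straddles a window boundary still appear whole in at least one window, provided the phrase is no longer than the overlap.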

The new transformer was contributed by multiple developers: @jackluo923, @Bill-hbrhbr, @itschrispeck, and @lnbest0707-uber. The PR owner is summarizing the work and maintaining the OSS upload.

lnbest0707 avatar Apr 03 '24 19:04 lnbest0707

Codecov Report

Attention: Patch coverage is 56.98630%, with 157 lines in your changes missing coverage. Please review.

Project coverage is 62.03%. Comparing base (59551e4) to head (b9a013e). Report is 218 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| ...ingestion/SchemaConformingTransformerV2Config.java | 0.00% | 78 Missing :warning: |
| ...cordtransformer/SchemaConformingTransformerV2.java | 73.51% | 37 Missing and 30 partials :warning: |
| ...recordtransformer/SchemaConformingTransformer.java | 66.66% | 0 Missing and 4 partials :warning: |
| .../apache/pinot/segment/local/utils/Base64Utils.java | 80.00% | 1 Missing and 1 partial :warning: |
| ...ache/pinot/segment/local/utils/IngestionUtils.java | 0.00% | 0 Missing and 2 partials :warning: |
| ...he/pinot/segment/local/utils/TableConfigUtils.java | 50.00% | 1 Missing and 1 partial :warning: |
| ...ot/spi/config/table/ingestion/IngestionConfig.java | 50.00% | 2 Missing :warning: |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12788      +/-   ##
============================================
+ Coverage     61.75%   62.03%   +0.28%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2468      +32     
  Lines        133233   135237    +2004     
  Branches      20636    20892     +256     
============================================
+ Hits          82274    83892    +1618     
- Misses        44911    45145     +234     
- Partials       6048     6200     +152     

| Flag | Coverage Δ | |
|---|---|---|
| custom-integration1 | <0.01% <0.00%> (-0.01%) | :arrow_down: |
| integration | <0.01% <0.00%> (-0.01%) | :arrow_down: |
| integration1 | <0.01% <0.00%> (-0.01%) | :arrow_down: |
| integration2 | 0.00% <0.00%> (ø) | |
| java-11 | 61.98% <56.98%> (+0.27%) | :arrow_up: |
| java-21 | 61.86% <56.98%> (+0.23%) | :arrow_up: |
| skip-bytebuffers-false | 62.01% <56.98%> (+0.26%) | :arrow_up: |
| skip-bytebuffers-true | 61.83% <56.98%> (+34.10%) | :arrow_up: |
| temurin | 62.03% <56.98%> (+0.28%) | :arrow_up: |
| unittests | 62.02% <56.98%> (+0.28%) | :arrow_up: |
| unittests1 | 46.52% <1.09%> (-0.37%) | :arrow_down: |
| unittests2 | 28.14% <55.89%> (+0.40%) | :arrow_up: |

Flags with carried forward coverage won't be shown.

codecov-commenter avatar Apr 03 '24 20:04 codecov-commenter

Checks for some extreme column-name corner cases will be handled in a future patch, e.g., a column name such as a., or input data such as {"a": {"b":1}, "a.b":2, "a.":3, "a.b.":4}.
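
To show why input like this is awkward, here is a small sketch that flattens a record with dotted paths and reports the keys that become ambiguous. It is only an illustration of the problem, not the check planned for the future patch; the class `KeyCollisionSketch`, the method `findProblems`, and the message strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the corner cases; not part of Pinot.
final class KeyCollisionSketch {

  // Returns a description of every key that is malformed or collides after flattening.
  static List<String> findProblems(Map<String, Object> record) {
    List<String> problems = new ArrayList<>();
    collect("", record, new LinkedHashMap<>(), problems);
    return problems;
  }

  private static void collect(String prefix, Map<String, Object> node,
      Map<String, Object> flat, List<String> problems) {
    for (Map.Entry<String, Object> e : node.entrySet()) {
      String key = e.getKey();
      String path = prefix.isEmpty() ? key : prefix + "." + key;
      if (key.isEmpty() || key.endsWith(".")) {
        // A trailing dot implies an empty path segment, e.g. "a." or "a.b."
        problems.add("malformed key: " + path);
        continue;
      }
      if (e.getValue() instanceof Map) {
        @SuppressWarnings("unchecked")
        Map<String, Object> child = (Map<String, Object>) e.getValue();
        collect(path, child, flat, problems);
      } else if (flat.containsKey(path)) {
        // A nested {"a": {"b": ...}} and a literal "a.b" key map to the same path.
        problems.add("duplicate path: " + path);
      } else {
        flat.put(path, e.getValue());
      }
    }
  }

  public static void main(String[] args) {
    Map<String, Object> a = new LinkedHashMap<>();
    a.put("b", 1);
    Map<String, Object> record = new LinkedHashMap<>();
    record.put("a", a);
    record.put("a.b", 2);
    record.put("a.", 3);
    record.put("a.b.", 4);
    // prints [duplicate path: a.b, malformed key: a., malformed key: a.b.]
    System.out.println(findProblems(record));
  }
}
```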

lnbest0707 avatar Apr 09 '24 18:04 lnbest0707