feat: optimize `Like` query with n-gram

Open SpadeA-Tang opened this issue 6 months ago • 21 comments

Ref #42053

This is the first PR to optimize LIKE queries with an n-gram inverted index. For now, only the VARCHAR data type and only inner-match LIKE patterns (%xxx%) are supported.
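As a rough illustration of why an n-gram inverted index helps inner-match LIKE, here is a minimal Python sketch. It is not the code in this PR: the function names are hypothetical, it uses a single fixed gram size, and it works on Python characters, which may differ from how Milvus tokenizes. The idea is that every document matching %xxx% must contain all n-grams of xxx, so intersecting posting lists yields a small candidate set that is then verified with a real substring check.

```python
from collections import defaultdict


def ngrams(text, n):
    """Return the set of all length-n substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(docs, n):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in ngrams(text, n):
            index[gram].add(doc_id)
    return index


def like_inner_match(docs, index, n, literal):
    """Evaluate LIKE '%literal%' using the index when possible."""
    grams = ngrams(literal, n)
    if not grams:
        # Literal shorter than n: the index cannot help, so scan everything.
        candidates = range(len(docs))
    else:
        # A matching document must contain every n-gram of the literal.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Candidates may be false positives, so verify with a substring check.
    return sorted(i for i in candidates if literal in docs[i])


docs = ["vector database", "data lake", "graph engine"]
index = build_index(docs, 3)
print(like_inner_match(docs, index, 3, "data"))  # prints [0, 1]
```

The final verification step matters because sharing all n-grams does not guarantee that they appear contiguously in the document.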

How to use it:

from pymilvus import MilvusClient, DataType

milvus_client = MilvusClient("http://localhost:19530")
schema = milvus_client.create_schema()
...
schema.add_field("content_ngram", DataType.VARCHAR, max_length=10000, enable_ngram_index=True, min_gram=2, max_gram=4)
...
milvus_client.create_collection(COLLECTION_NAME, ...)

min_gram and max_gram control how documents are tokenized. For example, with min_gram=2 and max_gram=4, each document is tokenized into 2-grams, 3-grams, and 4-grams.
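To make the tokenization concrete, the following Python sketch lists the grams produced for one short string. It assumes character-level sliding windows; the helper name is hypothetical, and whether Milvus slices by character or byte, or deduplicates grams, is not stated here.

```python
def tokenize_ngrams(text, min_gram, max_gram):
    """Return every substring of text whose length is in [min_gram, max_gram]."""
    grams = []
    for n in range(min_gram, max_gram + 1):
        # Slide a window of width n across the text.
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


print(tokenize_ngrams("milvus", 2, 4))
# ['mi', 'il', 'lv', 'vu', 'us', 'mil', 'ilv', 'lvu', 'vus', 'milv', 'ilvu', 'lvus']
```

A wider range between min_gram and max_gram produces more grams per document, so the index grows accordingly.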

SpadeA-Tang avatar May 13 '25 08:05 SpadeA-Tang

@SpadeA-Tang

Invalid PR Title Format Detected

Your PR submission does not adhere to our required standards. To ensure clarity and consistency, please meet the following criteria:

  1. Title Format: The PR title must begin with one of these prefixes:
  • feat: for introducing a new feature.
  • fix: for bug fixes.
  • enhance: for improvements to existing functionality.
  • test: for adding tests to existing functionality.
  • doc: for modifying documentation.
  • auto: for the pull request from bot.
  2. Description Requirement: The PR must include a non-empty description, detailing the changes and their impact.

Required Title Structure:

[Type]: [Description of the PR]

Where Type is one of feat, fix, enhance, test or doc.

Example:

enhance: improve search performance significantly 
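The title rule above can be checked locally before pushing. The Python sketch below is only an approximation: the bot's actual validation logic is not shown in this thread, so the pattern and the function name are assumptions based on the listed prefixes.

```python
import re

# Prefixes listed in the bot message; the exact rule the bot applies is assumed.
ALLOWED_TYPES = ("feat", "fix", "enhance", "test", "doc", "auto")
TITLE_PATTERN = re.compile(r"^(?:%s): \S.*$" % "|".join(ALLOWED_TYPES))


def is_valid_title(title):
    """Return True if the title looks like '[Type]: [Description of the PR]'."""
    return TITLE_PATTERN.match(title) is not None


print(is_valid_title("enhance: improve search performance significantly"))  # True
print(is_valid_title("improve search performance"))  # False
```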

Please review and update your PR to comply with these guidelines.

mergify[bot] avatar May 13 '25 08:05 mergify[bot]

@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify[bot] avatar May 22 '25 17:05 mergify[bot]

@SpadeA-Tang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar May 22 '25 18:05 mergify[bot]

@SpadeA-Tang cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

mergify[bot] avatar May 22 '25 18:05 mergify[bot]

@SpadeA-Tang cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

mergify[bot] avatar May 23 '25 05:05 mergify[bot]

@SpadeA-Tang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar May 23 '25 08:05 mergify[bot]

Codecov Report

Attention: Patch coverage is 87.45981% with 78 lines in your changes missing coverage. Please review.

Project coverage is 78.86%. Comparing base (d4260b4) to head (fb4c641). Report is 32 commits behind head on master.

Files with missing lines | Patch % | Lines
...rnal/core/src/exec/expression/JsonContainsExpr.cpp | 91.49% | 21 Missing :warning:
internal/core/src/index/NgramInvertedIndex.cpp | 82.75% | 15 Missing :warning:
...ernal/core/src/indexbuilder/ScalarIndexCreator.cpp | 23.07% | 10 Missing :warning:
internal/core/src/exec/expression/UnaryExpr.cpp | 86.76% | 9 Missing :warning:
...ternal/util/indexparamcheck/ngram_index_checker.go | 76.66% | 5 Missing and 2 partials :warning:
internal/core/src/index/IndexFactory.cpp | 73.33% | 4 Missing :warning:
internal/datacoord/stats_inspector.go | 57.14% | 3 Missing :warning:
internal/querynodev2/segments/segment_loader.go | 25.00% | 3 Missing :warning:
internal/core/src/exec/expression/Utils.h | 96.49% | 2 Missing :warning:
internal/core/src/segcore/SegmentInterface.cpp | 0.00% | 2 Missing :warning:
... and 2 more
Additional details and impacted files

@@            Coverage Diff             @@
##           master   #41803      +/-   ##
==========================================
- Coverage   80.46%   78.86%   -1.60%     
==========================================
  Files        1551     1557       +6     
  Lines      221379   222101     +722     
==========================================
- Hits       178128   175162    -2966     
- Misses      36860    40466    +3606     
- Partials     6391     6473      +82     
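The percentages in this report can be re-derived from the raw counts. The Python sketch below is a sanity check, not Codecov's own computation: it assumes coverage is hits divided by total lines, that the 78 "missing" patch lines are all the uncovered lines in the patch, and the patch size of 622 lines is inferred from the reported percentage rather than stated anywhere in the report.

```python
import math


def coverage_pct(hits, total):
    """Coverage as a percentage of covered lines over total lines."""
    return 100.0 * hits / total


# Head totals from the coverage diff.
hits, misses, partials, lines = 175162, 40466, 6473, 222101
assert hits + misses + partials == lines  # the three buckets add up

head = coverage_pct(hits, lines)
# The report shows 78.86%, which matches truncation to two decimals.
print(math.floor(head * 100) / 100)

# Patch coverage: infer the patch size from 78 missing lines at 87.45981%.
missing, patch_pct = 78, 87.45981
patch_lines = round(missing / (1 - patch_pct / 100))  # inferred, not reported
print(patch_lines, round(coverage_pct(patch_lines - missing, patch_lines), 5))
```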
Components | Coverage Δ
Client | 79.39% <ø> (+0.14%) :arrow_up:
Core | 73.76% <88.79%> (+0.89%) :arrow_up:
Go | 79.82% <69.04%> (-2.15%) :arrow_down:

Files with missing lines | Coverage Δ
...ernal/core/src/exec/expression/BinaryRangeExpr.cpp | 87.05% <100.00%> (-0.11%) :arrow_down:
internal/core/src/exec/expression/UnaryExpr.h | 77.69% <ø> (ø)
internal/core/src/index/IndexFactory.h | 100.00% <ø> (ø)
internal/core/src/index/IndexInfo.h | 100.00% <ø> (ø)
...ernal/core/src/index/JsonKeyStatsInvertedIndex.cpp | 92.30% <ø> (ø)
...ternal/core/src/segcore/ChunkedSegmentSealedImpl.h | 54.76% <100.00%> (+3.47%) :arrow_up:
internal/core/src/segcore/SegmentInterface.h | 73.46% <ø> (ø)
internal/core/src/segcore/SegmentSealed.h | 93.33% <ø> (ø)
internal/core/src/segcore/load_index_c.cpp | 42.77% <100.00%> (+19.67%) :arrow_up:
internal/core/src/storage/DiskFileManagerImpl.cpp | 66.12% <100.00%> (+1.66%) :arrow_up:
... and 17 more

... and 291 files with indirect coverage changes

codecov[bot] avatar May 23 '25 13:05 codecov[bot]

@SpadeA-Tang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar May 23 '25 14:05 mergify[bot]

@SpadeA-Tang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar May 26 '25 04:05 mergify[bot]

@SpadeA-Tang cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

mergify[bot] avatar Jun 04 '25 08:06 mergify[bot]

@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify[bot] avatar Jun 17 '25 09:06 mergify[bot]

@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify[bot] avatar Jun 17 '25 10:06 mergify[bot]

@SpadeA-Tang cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Jun 17 '25 10:06 mergify[bot]

@SpadeA-Tang cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

mergify[bot] avatar Jun 17 '25 12:06 mergify[bot]

@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify[bot] avatar Jun 18 '25 03:06 mergify[bot]

@SpadeA-Tang cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Jun 18 '25 03:06 mergify[bot]

@SpadeA-Tang cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

mergify[bot] avatar Jun 18 '25 04:06 mergify[bot]

@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify[bot] avatar Jun 18 '25 09:06 mergify[bot]

@SpadeA-Tang cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Jun 18 '25 09:06 mergify[bot]

@SpadeA-Tang cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

mergify[bot] avatar Jun 18 '25 12:06 mergify[bot]

This PR looks great to me, but we are planning to make ngram an explicitly built index on a given field instead of stats. It would be much more flexible if the user could create/load/drop the ngram index as needed. If the user must specify the min/max gram in the schema, it would be unchangeable once data is inserted.

How hard would it be to modify this? I assume it should be pretty easy?

zhengbuqian avatar Jun 22 '25 14:06 zhengbuqian

@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify[bot] avatar Jun 25 '25 04:06 mergify[bot]

@SpadeA-Tang cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Jun 25 '25 06:06 mergify[bot]

@SpadeA-Tang cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

mergify[bot] avatar Jun 25 '25 06:06 mergify[bot]

@SpadeA-Tang cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Jun 25 '25 12:06 mergify[bot]

@SpadeA-Tang cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

mergify[bot] avatar Jun 25 '25 15:06 mergify[bot]

@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify[bot] avatar Jun 27 '25 10:06 mergify[bot]

@SpadeA-Tang cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Jun 27 '25 10:06 mergify[bot]

@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify[bot] avatar Jun 30 '25 11:06 mergify[bot]

rerun go-sdk

SpadeA-Tang avatar Jun 30 '25 11:06 SpadeA-Tang