milvus
feat: optimize `Like` query with n-gram
Ref #42053
This is the first PR for optimizing LIKE queries with an n-gram inverted index.
Currently, only the VARCHAR data type and InnerMatch LIKE queries (`%xxx%`) are supported.
How to use it:
from pymilvus import MilvusClient, DataType

milvus_client = MilvusClient("http://localhost:19530")
schema = milvus_client.create_schema()
...
schema.add_field("content_ngram", DataType.VARCHAR, max_length=10000, enable_ngram_index=True, min_gram=2, max_gram=4)
...
milvus_client.create_collection(COLLECTION_NAME, ...)
`min_gram` and `max_gram` control how documents are tokenized. For example, with min_gram=2 and max_gram=4, each document is tokenized into 2-grams, 3-grams and 4-grams.
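To make the tokenization concrete, here is a minimal pure-Python sketch of how an n-gram inverted index can serve an inner-match LIKE. It is an illustration only, not the code in this PR: the helper names (`ngrams`, `build_index`, `inner_match`) are hypothetical, grams are taken over Python characters, and the fallback to a full scan for literals shorter than `min_gram` is an assumption about one reasonable behavior.

```python
from collections import defaultdict


def ngrams(text: str, min_gram: int, max_gram: int) -> set[str]:
    """Return every substring of `text` whose length is in [min_gram, max_gram]."""
    return {
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    }


def build_index(docs: list[str], min_gram: int, max_gram: int) -> dict[str, set[int]]:
    """Map each gram to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for gram in ngrams(doc, min_gram, max_gram):
            index[gram].add(doc_id)
    return index


def inner_match(docs, index, literal: str, min_gram: int, max_gram: int) -> list[int]:
    """Evaluate LIKE '%literal%' and return the matching document ids."""
    if len(literal) < min_gram:
        # Assumption: a literal too short to form a gram cannot use the
        # index, so this sketch falls back to scanning every document.
        return [i for i, doc in enumerate(docs) if literal in doc]

    # Split the literal into grams of the largest indexed size that fits.
    n = min(len(literal), max_gram)
    grams = [literal[i:i + n] for i in range(len(literal) - n + 1)]

    # A matching document must contain every gram of the literal.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))

    # Containing all grams does not guarantee they are contiguous,
    # so each candidate is verified against the original text.
    return sorted(i for i in candidates if literal in docs[i])


docs = ["milvus vector database", "inverted index", "vector index"]
index = build_index(docs, min_gram=2, max_gram=4)

print(sorted(ngrams("abcd", 2, 4)))  # ['ab', 'abc', 'abcd', 'bc', 'bcd', 'cd']
print(inner_match(docs, index, "vector", 2, 4))  # [0, 2]
```

The sketch suggests the trade-off behind the two parameters: a larger `max_gram` gives more selective posting lists at the cost of a bigger index, while `min_gram` sets the shortest literal that can use the index at all.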
@SpadeA-Tang
Invalid PR Title Format Detected
Your PR submission does not adhere to our required standards. To ensure clarity and consistency, please meet the following criteria:
- Title Format: The PR title must begin with one of these prefixes:
  - feat: for introducing a new feature.
  - fix: for bug fixes.
  - enhance: for improvements to existing functionality.
  - test: for adding tests to existing functionality.
  - doc: for modifying documentation.
  - auto: for pull requests from a bot.
- Description Requirement: The PR must include a non-empty description, detailing the changes and their impact.
Required Title Structure:
[Type]: [Description of the PR]
Where Type is one of feat, fix, enhance, test or doc.
Example:
enhance: improve search performance significantly
Please review and update your PR to comply with these guidelines.
@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.
@SpadeA-Tang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.
@SpadeA-Tang cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.
Codecov Report
Attention: Patch coverage is 87.45981% with 78 lines in your changes missing coverage. Please review.
Project coverage is 78.86%. Comparing base (d4260b4) to head (fb4c641). Report is 32 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #41803 +/- ##
==========================================
- Coverage 80.46% 78.86% -1.60%
==========================================
Files 1551 1557 +6
Lines 221379 222101 +722
==========================================
- Hits 178128 175162 -2966
- Misses 36860 40466 +3606
- Partials 6391 6473 +82
| Components | Coverage Δ | |
|---|---|---|
| Client | 79.39% <ø> (+0.14%) | :arrow_up: |
| Core | 73.76% <88.79%> (+0.89%) | :arrow_up: |
| Go | 79.82% <69.04%> (-2.15%) | :arrow_down: |
| Files with missing lines | Coverage Δ | |
|---|---|---|
| ...ernal/core/src/exec/expression/BinaryRangeExpr.cpp | 87.05% <100.00%> (-0.11%) | :arrow_down: |
| internal/core/src/exec/expression/UnaryExpr.h | 77.69% <ø> (ø) | |
| internal/core/src/index/IndexFactory.h | 100.00% <ø> (ø) | |
| internal/core/src/index/IndexInfo.h | 100.00% <ø> (ø) | |
| ...ernal/core/src/index/JsonKeyStatsInvertedIndex.cpp | 92.30% <ø> (ø) | |
| ...ternal/core/src/segcore/ChunkedSegmentSealedImpl.h | 54.76% <100.00%> (+3.47%) | :arrow_up: |
| internal/core/src/segcore/SegmentInterface.h | 73.46% <ø> (ø) | |
| internal/core/src/segcore/SegmentSealed.h | 93.33% <ø> (ø) | |
| internal/core/src/segcore/load_index_c.cpp | 42.77% <100.00%> (+19.67%) | :arrow_up: |
| internal/core/src/storage/DiskFileManagerImpl.cpp | 66.12% <100.00%> (+1.66%) | :arrow_up: |
| ... and 17 more | | |
@SpadeA-Tang cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.
This PR looks great to me, but we are planning to make ngram an explicitly built index on a given field, instead of stats. It would be much more flexible if the user could create/load/drop the ngram index as needed. If the user must specify the min/max gram in the schema, it would be unchangeable once data is inserted.
How hard would it be to modify this? I assume it should be pretty easy?
rerun go-sdk