milvus [Bug]: INVERTED scalar filter has low precision in query/search

Is there an existing issue for this?

[X] I have searched the existing issues

Environment

- Milvus version: 2.4
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.4.0
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory: indexnode: 4x(2 CPU, 2GB); querynode: 2x(8CPU, 32GB)
- GPU: No
- Others:

Current Behavior

When running client.query(..., expr="my_ind == 1"), where my_ind is an integer field (tested with int16 and int32) and its index is INVERTED, only a small (though statistically significant) fraction of the returned rows satisfy the condition. Typical precision is 20-40% (with a 10% underlying density). With STL_SORT or no index, precision is 100%.

Expected Behavior

Either query(..., expr="my_ind == 1") should have 100% precision, or the documentation should be updated to describe the expected behavior.
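For reference, "precision" here means the fraction of returned rows that actually satisfy the filter. It can be checked client-side without any ground truth. The following is a minimal sketch, assuming rows arrive as dicts keyed by field name (as in the reproduction script below); the helper name `filter_precision` and the sample rows are hypothetical and not part of pymilvus:

```python
def filter_precision(rows, field, predicate):
    """Return the fraction of rows whose `field` value satisfies `predicate`."""
    rows = list(rows)
    if not rows:
        # An empty result contains no row that violates the filter.
        return 1.0
    matching = sum(1 for row in rows if predicate(row[field]))
    return matching / len(rows)


# Hypothetical query result for the filter "inv_index == 1".
sample_rows = [
    {"inv_index": 1},
    {"inv_index": 7},
    {"inv_index": 1},
    {"inv_index": 42},
]
precision = filter_precision(sample_rows, "inv_index", lambda v: v == 1)
print(f"precision = {precision:.0%}")
```

A correct scalar filter should always give 1.0 here, regardless of which index backs the field.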

Steps To Reproduce

from pymilvus import FieldSchema, CollectionSchema, DataType, MilvusClient
import numpy as np

idx = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True)
vector = FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=128)
no_index = FieldSchema(name="no_index", dtype=DataType.INT16)
default_index = FieldSchema(name="default_index", dtype=DataType.INT16)
inv_index = FieldSchema(name="inv_index", dtype=DataType.INT16)
stl_index = FieldSchema(name="stl_index", dtype=DataType.INT16)
schema = CollectionSchema(fields=[idx, vector, no_index, default_index, inv_index, stl_index], auto_id=True)
client = MilvusClient()
client.drop_collection("index_test")
client.create_collection("index_test", schema=schema)

# Create (or remove) indices
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="default_index",
    index_name="default_index"
)
index_params.add_index(
    field_name="inv_index",
    index_type="INVERTED",
    index_name="inv_index"
)
index_params.add_index(
    field_name="stl_index",
    index_type="STL_SORT",
    index_name="stl_index"
)
index_params.add_index(
    field_name="vector",
    index_type="IVF_SQ8",
    metric_type="L2",
    params={"nlist": 128},
)
client.create_index(
  collection_name="index_test",
  index_params=index_params
)
client.drop_index("index_test", "no_index")

# Make the collection large enough that the indexes are used
for _ in range(10000):
    data = []
    for _ in range(100):
        data.append(
            {
                "vector": np.random.rand(128),
                "no_index": np.random.randint(1000),
                "default_index": np.random.randint(1000),
                "inv_index": np.random.randint(1000),
                "stl_index": np.random.randint(1000),
            }
        )
    client.insert(
        "index_test",
        data=data,
    )

for key in ["no_index", "default_index", "stl_index", "inv_index"]:
    filt = f"{key} in {[i for i in range(1, 1000, 10)]}"
    client.load_collection("index_test")
    all_rows = client.query(
        "index_test",
        limit=128,
        output_fields=["no_index", "default_index", "inv_index", "stl_index"],
        filter=filt,
    )
    # Count the returned rows that actually satisfy the filter (values ending in 1)
    correct_rows = [row[key] for row in all_rows if row[key] % 10 == 1]
    print(f"Index {key}: Total of {len(all_rows)} rows")
    print(f"Index {key}: Total of {len(correct_rows)} correct rows")



### Milvus Log

_No response_

### Anything else?

Based on these results, I believe this documentation is also wrong, and that the default scalar index for v2.4 is `INVERTED`: https://milvus.io/docs/scalar_index.md#Default-indexing

Apr 29 '24 17:04 ghallsimpsons

/assign @longjiquan Please help take a look. Meanwhile, I will try to reproduce it in house.

Apr 30 '24 09:04 yanliang567

@ghallsimpsons Shouldn't you use the same random number for the different fields? Otherwise, how did you specify your ground truth? Both indexes should have 100% recall.

May 05 '24 03:05 xiaofan-luan

Hi @xiaofan-luan, thanks for helping look into this. There is no ground truth here per se, except for what I am requesting via the query. That is, if I perform a search and add the filter inv_index == 1, I would expect every returned row to have inv_index == 1. This is true of the STL_SORT index and the no-index case, but not of the INVERTED index.
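The distinction between the two measures can be made concrete. Precision only needs the returned rows, while recall needs the full set of rows that should match, which is what a shared ground truth would provide. A minimal sketch in plain Python with no Milvus calls; the helper name `precision_and_recall` and all sample data are hypothetical:

```python
def precision_and_recall(returned_ids, matching_ids):
    """Compare the ids a filter returned with the ids that truly match."""
    returned = set(returned_ids)
    matching = set(matching_ids)
    true_positives = len(returned & matching)
    # Precision: share of returned rows that satisfy the filter.
    precision = true_positives / len(returned) if returned else 1.0
    # Recall: share of truly matching rows that were returned.
    recall = true_positives / len(matching) if matching else 1.0
    return precision, recall


# Hypothetical collection: primary key -> scalar field value.
values = {0: 1, 1: 5, 2: 1, 3: 9, 4: 1}
ground_truth = {pk for pk, v in values.items() if v == 1}

# Hypothetical output of a faulty "value == 1" filter.
returned = [0, 1, 2, 3]

p, r = precision_and_recall(returned, ground_truth)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

When a `limit` is set, as in the reproduction script, recall is capped by the limit, so precision is the more relevant measure for this report.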

May 06 '24 18:05 ghallsimpsons

Could you share your code and the results you get?

May 07 '24 02:05 xiaofan-luan

I have reproduced the issue in house with the code above.

Index no_index: Total of 128 rows
Index no_index: Total of 128 correct rows
Index default_index: Total of 128 rows
Index default_index: Total of 49 correct rows
Index stl_index: Total of 128 rows
Index stl_index: Total of 128 correct rows
Index inv_index: Total of 128 rows
Index inv_index: Total of 49 correct rows
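Converting the counts above to precision gives roughly 38% for both the default and the INVERTED index, which falls within the 20-40% range in the original report. The arithmetic, as a short sketch that uses only the numbers printed above:

```python
# (total rows, correct rows) copied from the output above.
counts = {
    "no_index": (128, 128),
    "default_index": (128, 49),
    "stl_index": (128, 128),
    "inv_index": (128, 49),
}

# Precision = correct rows / total rows returned.
precision = {name: correct / total for name, (total, correct) in counts.items()}

for name, value in precision.items():
    print(f"{name}: {value:.1%}")
```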

We can see that when filtering on the field with the INVERTED index, the query returns some rows whose values are not in the filter list.

May 07 '24 03:05 yanliang567

Thanks for reporting the bug, @ghallsimpsons. It is already fixed in https://github.com/milvus-io/milvus/pull/32858

May 08 '24 08:05 longjiquan

Very nice, thanks for the quick fix! I'll give it a go again when 2.4.2 is released.

May 08 '24 17:05 ghallsimpsons
