milvus
[Bug]: INVERTED scalar filter has low precision in query/search
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: 2.4
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.4.0
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory: indexnode: 4x(2 CPU, 2GB); querynode: 2x(8CPU, 32GB)
- GPU: No
- Others:
Current Behavior
When running `client.query(..., expr="my_ind == 1")`, where `my_ind` is of int type (tested with int16 and int32) and the index is `INVERTED`, only a small (though statistically significant) fraction of the results satisfy the condition. Typical precision is 20-40% (with a 10% underlying density). `STL_SORT` and no index both have 100% precision.
Expected Behavior
Either `query(..., expr="my_ind == 1")` should have 100% precision, or the documentation should be updated to describe the expected behavior.
Steps To Reproduce
from pymilvus import FieldSchema, CollectionSchema, DataType, MilvusClient
import numpy as np
idx = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True)
vector = FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=128)
no_index = FieldSchema(name="no_index", dtype=DataType.INT16)
default_index = FieldSchema(name="default_index", dtype=DataType.INT16)
inv_index = FieldSchema(name="inv_index", dtype=DataType.INT16)
stl_index = FieldSchema(name="stl_index", dtype=DataType.INT16)
schema = CollectionSchema(fields=[idx, vector, no_index, default_index, inv_index, stl_index], auto_id=True)
client = MilvusClient()
client.drop_collection("index_test")
client.create_collection("index_test", schema=schema)
# Create (or remove) indices
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="default_index",
    index_name="default_index"
)
index_params.add_index(
    field_name="inv_index",
    index_type="INVERTED",
    index_name="inv_index"
)
index_params.add_index(
    field_name="stl_index",
    index_type="STL_SORT",
    index_name="stl_index"
)
index_params.add_index(
    field_name="vector",
    index_type="IVF_SQ8",
    metric_type="L2",
    params={"nlist": 128},
)
client.create_index(
    collection_name="index_test",
    index_params=index_params
)
client.drop_index("index_test", "no_index")
# Make the collection large enough that the indexes are used
for _ in range(10000):
    # Insert in batches of 100 rows
    data = []
    for _ in range(100):
        data.append(
            {
                "vector": np.random.rand(128),
                "no_index": np.random.randint(1000),
                "default_index": np.random.randint(1000),
                "inv_index": np.random.randint(1000),
                "stl_index": np.random.randint(1000),
            }
        )
    client.insert(
        "index_test",
        data=data,
    )
for key in ["no_index", "default_index", "stl_index", "inv_index"]:
    # Filter on the values 1, 11, 21, ..., 991
    filt = f"{key} in {[i for i in range(1, 1000, 10)]}"
    client.load_collection("index_test")
    all_rows = client.query(
        "index_test",
        limit=128,
        output_fields=["no_index", "default_index", "inv_index", "stl_index"],
        filter=filt,
    )
    # A returned row is correct only if its value ends in 1
    correct_rows = [row[key] for row in all_rows if row[key] % 10 == 1]
    print(f"Index {key}: Total of {len(all_rows)} rows")
    print(f"Index {key}: Total of {len(correct_rows)} correct rows")
### Milvus Log
_No response_
### Anything else?
Based on these results, I believe this documentation is also wrong, and that the default scalar index for v2.4 is `INVERTED`: https://milvus.io/docs/scalar_index.md#Default-indexing
/assign @longjiquan Please help take a look. Meanwhile, I will try to reproduce it in-house.
@ghallsimpsons Should you use the same random number for the different fields? Otherwise, how did you specify your ground truth? Both indexes should have 100% recall.
Hi ~xiaofan-luan, thanks for helping look into this. There is no ground truth here per se, except for what I am requesting via the query. That is, if I perform a search and add the filter `inv_index == 1`, I would expect every returned row to have `inv_index == 1`. This is true of the STL index and the no-index case, but not for the inverted index.
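The check described here needs no ground truth: it only verifies that each returned row satisfies the filter it was requested with. A minimal sketch in plain Python, where `rows_violating_filter` and the sample rows are hypothetical and stand in for the list of dicts that `client.query` returns:

```python
def rows_violating_filter(rows, key, allowed_values):
    """Return the rows whose `key` value is not among `allowed_values`."""
    allowed = set(allowed_values)
    return [row for row in rows if row[key] not in allowed]


# Hypothetical sample shaped like the query output (a list of dicts).
sample_rows = [
    {"inv_index": 1},
    {"inv_index": 7},
    {"inv_index": 1},
]

bad_rows = rows_violating_filter(sample_rows, "inv_index", [1])
print(f"{len(bad_rows)} of {len(sample_rows)} rows violate inv_index == 1")
```

An empty result from this helper is what the filter semantics promise; any non-empty result points to the behavior reported in this issue.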
Could you share your code and the results you get?
I have reproduced the issue in-house with the code above.
Index no_index: Total of 128 rows
Index no_index: Total of 128 correct rows
Index default_index: Total of 128 rows
Index default_index: Total of 49 correct rows
Index stl_index: Total of 128 rows
Index stl_index: Total of 128 correct rows
Index inv_index: Total of 128 rows
Index inv_index: Total of 49 correct rows
We can see that when filtering on the inverted-index field, the query returns some results that are not in the filter list.
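To put the counts above into the terms used in the original report, the precision per field can be computed directly from the logged totals. This is only a sketch of the arithmetic; the `precision` helper and the `counts` dict are illustrative names, and the density assumes values drawn uniformly from 0 to 999 as in the reproduction script:

```python
def precision(correct, returned):
    """Fraction of returned rows that satisfy the filter."""
    return correct / returned if returned else 0.0


# (correct rows, returned rows) per field, copied from the log above.
counts = {
    "no_index": (128, 128),
    "default_index": (49, 128),
    "stl_index": (128, 128),
    "inv_index": (49, 128),
}

# The filter lists 1, 11, ..., 991: 100 of the 1000 possible values.
density = len(range(1, 1000, 10)) / 1000

for name, (correct, returned) in counts.items():
    print(f"{name}: precision {precision(correct, returned):.1%}")
print(f"filter density: {density:.1%}")
```

For `inv_index` and `default_index` this gives 49/128, roughly 38%, which is above the 10% density but far below the 100% seen for `no_index` and `stl_index`.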
Thanks for reporting the bug, @ghallsimpsons. It is already fixed in https://github.com/milvus-io/milvus/pull/32858
Very nice, thanks for the quick fix! I'll give it a go again when 2.4.2 is released.