Bug: unsigned uint8 misbehaves when building an index
Describe the bug
Why do the index and distance calculations become all zeroes?
Steps to reproduce
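import numpy as np
import pandas as pd
from usearch.index import Index

index = Index(ndim=3)
a = np.uint8([
    [0, 0, 1],
    [0, 1, 2],
    [1, 2, 3],
])
index.add([0, 1, 2], a)
# Print the vectors as they are stored in the index
for i in range(3):
    print(index[i])
pd.DataFrame([r for r in index.search(a, 4)])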
Expected behavior
If you do this with DuckDB:
import duckdb

df = pd.DataFrame({"idx": [0, 1, 2], "vec": [v for v in a]})
duckdb.sql("""
    SELECT a.idx, b.idx, LIST_DISTANCE(a.vec, b.vec)
    FROM df a JOIN df b ON 1=1
""").df()
USearch version
2.17.7
Operating System
Amazon Linux
Hardware architecture
x86
Which interface are you using?
Python bindings
Contact Details
No response
Are you open to being tagged as a contributor?
- [ ] I am open to being mentioned in the project .githistory as a contributor
Is there an existing issue for this?
- [x] I have searched the existing issues
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
@liquidcarbon hey! Try explicitly setting the preferred metric and internal representation type in the constructor of the index 🤗
I've tried a few things; neither dtype nor metric seems to make a difference?
Looks like some rescaling is happening here:
Is it the same for types like f32 and f16?
Yes
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 1., 0.]], dtype=float32)
Additional context: I have a large Parquet dataset with a vector column written as 1024-dim np.uint8 vectors, of which typically around 50-100 values are non-zero.
I was trying to build an index with usearch, and the search results didn't make sense. Then I noticed that in the index there remained only a few (under 10) non-zero values in the vectors.
Amazon Linux 2023.6.20241010; r7i-large instance, if this helps
The reason for uint8 was to use feature counts; I have no intuition whether using counts is any better than using bits (seems to be the go-to method). But I figured one can always turn uint counts to bits, but not the other way around.
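To illustrate the one-way conversion from counts to bits, here is a minimal sketch using NumPy. It assumes "bits" means a presence flag per feature (any non-zero count becomes 1), packed eight flags per byte with `np.packbits`; `counts_to_bits` is a hypothetical helper name, and whether this packed layout matches what a binary index expects is an assumption to check against the USearch documentation.

```python
import numpy as np


def counts_to_bits(counts: np.ndarray) -> np.ndarray:
    """Binarize per-feature counts and pack 8 presence flags into each byte."""
    present = counts > 0  # any non-zero count becomes True
    # packbits pads the last axis with zero bits up to a multiple of 8
    return np.packbits(present, axis=-1)


counts = np.array([[0, 0, 1], [0, 1, 2], [1, 2, 3]], dtype=np.uint8)
bits = counts_to_bits(counts)

# Different count vectors with the same non-zero pattern collapse to the
# same bits, which is why the conversion cannot be reversed.
other = np.array([[0, 5, 9]], dtype=np.uint8)
same_bits = np.array_equal(counts_to_bits(other), bits[1:2])
print(bits.shape, same_bits)
```

Keeping the counts on disk and deriving bits on demand preserves the option to try both representations.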
Fun fact: uint8 causes trouble but int8 works
That's a good hint, @liquidcarbon! The u8 support was added somewhat recently, if I remember correctly, and some of the tests were not extended to cover it. Would you be able to extend the existing test_index.py tests for i8 to also have a u8 variant, and PR it?
I'll take a look, but if the root cause is on the C side I must bow out :)
I'll take over the C patches, but having it covered with tests on the Python will be a good starting point for me. Thanks, @liquidcarbon!