Bug: unsigned uint8 misbehaves when building an index

Open liquidcarbon opened this issue 8 months ago • 10 comments

Describe the bug

Why do the index contents and distance calculations become all zeroes?

Steps to reproduce

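import numpy as np
import pandas as pd
from usearch.index import Index

index = Index(ndim=3)  # default metric and dtype
a = np.uint8([
    [0, 0, 1],
    [0, 1, 2],
    [1, 2, 3],
])
index.add([0,1,2], a)
for i in range(3):
    print(index[i])  # inspect the vectors as stored in the index
pd.DataFrame([r for r in index.search(a, 4)])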

Image

Expected behavior

If you do this with DuckDB:

import duckdb

# Reuses `a` and `pd` from the snippet above.
df = pd.DataFrame({"idx": [0,1,2], "vec": [v for v in a]})
duckdb.sql("""
SELECT a.idx, b.idx, LIST_DISTANCE(a.vec, b.vec)
FROM df a JOIN df b ON 1=1
""").df()

Image
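For reference, the expected pairwise distances can be checked without either library. This is a minimal sketch that assumes LIST_DISTANCE is plain Euclidean distance; the helper name `pairwise_euclidean` is made up for illustration and is not part of usearch or DuckDB.

```python
import math
from itertools import product

# The three vectors from the report, as plain Python lists.
vectors = [
    [0, 0, 1],
    [0, 1, 2],
    [1, 2, 3],
]

def pairwise_euclidean(rows):
    """Return {(i, j): distance} for every ordered pair of rows."""
    return {
        (i, j): math.dist(rows[i], rows[j])
        for i, j in product(range(len(rows)), repeat=2)
    }

distances = pairwise_euclidean(vectors)
for (i, j), d in sorted(distances.items()):
    print(i, j, round(d, 4))
```

Only the diagonal pairs (a vector against itself) come out as zero; every off-diagonal pair is strictly positive, so an all-zero result is not expected for this input.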

USearch version

2.17.7

Operating System

Amazon Linux

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

Are you open to being tagged as a contributor?

  • [ ] I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

  • [x] I have searched the existing issues

Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

liquidcarbon avatar Apr 22 '25 16:04 liquidcarbon

@liquidcarbon hey! Try explicitly setting the preferred metric and internal representation type in the constructor of the index 🤗

ashvardanian avatar Apr 22 '25 16:04 ashvardanian

I've tried a few things; neither dtype nor metric seems to make a difference?

Image

Looks like some rescaling is happening here:

Image

liquidcarbon avatar Apr 22 '25 16:04 liquidcarbon

Is it the same for types like f32 and f16?

ashvardanian avatar Apr 22 '25 16:04 ashvardanian

Yes

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 1., 0.]], dtype=float32)

liquidcarbon avatar Apr 22 '25 16:04 liquidcarbon

Additional context: I have a large Parquet dataset with a vector column written as 1024-dim np.uint8 vectors, in which typically around 50-100 values are non-zero.

I was trying to build an index with usearch, and the search results didn't make sense. Then I noticed that in the index there remained only a few (under 10) non-zero values in the vectors.

Amazon Linux 2023.6.20241010; r7i.large instance, if this helps

liquidcarbon avatar Apr 22 '25 16:04 liquidcarbon

The reason for uint8 was to use feature counts; I have no intuition whether using counts is any better than using bits (seems to be the go-to method). But I figured one can always turn uint counts to bits, but not the other way around.
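A minimal sketch of why that conversion only goes one way, in plain Python. The function name `counts_to_bits` and the sample values are made up for illustration.

```python
def counts_to_bits(counts):
    """Map each feature count to 1 if the feature is present, else 0."""
    return [1 if c > 0 else 0 for c in counts]

# Two different count vectors (made-up values)...
x = [0, 3, 1, 0]
y = [0, 1, 7, 0]

# ...collapse to the same bit vector, so the counts cannot be recovered.
bits_x = counts_to_bits(x)
bits_y = counts_to_bits(y)
print(bits_x, bits_y, bits_x == bits_y)
```

Keeping the counts on disk therefore leaves both options open, while storing only bits would discard the magnitudes.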

liquidcarbon avatar Apr 22 '25 18:04 liquidcarbon

Fun fact: uint8 causes trouble but int8 works

liquidcarbon avatar Apr 22 '25 20:04 liquidcarbon

That's a good hint, @liquidcarbon! The u8 support was added somewhat recently, if I remember correctly, and some of the tests were not extended to cover it. Would you be able to extend the existing test_index.py tests for i8 to also have a u8 variant, and PR it?

ashvardanian avatar Apr 22 '25 21:04 ashvardanian

I'll take a look, but if the root cause is on the C side I must bow out :)

liquidcarbon avatar Apr 22 '25 21:04 liquidcarbon

I'll take over the C patches, but having it covered with tests on the Python side will be a good starting point for me. Thanks, @liquidcarbon!

ashvardanian avatar Apr 22 '25 21:04 ashvardanian