Bug: Index.get() returns inconsistent values for non-existent key

Open 45deg opened this issue 1 year ago • 3 comments

Describe the bug

When calling Index.get() with a key that doesn't exist in the index, it sometimes returns a vector of values instead of consistently returning None as specified in the official documentation.

Steps to reproduce

Code to Reproduce

# Python 3.12
from usearch.index import Index
import numpy as nd
index = Index(ndim=10)
index.add(1, nd.array([0.5]*10))
index.add(2, nd.array([0.4]*10))
print(index.contains(1), index.get(1))
print(index.contains(2), index.get(2))
print(index.contains(3), index.get(3))
print(index.contains(102030), index.get(102030))

Output:

True [0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
True [0.3984375 0.3984375 0.3984375 0.3984375 0.3984375 0.3984375 0.3984375
 0.3984375 0.3984375 0.3984375]
False [0.3984375 0.3984375 0.3984375 0.3984375 0.3984375 0.3984375 0.3984375
 0.3984375 0.3984375 0.3984375]
False [0.3984375 0.3984375 0.3984375 0.3984375 0.3984375 0.3984375 0.3984375
 0.3984375 0.3984375 0.3984375]

The last two lines should be False None

Expected behavior

As the documentation notes, it should return None if a single key is requested and it's not present.

USearch version

2.15.1

Operating System

macOS Sonoma

Hardware architecture

Arm

Which interface are you using?

Python bindings

Contact Details

No response

Are you open to being tagged as a contributor?

  • [X] I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

  • [X] I have searched the existing issues

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

45deg, Sep 25 '24 03:09

Interestingly, trying to access the key -1 gives values which change over time, while too-large keys seem to give the last embedding again (as in the reporter's case).

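# Imports assumed; the original snippet omitted them and mixed two module aliases.
import numpy as np
from usearch import index as uindex

x = uindex.Index(metric='IP', dtype=uindex.ScalarKind(12), ndim=3)
embs = np.float16(np.random.normal(scale=1e-4, size=[128, 3]))
x.add(np.arange(128), embs)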

for i in (-2, -1, 0, 127, 128, 256):
  print(i)
  print(x.get(i))
  print(x.get(i))
  if i < 128:
    print(embs[i])

Output:

-2
[5.555e-05 4.506e-05 6.932e-05]
[5.555e-05 4.506e-05 6.932e-05]
[ 2.307e-05  5.108e-05 -3.988e-05]
-1
[2.307e-05 5.108e-05 3.988e-05]
[2.307e-05 5.108e-05 3.988e-05]
[ 7.74e-05  8.04e-05 -9.01e-05]
0
[-6.193e-05  7.033e-06  1.405e-04]
[-6.193e-05  7.033e-06  1.405e-04]
[-6.193e-05  7.033e-06  1.405e-04]
127
[ 7.74e-05  8.04e-05 -9.01e-05]
[ 7.74e-05  8.04e-05 -9.01e-05]
[ 7.74e-05  8.04e-05 -9.01e-05]
128
[7.74e-05 8.04e-05 9.01e-05]
[7.74e-05 8.04e-05 9.01e-05]
256
[7.74e-05 8.04e-05 9.01e-05]
[7.74e-05 8.04e-05 9.01e-05]

sdenton4, Oct 15 '24 23:10

Root Cause Analysis

I've traced this bug to the C++ Python binding implementation in python/lib.cpp. The issue occurs in the get_typed_vectors_for_keys function when multi=False.

The Problem

When multi=False, the code allocates a numpy array and calls index.get() without checking if the key exists:

// python/lib.cpp lines 931-939
} else {
    py::array_t<external_at> result_py({keys_count, static_cast<Py_ssize_t>(index.scalar_words())});
    auto result_py2d = result_py.template mutable_unchecked<2>();
    for (Py_ssize_t task_idx = 0; task_idx != keys_count; ++task_idx) {
        dense_key_t key = *reinterpret_cast<dense_key_t const*>(keys_data + task_idx * keys_info.strides[0]);
        index.get(key, (internal_at*)&result_py2d(task_idx, 0), 1);  // ← Return value ignored!
    }
    return result_py;
}

The underlying C++ index.get() method returns false when a key doesn't exist (see index_dense.hpp lines 2194-2209), but this return value is completely ignored. This leaves uninitialized memory in the numpy array.

Compare this to the multi=True case (lines 913-930) which correctly handles non-existent keys:

if (!vectors_count) {
    results[task_idx] = py::none();
    continue;
}
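
The difference between the two branches can be modeled in a few lines of plain Python. This is only a toy sketch, not the binding's code: `STORE`, `native_get`, `get_ignoring_result`, and `get_checking_result` are hypothetical names, and the pre-filled `stale` list stands in for whatever an uninitialized buffer happens to contain.

```python
from typing import Optional

# Toy stand-in for the native index: key -> stored vector.
STORE = {1: [0.5, 0.5], 2: [0.4, 0.4]}


def native_get(key: int, out: list) -> bool:
    """Copy the vector for `key` into `out`; report whether the key was found."""
    vector = STORE.get(key)
    if vector is None:
        return False  # `out` is left untouched, as in the C++ call
    out[:] = vector
    return True


def get_ignoring_result(key: int, stale: list) -> list:
    """Mirrors the buggy branch: the found/not-found flag is dropped."""
    out = list(stale)  # stands in for an uninitialized buffer
    native_get(key, out)
    return out


def get_checking_result(key: int, stale: list) -> Optional[list]:
    """Mirrors the proposed direction: a missing key yields None."""
    out = list(stale)
    return out if native_get(key, out) else None


leftover = [0.4, 0.4]  # whatever happened to be in memory
print(get_ignoring_result(3, leftover))  # [0.4, 0.4]
print(get_checking_result(3, leftover))  # None
```

In the first call the caller cannot tell leftover bytes from a real vector, which matches the `False [0.3984375 ...]` lines in the original report.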

Proposed Fix

The multi=False branch should check the return value of index.get() and handle non-existent keys appropriately:

} else {
    // For single-key case, return the vector or None directly
    if (keys_count == 1) {
        dense_key_t key = *reinterpret_cast<dense_key_t const*>(keys_data);
        py::array_t<external_at> result_py({1, static_cast<Py_ssize_t>(index.scalar_words())});
        auto result_py2d = result_py.template mutable_unchecked<2>();
        bool found = index.get(key, (internal_at*)&result_py2d(0, 0), 1);
        if (!found) {
            return py::none();
        }
        return result_py[0];  // Return the single vector, not a 2D array
    }
    
    // For multiple keys, return a tuple with None for missing keys
    py::tuple results(keys_count);
    for (Py_ssize_t task_idx = 0; task_idx != keys_count; ++task_idx) {
        dense_key_t key = *reinterpret_cast<dense_key_t const*>(keys_data + task_idx * keys_info.strides[0]);
        py::array_t<external_at> result_py({1, static_cast<Py_ssize_t>(index.scalar_words())});
        auto result_py2d = result_py.template mutable_unchecked<2>();
        bool found = index.get(key, (internal_at*)&result_py2d(0, 0), 1);
        if (!found) {
            results[task_idx] = py::none();
        } else {
            results[task_idx] = result_py[0];
        }
    }
    return results;
}

This fix ensures:

  1. Single non-existent keys return None (not uninitialized memory)
  2. Multiple key queries return a tuple with None for non-existent keys
  3. Behavior is consistent between multi=True and multi=False modes
  4. Matches the documented API contract
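
A PR for this could carry a small regression check along these lines. It is a sketch under assumptions: `check_get_contract` is a hypothetical helper, it relies only on the `contains` and `get` methods used earlier in this thread, and it assumes `get` accepts a list of keys and returns one slot per key as described in item 2.

```python
def check_get_contract(index, present_key, missing_key):
    """Raise AssertionError if `index.get` breaks the contract listed above."""
    # A present key must yield a vector, never None.
    assert index.get(present_key) is not None

    # Item 1: a single missing key must yield None.
    assert not index.contains(missing_key)
    assert index.get(missing_key) is None

    # Item 2 (assumed batch form): one slot per key, None for the missing one.
    batch = index.get([present_key, missing_key])
    assert len(batch) == 2
    assert batch[0] is not None
    assert batch[1] is None
```

With the reproduction from the report, `check_get_contract(index, 1, 102030)` would be expected to fail on 2.15.1 and pass once the return value is checked.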

Why This Is Critical

This bug can lead to:

  • Security issues: Uninitialized memory might contain sensitive data
  • Unpredictable behavior: Values change between calls (as shown with key -1)
  • Silent data corruption: Applications may process garbage data thinking it's valid

The fix is straightforward - just check the return value that's already being provided by the underlying C++ method.
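
Until a fixed release is available, callers can guard lookups on the Python side. A minimal sketch, assuming `contains` keeps reporting missing keys correctly as it does in the outputs above; `get_or_none` is a hypothetical helper, not part of USearch:

```python
def get_or_none(index, key):
    """Return the stored vector for `key`, or None when the key is absent."""
    # Gate the lookup on `contains`, so `get` is never called for a
    # missing key and cannot hand back leftover buffer contents.
    if not index.contains(key):
        return None
    return index.get(key)
```

For example, `get_or_none(index, 102030)` returns None for the reporter's index. The check and the lookup are two separate calls, so this guard is not safe if another thread removes keys in between.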

titusz, Sep 14 '25 10:09

Nice suggestions, @titusz! Can you please open a PR? I’ll merge a bit later today 🤗

ashvardanian, Sep 14 '25 11:09