hnswlib
Add other data storage types to Python bindings.
This is a bit of a big one - happy to break this into smaller PRs if that would be useful. This pull request:
- Adds new Python bindings for indexes with different data storage types:
  - `DoubleIndex` (float64)
  - `Int8Index`
  - `UInt8Index`
  - `Int16Index`
  - `UInt16Index`
- Adds another template argument, `data_t`, to subclasses of `SpaceInterface`, to specify the data storage type used.
- Changes `get_items` to return a NumPy array instead of a `List[List[data_t]]`.
- Adds templated `InnerProduct` and `L2Sqr` distance comparison functions that auto-unroll and auto-vectorize their inner loops (see Godbolt). This allows us to use different data types without manually having to write every comparison function. (This might make the manual SIMD functions obsolete, although I've left them in there for now out of an abundance of caution.)
- Extends test coverage to cover these new classes.
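For readers who want a concrete picture of the templated distance functions described in the list above, here is a minimal sketch. It is not the code from this PR: the names `l2_sqr_sketch` and `inner_product_sketch`, and the separate `dist_t` accumulator parameter, are assumptions made for illustration. The idea is only that a plain loop over `data_t` gives the compiler something it can unroll and vectorize on its own.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch, not the PR's implementation.
// data_t is the storage type (e.g. float, std::int8_t);
// dist_t is the accumulator type, chosen by the caller so that
// narrow integer inputs are widened before arithmetic.
template <typename dist_t, typename data_t>
dist_t l2_sqr_sketch(const data_t* a, const data_t* b, std::size_t dim) {
    dist_t sum = 0;
    for (std::size_t i = 0; i < dim; ++i) {
        // Widen each element first, then subtract and square.
        const dist_t diff = static_cast<dist_t>(a[i]) - static_cast<dist_t>(b[i]);
        sum += diff * diff;
    }
    return sum;
}

template <typename dist_t, typename data_t>
dist_t inner_product_sketch(const data_t* a, const data_t* b, std::size_t dim) {
    dist_t sum = 0;
    for (std::size_t i = 0; i < dim; ++i) {
        // Widen each element first, then multiply and accumulate.
        sum += static_cast<dist_t>(a[i]) * static_cast<dist_t>(b[i]);
    }
    return sum;
}
```

Whether a loop like this actually vectorizes depends on the compiler and flags. Integer reductions are usually vectorized at higher optimization levels, while floating-point reductions often need relaxed floating-point settings because reordering the additions changes rounding.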
I'm not 100% sure if this is the best way to architect this API; in particular, should the data type be a property of `Index`, set at creation time, rather than a separate class of `Index`? (For now, it's the latter.)
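To make the two API shapes in that question concrete, here is a rough sketch of both on the C++ side. Every name in it (`IndexBase`, `TypedIndexSketch`, `make_index_sketch`, the `"float32"` and `"int8"` strings) is hypothetical and exists only for illustration; none of it is taken from this PR or from the library.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical common interface shared by all storage types.
struct IndexBase {
    virtual ~IndexBase() = default;
    virtual std::size_t bytes_per_element() const = 0;
};

// One template, instantiated once per storage type.
template <typename data_t>
struct TypedIndexSketch : IndexBase {
    std::size_t bytes_per_element() const override { return sizeof(data_t); }
};

// Shape 1: a separate class per storage type (the approach taken for now).
using FloatIndexSketch = TypedIndexSketch<float>;
using Int8IndexSketch = TypedIndexSketch<std::int8_t>;

// Shape 2: a single entry point, with the storage type picked at creation time.
inline std::unique_ptr<IndexBase> make_index_sketch(const std::string& dtype) {
    if (dtype == "float32") return std::make_unique<TypedIndexSketch<float>>();
    if (dtype == "int8") return std::make_unique<TypedIndexSketch<std::int8_t>>();
    throw std::invalid_argument("unsupported dtype: " + dtype);
}
```

Both shapes can share the same templated implementation underneath; the difference is only in what the caller sees, so switching from one to the other later should mostly be a bindings-level change.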
The existing tests seem to pass, and the new data types allow for smaller index files on disk (in situations where reduced precision is acceptable).
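As a back-of-the-envelope illustration of where the size reduction comes from, the raw vector payload scales with `sizeof(data_t)`. The helper below is a hypothetical sketch (the name `vector_bytes` is not from this PR), and it counts only the stored vector data; an index file also contains graph links and metadata, so the whole file shrinks by less than this ratio.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes needed to store one vector of `dim`
// elements of storage type data_t (vector payload only, no graph links).
template <typename data_t>
constexpr std::size_t vector_bytes(std::size_t dim) {
    return dim * sizeof(data_t);
}
```

For example, with 1024 dimensions, `vector_bytes<float>(1024)` is 4096 bytes on platforms with a 4-byte `float`, while `vector_bytes<std::int8_t>(1024)` is 1024 bytes.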
The new `int8` and `uint8` data types even seem to perform about 60% faster in the best case (when using 1024 dimensions).
Hmm, it looks like this PR is failing on Windows but passing on Ubuntu (and on macOS, where I'm testing locally). I'll dig into that.
@psobot Curious if you are still working on this? With this change, do you see any performance impact on the old fp32 distance computation?
@psobot: IIUC, for the new storage types (uint8, uint16, ...), it seems we rely on compiler vectorization; we don't have support for explicit vectorized code like we do for fp32?