lshash icon indicating copy to clipboard operation
lshash copied to clipboard

locality sensitive hashing (LSHASH) for Python3

LSHash

:Version: 0.0.9 :Python: 3.11.5

A fast Python implementation of locality sensitive hashing with persistance support.

Based on original source code https://github.com/kayzhu/LSHash

Highlights

  • Python3 support
  • Load & save hash tables to local disk
  • Fast hash calculation for large amount of high dimensional data through the use of numpy arrays.
  • Built-in support for persistency through Redis.
  • Multiple hash indexes support.
  • Built-in support for common distance/objective functions for ranking outputs.

Installation

LSHash depends on the following libraries:

  • numpy
  • bitarray (if hamming distance is used as distance function)

Optional

  • redis (if persistency through Redis is needed)

To install from sources:

.. code-block:: bash

$ git clone https://github.com/loretoparisi/lshash.git
$ python setup.py install

To install from PyPI:

.. code-block:: bash

$ pip install lshashpy3
$ python -c "import lshashpy3 as lshash; print(lshash.__version__);"

Quickstart

To create 6-bit hashes for input data of 8 dimensions:

.. code-block:: python

create 6-bit hashes for input data of 8 dimensions:

lsh = LSHash(6, 8)

index vector

lsh.index([2,3,4,5,6,7,8,9])

index vector and extra data

lsh.index([10,12,99,1,5,31,2,3], extra_data="vec1") lsh.index([10,11,94,1,4,31,2,3], extra_data="vec2")

query a data point

top_n = 1 nn = lsh.query([1,2,3,4,5,6,7,7], num_results=top_n, distance_func="euclidean") print(nn)

unpack vector, extra data and vectorial distance

top_n = 3 nn = lsh.query([10,12,99,1,5,30,1,1], num_results=top_n, distance_func="euclidean") for ((vec,extra_data),distance) in nn: print(vec, extra_data, distance)

To save hash table to disk:

.. code-block:: python

lsh = LSHash(hash_size=k, input_dim=d, num_hashtables=L, storage_config={ 'dict': None }, matrices_filename='weights.npz', hashtable_filename='hash.npz', overwrite=True)

lsh.index([10,12,99,1,5,31,2,3], extra_data="vec1") lsh.index([10,11,94,1,4,31,2,3], extra_data="vec2") lsh.save()

To load hash table from disk and perform a query:

.. code-block:: python

lsh = LSHash(hash_size=k, input_dim=d, num_hashtables=L, storage_config={ 'dict': None }, matrices_filename='weights.npz', hashtable_filename='hash.npz', overwrite=True)

top_n = 3 nn = lsh.query([10,12,99,1,5,30,1,1], num_results=top_n, distance_func="euclidean") print(nn)

API

  • To initialize a LSHash instance:

.. code-block:: python

k = 6 # hash size L = 5 # number of tables d = 8 # Dimension of Feature vector LSHash(hash_size=k, input_dim=d, num_hashtables=L, storage_config={ 'dict': None }, matrices_filename='weights.npz', hashtable_filename='hash.npz', overwrite=True)

parameters:

hash_size: The length of the resulting binary hash. input_dim: The dimension of the input vector. num_hashtables = 1: (optional) The number of hash tables used for multiple lookups. storage = None: (optional) Specify the name of the storage to be used for the index storage. Options include "redis". matrices_filename = None: (optional) Specify the path to the .npz file random matrices are stored or to be stored if the file does not exist yet hashtable_filename = None: (optional) Specify the path to the .npz file hash table are stored or to be stored if the file does not exist yet overwrite = False: (optional) Whether to overwrite the matrices file if it already exist

  • To index a data point of a given LSHash instance, e.g., lsh:

.. code-block:: python

lsh.index(input_point, extra_data=None):

parameters:

input_point: The input data point is an array or tuple of numbers of input_dim. extra_data = None: (optional) Extra data to be added along with the input_point.

  • To query a data point against a given LSHash instance, e.g., lsh:

.. code-block:: python

lsh.query(query_point, num_results=None, distance_func="euclidean"):

parameters:

query_point: The query data point is an array or tuple of numbers of input_dim. num_results = None: (optional) The number of query results to return in ranked order. By default all results will be returned. distance_func = "euclidean": (optional) Distance function to use to rank the candidates. By default "euclidean" distance function will be used. Distance function can be "euclidean", "true_euclidean", "centred_euclidean", "cosine", "l1norm".

  • To save the hash table currently indexed:

.. code-block:: python

lsh.save():