SNMG ANN
This PR adds a distributed (single-node, multi-GPU) implementation of ANN indexes. It allows building, extending, and searching an index across multiple GPUs.
Before building the index, the user has to choose between two modes:

- Sharding mode: the index dataset is split, and each GPU builds its own index on its share of the data. This is intended to increase both search throughput and the maximum size of the index.
- Index duplication mode: the index is built once on a GPU and then copied over to the others. Alternatively, the index dataset is sent to each GPU to be built there. This is intended to increase search throughput.
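For intuition, here is a small self-contained sketch of how dataset rows end up distributed across GPUs in each mode. It is illustrative only, not cuVS code; `assign_rows` and `distribution_mode` are made-up names:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

enum class distribution_mode { sharded, replicated };

// For each GPU, compute the (offset, count) slice of dataset rows it receives.
// Sharded mode: each GPU gets a roughly equal, disjoint share of the rows.
// Replicated (index duplication) mode: every GPU sees the full dataset.
std::vector<std::pair<std::size_t, std::size_t>> assign_rows(std::size_t n_rows,
                                                             int num_gpus,
                                                             distribution_mode mode)
{
  std::vector<std::pair<std::size_t, std::size_t>> slices;
  if (mode == distribution_mode::replicated) {
    for (int g = 0; g < num_gpus; g++) { slices.emplace_back(0, n_rows); }
    return slices;
  }
  std::size_t base   = n_rows / num_gpus;
  std::size_t rem    = n_rows % num_gpus;
  std::size_t offset = 0;
  for (int g = 0; g < num_gpus; g++) {
    // The first `rem` GPUs take one extra row so all rows are covered.
    std::size_t count = base + (static_cast<std::size_t>(g) < rem ? 1 : 0);
    slices.emplace_back(offset, count);
    offset += count;
  }
  return slices;
}
```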
SNMG indexes can be serialized and deserialized. A local (single-GPU) index can also be deserialized and then deployed in index duplication mode.
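Conceptually, deploying a local index in duplication mode boils down to placing a copy of the same single-GPU index on each device. A toy sketch of that idea, with a made-up `local_index` type and no cuVS API implied:

```cpp
#include <cstddef>
#include <vector>

struct local_index { /* stand-in for a per-GPU ANN index */ };

// Once a local index has been deserialized from disk, index duplication mode
// amounts to placing one copy on each GPU so every device can answer
// searches independently, trading memory for query throughput.
std::vector<local_index> replicate(const local_index& idx, int num_gpus)
{
  return std::vector<local_index>(static_cast<std::size_t>(num_gpus), idx);
}
```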
Migrated from https://github.com/rapidsai/raft/pull/1993
By default, replicated mode splits the query batch and gives an equal share to each GPU. However, as discussed, to serve "online" use cases we could instead detect single-row queries and have the index act as a load balancer in that case. I think most things should work fine here as long as the user calls the search function from different threads. However, the search as implemented right now systematically sends such a query to the first GPU in the clique. We could make the attribution random instead, but then what would happen if a local index were used twice in parallel? We could maybe add a thread-safe (mutex-protected) structure to the MG index to guarantee safe access to the local indexes; a sketch of what I mean follows below. Do you guys have insights on this?
cc @cjnolet @tfeher @achirkin
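To make the mutex idea concrete, an illustrative sketch (not the PR's code; `mg_index` and `local_index` are made-up names):

```cpp
#include <mutex>
#include <vector>

struct local_index { /* stand-in for a per-GPU ANN index */ };

class mg_index {
 public:
  explicit mg_index(int num_gpus) : indexes_(num_gpus), locks_(num_gpus) {}

  // Run `search` against the local index on `device`, serializing concurrent
  // callers so the same local index is never used twice in parallel.
  template <typename SearchFn>
  void search_on(int device, SearchFn&& search)
  {
    std::lock_guard<std::mutex> guard(locks_[device]);
    search(indexes_[device]);
  }

 private:
  std::vector<local_index> indexes_;
  std::vector<std::mutex> locks_;  // one mutex per local index
};
```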
> We could make the attribution random instead, but then what would happen if a local index were used twice in parallel? We could maybe add a thread-safe (mutex-protected) structure to the MG index to guarantee safe access to the local indexes. Do you guys have insights on this?
Honestly, rather than making a special case for single-query batches, I think it'd be fine to just use an atomic counter and always send the next query to the next device. You could do this, for example, by incrementing a counter up to a particular max (e.g. the number of GPUs) and resetting it back to 0 each time it hits that max. This should be well documented, of course, but it would give users the ability to batch queries themselves, and the behavior would be consistent across all batch sizes.
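For what it's worth, a minimal sketch of that counter (illustrative only, not the PR's code); the modulo handles the "reset back to 0 at the max" wrap-around:

```cpp
#include <atomic>
#include <cstdint>

class round_robin_selector {
 public:
  explicit round_robin_selector(int num_gpus) : num_gpus_(num_gpus) {}

  // Safe to call from any number of search threads: each call hands back the
  // next device id in round-robin order.
  int next_device()
  {
    return static_cast<int>(counter_.fetch_add(1, std::memory_order_relaxed) %
                            static_cast<std::uint64_t>(num_gpus_));
  }

 private:
  std::atomic<std::uint64_t> counter_{0};
  int num_gpus_;
};
```

With this, each search call lands on exactly one device and successive calls rotate across the clique, so the behavior stays consistent across all batch sizes.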
/ok to test
/merge
/ok to test
The CI error here appears to have been a connection error:
[rapids-conda-retry] Exiting, no retryable mamba errors detected: 'ChecksumMismatchError:', 'ChunkedEncodingError:', 'CondaHTTPError:', 'CondaMultiError:', 'Connection broken:', 'ConnectionError:', 'DependencyNeedsBuildingError:', 'EOFError:', 'JSONDecodeError:', 'Multi-download failed', 'Timeout was reached', segfault exit code 139
/ok to test
/merge