[BUG] CAGRA graph build copies dataset multiple times
Describe the bug
When a dataset that already resides in device memory is supplied to cagraIndexBuild (or the equivalent C++ functions), for example because it was copied there earlier (e.g. to remove strides as a workaround for #1455) or because it is the output of another GPU computation such as quantization, a new copy of the dataset is created in device memory. The build therefore needs at least twice the dataset's size on the device, so cagraIndexBuild is likely to fail for larger datasets.
Steps/Code to reproduce bug
On a GPU with X GB of memory, supply to cagraIndexBuild a DLManagedTensor of device type kDLCUDA (i.e. the dataset pointer is a device address) whose size is roughly 0.6 * X GB, a bit more than half the available RAM (e.g. a 13 GB dataset on a 24 GB GPU). cagraIndexBuild fails with an out-of-memory error.
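A minimal sketch of the setup: the DLPack construction below is standard, but the dataset dimensions are made up to hit ~13 GB, and the build call is left as a comment because the exact C API entry points (index/params creation, the cagraIndexBuild signature) vary between releases:

```c
/* Repro sketch: a compact row-major float32 dataset in device memory,
 * wrapped in a DLManagedTensor with device type kDLCUDA. */
#include <cuda_runtime.h>
#include <dlpack/dlpack.h>
#include <stdint.h>

int main(void) {
  /* ~13 GB of float32 on a 24 GB GPU: 3.5M rows x 960 dims (made-up sizes) */
  int64_t n_rows = 3500000, dim = 960;
  float* d_data = NULL;
  cudaMalloc((void**)&d_data, (size_t)n_rows * dim * sizeof(float));
  /* ... d_data filled by an earlier GPU stage (copy, quantization, ...) ... */

  int64_t shape[2] = {n_rows, dim};
  DLManagedTensor dataset = {
      .dl_tensor = {.data        = d_data,
                    .device      = {.device_type = kDLCUDA, .device_id = 0},
                    .ndim        = 2,
                    .dtype       = {.code = kDLFloat, .bits = 32, .lanes = 1},
                    .shape       = shape,
                    .strides     = NULL, /* compact row-major */
                    .byte_offset = 0}};

  /* cagraIndexBuild(res, params, &dataset, &index);
   *   ^ exact signature varies by release; the call allocates a second
   *     device copy of the dataset internally, so it fails with OOM here. */
  (void)dataset;
  cudaFree(d_data);
  return 0;
}
```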
Expected behavior
cagraIndexBuild should provide guidance or a mechanism to avoid this:
- an option to skip the copy, accepting the performance hit, could be acceptable
- instructions on how to shape the data (pre-strided/padded) so the copy can be avoided and a "fast path" taken (see the sketch after this list)
Any way to avoid the issue and let more data fit and be processed on the GPU would help.
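To illustrate the second bullet, a hedged sketch of what "pre-padded" could mean on the caller's side: allocate each row with the pitch the search kernels expect, so the library could wrap a strided view instead of copying. The 16-byte alignment figure is an assumption for illustration; the real requirement (if any) is exactly what this issue asks to have documented:

```c
/* Hypothetical pre-padded allocation; ASSUMED_ROW_ALIGN is a made-up
 * value, not a documented CAGRA requirement. */
#include <cuda_runtime.h>
#include <stdint.h>

#define ASSUMED_ROW_ALIGN 16 /* bytes; assumption for illustration only */

/* Allocates n_rows rows of dim floats, each row padded to the assumed
 * alignment. Returns the device pointer and the row stride in elements. */
float* alloc_padded_rows(int64_t n_rows, int64_t dim, int64_t* stride_elems) {
  size_t row_bytes = (size_t)dim * sizeof(float);
  size_t pitch =
      (row_bytes + ASSUMED_ROW_ALIGN - 1) / ASSUMED_ROW_ALIGN * ASSUMED_ROW_ALIGN;
  float* d_ptr = NULL;
  cudaMalloc((void**)&d_ptr, (size_t)n_rows * pitch);
  *stride_elems = (int64_t)(pitch / sizeof(float));
  return d_ptr; /* row i starts at d_ptr + i * (*stride_elems) */
}
```

A DLManagedTensor over such a buffer would carry strides = {stride_elems, 1}; if the build path recognized an already-padded layout, it could wrap a view instead of allocating a copy.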
Additional context
Another issue that recently came to light is that nn-descent wants to copy the whole dataset into float16. That is separate from the strided/padded dataset issue, but it is another source of GPU memory pressure from extra dataset copies (for the ~13 GB float32 example above, a half-precision copy adds roughly another 6.5 GB).
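To make the concern concrete, a back-of-envelope footprint calculation, assuming (worst case) that the caller's buffer, the internal float32 copy, and the nn-descent float16 copy are all live at once; whether they actually overlap in time depends on the implementation and is an assumption here:

```c
/* Worst-case device memory footprint for one build of the ~13 GB example. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
  int64_t n_rows = 3500000, dim = 960; /* same made-up ~13 GB dataset */
  double gib = 1024.0 * 1024.0 * 1024.0;
  double input_f32 = (double)n_rows * dim * 4; /* caller's device buffer */
  double copy_f32 = input_f32;                 /* internal build copy    */
  double copy_f16 = (double)n_rows * dim * 2;  /* nn-descent half copy   */
  printf("input %.1f + copy %.1f + fp16 %.1f = %.1f GiB\n", input_f32 / gib,
         copy_f32 / gib, copy_f16 / gib,
         (input_f32 + copy_f32 + copy_f16) / gib);
  return 0;
}
```

That prints roughly 12.5 + 12.5 + 6.3 = 31.3 GiB, well beyond a 24 GB card even before the graph itself and working memory are counted.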