
[Requirement] Add python DLPack interface for HPS lookup

nv-dlasalle opened this issue 2 years ago • 3 comments

Is your feature request related to a problem? Please describe.
In HPS, for GNNs it would be great to have a Python lookup function which could take in indices on the GPU and return the gathered rows, also on the GPU.

Describe the solution you'd like
It would be great to have a Python function taking in DLPack objects (https://dmlc.github.io/dlpack/latest/python_spec.html), for example:

```python
def lookup_from_dlpack(indices, out):
    """Gather the rows associated with the indices into the tensor out.

    Parameters:

    indices : DLPack capsule
        The input indices on the GPU used to fetch the corresponding rows.
    out : DLPack capsule
        The output memory location of the lookup; should be of shape
        [indices.shape[0], embedding_size].
    """
```

The inputs don't have to be DLPack capsules themselves, but they should at least be objects convertible to DLPack (e.g., PyTorch tensors, CuPy arrays, etc.).
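For illustration, a minimal sketch of how such a function could be called from PyTorch; `lookup_from_dlpack` is the proposed (hypothetical) API, and the shapes and dtypes are assumptions:

```python
import torch
from torch.utils.dlpack import to_dlpack

# Hypothetical usage of the proposed API; lookup_from_dlpack does not exist yet.
indices = torch.randint(0, 1_000_000, (4096,), dtype=torch.int64, device="cuda")
out = torch.empty((indices.shape[0], 128), dtype=torch.float32, device="cuda")

# Both tensors stay on the GPU; only DLPack capsules cross the API boundary.
lookup_from_dlpack(to_dlpack(indices), to_dlpack(out))
```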

Describe alternatives you've considered
The alternative is to create a second library to wrap the C++ interface of HPS and provide the above function there.

Additional context
This would be used both in training and inference of GNNs at large scale.

nv-dlasalle · May 25 '22

Thanks for your feedback! Support for DLPack will be included in a future release, but the limitations of the Python interface need to be clarified first.

  • The Python interface provided in HPS is an encapsulation of the important native HPS C++ APIs. Its purpose is to let users quickly and easily verify the logic of HPS for a POC or demo; it is not intended for production environments.

  • Python's dynamic interpretation limits the native C++ performance, especially for conversions between different tensor types.

  • Due to Python's shortcomings for concurrent programming, it is difficult to implement complex deployment scenarios through the Python interface (e.g., the CUDA initialization issue for multi-process/multi-thread on multi-GPU; see the sketch after this list). Some loss of scalability and performance is therefore inevitable.
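As an added sketch of the multi-process pattern that last limitation refers to (using PyTorch's torch.multiprocessing; the hps_worker function and its body are hypothetical):

```python
import torch
import torch.multiprocessing as mp

def hps_worker(rank):
    # Hypothetical per-GPU worker: each process binds one GPU. CUDA must not be
    # initialized in the parent before the workers start, which is why the
    # "spawn" start method is used rather than "fork".
    torch.cuda.set_device(rank)
    # ... create a per-process HPS lookup session and serve requests here ...

if __name__ == "__main__":
    # torch.multiprocessing.spawn uses the "spawn" start method by default.
    mp.spawn(hps_worker, nprocs=torch.cuda.device_count())
```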

Due to its hierarchical structure design, HPS is more naturally suited to custom integration into general-purpose inference platforms, which means complex deployment scenarios can be customized to improve inference performance. Since we decoupled HPS in 22.05, it can be used or encapsulated as an independent library. The recommended integration method is therefore a customized integration for each inference platform, such as a TensorRT plugin or a TF custom op, so that the inference performance of HPS on the target platform can be maximized.

yingcanw · May 31 '22

@yingcanw Thanks for the feedback. Let me give you some more background on how this would be used for GNNs. We would use it both for training and inference, as the input to the network is a subgraph plus the embeddings for the nodes and/or edges in that subgraph.

Typically, for a given mini-batch, we fetch the input features associated with the input nodes of our subgraph. The number of input nodes usually ranges from 100 thousand to 1 million, and the input dimension ranges from 64 to 4096. This means the output of a lookup would be in the range of 64 MB to 16 GB (a few GB would probably be most common).
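As an added back-of-the-envelope illustration of where such sizes come from (assuming float32 embeddings at 4 bytes per element; the exact figures depend on the datatype and batch composition):

```python
def lookup_output_bytes(num_indices, embedding_dim, bytes_per_elem=4):
    """Size of the gathered output buffer for a dense [num_indices, dim] layout."""
    return num_indices * embedding_dim * bytes_per_elem

print(lookup_output_bytes(100_000, 64) / 1e6)      # 25.6  -> tens of MB, small end
print(lookup_output_bytes(1_000_000, 4096) / 1e9)  # 16.384 -> ~16 GB, large end
```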

We will also have 1 Python process per GPU. The GIL isn't usually an issue, since most work is performed asynchronously and Python is just used to schedule it. The ability to pass in a stream ID when performing a lookup would be useful in allowing computational work to overlap with the lookup, but it is not required at this point.
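For illustration, a sketch of how a stream argument could be used to overlap the lookup with other work, using PyTorch streams; the `stream` keyword on `lookup_from_dlpack` is part of the hypothetical API, not an existing one:

```python
import torch
from torch.utils.dlpack import to_dlpack

indices = torch.randint(0, 1_000_000, (4096,), dtype=torch.int64, device="cuda")
out = torch.empty((indices.shape[0], 128), dtype=torch.float32, device="cuda")

lookup_stream = torch.cuda.Stream()
with torch.cuda.stream(lookup_stream):
    # Hypothetical call: the lookup is enqueued on lookup_stream, so kernels
    # launched on the default stream below can run concurrently with it.
    lookup_from_dlpack(to_dlpack(indices), to_dlpack(out),
                       stream=lookup_stream.cuda_stream)

# ... launch independent compute on the default stream here ...

# Synchronize before consuming `out` on the default stream.
torch.cuda.current_stream().wait_stream(lookup_stream)
```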

nv-dlasalle · Jun 03 '22

@nv-dlasalle Thanks for your detailed background info. If my understanding is correct, you will basically not use complex HPS deployment scenarios (just 1 Python process per GPU, due to the large input batch size), so there will be no issue like CUDA initialization for multi-process/multi-thread on multi-GPU. Therefore, by supporting DLPack-format tensors, HPS will indeed provide more convenient platform compatibility for offline inference with DLPack capsule input. We will support the DLPack interface for HPS lookup in the next release.

yingcanw · Jun 06 '22

The DLPack interface has been supported since version 22.07.

yingcanw · Dec 07 '22