
[Enhancement] Supporting new version of dlpack

charlifu opened this issue 3 years ago · 3 comments

🚀 Feature

PyTorch's C++ API has updated its DLPack version to 0.6, but DGL still depends on the old version (0.2).

Motivation

According to DGL's compiler infra roadmap #3850, we need to reimplement DGL's sparse operations and their gradient operators using PyTorch's C++ extension. To implement the Graphiler optimizations, we also have to register those sparse operations with TorchScript, which means we need to rewrite the autograd logic in C++. For more details, please refer to this prototype.
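As a rough, hedged sketch of what registering a sparse operation with TorchScript through PyTorch's C++ extension API could look like (the operator name, signature, and body below are illustrative placeholders, not DGL's actual kernels):

#include <torch/library.h>
#include <torch/torch.h>

// Illustrative placeholder standing in for a DGL sparse kernel.
torch::Tensor spmm_sum(torch::Tensor indptr, torch::Tensor indices,
                       torch::Tensor values, torch::Tensor dense) {
  // A real implementation would dispatch into DGL's aten kernels.
  return dense;
}

TORCH_LIBRARY(dgl_sparse, m) {
  // Declares the schema so TorchScript can see dgl_sparse::spmm_sum.
  m.def("spmm_sum(Tensor indptr, Tensor indices, Tensor values, Tensor dense) -> Tensor");
}

TORCH_LIBRARY_IMPL(dgl_sparse, CPU, m) {
  // Binds the CPU implementation to the declared schema.
  m.impl("spmm_sum", spmm_sum);
}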

To achieve this goal, it is inevitable to include header files from both PyTorch's C++ API (e.g. ATen/DLConvertor.h) and DGL's core C++ library (e.g. aten/coo.h) in the same C++ source file. But the two rely on different versions of dlpack, which leads to compilation errors.

The most critical difference is that in the new version, DLTensor's second member has been renamed from ctx to device. But NDArray, the core abstraction of DGL's low-level implementation, still uses ctx when dealing with DLTensor. For example:

inline NDArray NDArray::Clone(const DGLStreamHandle &stream) const {
  CHECK(data_ != nullptr);
  const DLTensor* dptr = operator->();
  return this->CopyTo(dptr->ctx, stream);
}
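For comparison, the same method written against DLPack >= 0.4, where the member is named device, would look roughly like this (a sketch only, not an actual DGL patch):

inline NDArray NDArray::Clone(const DGLStreamHandle &stream) const {
  CHECK(data_ != nullptr);
  const DLTensor* dptr = operator->();
  // In DLPack >= 0.4 the field formerly called `ctx` is named `device`.
  return this->CopyTo(dptr->device, stream);
}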

Alternatives

We could still patch around this by adding a macro #define ctx device, just like PR #3803. But that approach looks too dangerous and could introduce more issues.
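For illustration, the macro workaround boils down to something like the following sketch (the included header is just an example); it blindly renames every ctx token in the translation unit, which is exactly why it is risky:

// Danger: this renames *every* occurrence of the `ctx` token, not only
// DLTensor's member, in everything included afterwards.
#define ctx device
#include <dgl/runtime/ndarray.h>  // example DGL header that still uses ->ctx
#undef ctx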

@jermainewang @BarclayII @VoVAllen @yzh119

charlifu avatar Jul 20 '22 04:07 charlifu

How much effort would it take to rewrite all appearances of ->ctx?

yzh119 avatar Jul 20 '22 06:07 yzh119

It's not so much about the effort as about potential compatibility issues. @BarclayII Why didn't we upgrade the whole dlpack dependency in DGL instead of making a patch? I remember there were some backward compatibility issues.

VoVAllen avatar Jul 20 '22 07:07 VoVAllen

I think PyTorch only introduced DLPack 0.6 in 1.11.0+ (I can't remember), and we need to support multiple PyTorch versions older than that, meaning we need to support DLPack 0.4 and 0.6 simultaneously (and to answer @yzh119's question, we will need to support ->ctx and ->device at the same time).
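One hedged sketch of supporting both spellings at once is a small compatibility shim keyed on a build-time flag (DGL_DLPACK_HAS_DEVICE below is hypothetical; the build system would define it when compiling against DLPack >= 0.4):

#include <dlpack/dlpack.h>

// DGL_DLPACK_HAS_DEVICE is a hypothetical flag set by the build system.
#ifdef DGL_DLPACK_HAS_DEVICE
typedef DLDevice DGLCompatDevice;   // DLPack >= 0.4
#else
typedef DLContext DGLCompatDevice;  // DLPack <= 0.2
#endif

// Single accessor so the rest of the code never touches the field name.
inline DGLCompatDevice GetTensorDevice(const DLTensor* t) {
#ifdef DGL_DLPACK_HAS_DEVICE
  return t->device;
#else
  return t->ctx;
#endif
}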

BarclayII avatar Jul 25 '22 07:07 BarclayII

The DLContext -> DLDevice renaming happened in DLPack 0.4 and was adopted by PyTorch 1.9 (the minimum version we are supporting). So I believe supporting ->device is enough.

This is also necessary for #4333.

I'd like to create a draft for review and testing.

yaox12 avatar Aug 23 '22 03:08 yaox12

@yaox12 I think the preferred implementation is to decouple dgl::runtime::NDArray from the DLPack struct. Currently, it internally holds a DLPack pointer. To avoid a potentially huge refactor down the road, I think we should use our own structure so that ->device vs. ->ctx won't matter. The ToDLPack and FromDLPack functions should be responsible for getting/setting attributes from/to DLPack structs.

The other benefit is that we can then support different versions of DLPack easily, since the differences will be limited to the conversion functions.
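A minimal sketch of this decoupling, with illustrative names (DGLDeviceInfo, the container fields, and the conversion signature below are assumptions, not DGL's actual internals), could look like:

#include <dlpack/dlpack.h>
#include <cstdint>

// DGL-owned device descriptor, independent of DLPack's field naming.
struct DGLDeviceInfo {
  int32_t device_type;  // mirrors DLDeviceType values
  int32_t device_id;
};

// NDArray's container would keep only DGL-owned fields, no DLTensor inside.
struct NDArrayContainer {
  void* data;
  DGLDeviceInfo device;
  int ndim;
  int64_t* shape;
  DLDataType dtype;
  // strides / byte_offset omitted for brevity
};

// Only the conversion function touches DLPack's field names, so switching
// between `ctx` and `device` is confined to this one spot.
inline DLTensor ToDLPack(const NDArrayContainer& c) {
  DLTensor t;
  t.data = c.data;
  t.device.device_type = static_cast<DLDeviceType>(c.device.device_type);  // `ctx` in DLPack <= 0.2
  t.device.device_id = c.device.device_id;
  t.ndim = c.ndim;
  t.dtype = c.dtype;
  t.shape = c.shape;
  t.strides = nullptr;
  t.byte_offset = 0;
  return t;
}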

jermainewang avatar Aug 23 '22 09:08 jermainewang