
[Roadmap] DGL General Roadmap

Open jermainewang opened this issue 3 years ago • 12 comments

To the entire DGL community,

It has been more than two years (33 months, to be exact) since I clicked the Make It Public button on the repo. At the time, I was not expecting more than a small codebase for playing with some fancy new models called Graph Neural Networks. Throughout the years, it has been amazing to see the project grow with the area of Graph Deep Learning (which happens to share the same initials as DGL), extending its scope from single-machine to distributed training and becoming the backbone of other packages and the foundation of exciting new research. But what honors me more is the wonderful community of contributors. It is your advice, questions, feedback, issue reports and PRs that have made this project thrive. As we head toward the third anniversary, it is time for us to think about the next stage for DGL: the first stable release. Of course, there is still a ton of work to be done before that happens, so we would like to share the plan with everyone so you can chime in with your thoughts.

[03/16/22] Updated the list according to the new v0.8 release.

Documentation

DGL v1.0 will provide a full-fledged set of documentation, including tutorials, a user guide, and an API reference for users ranging from beginners to experts. The major focus will be a set of HeteroGNN tutorials for beginners and advanced documentation for in-depth DGL developers.

  • [x] [Tutorial] Creating a heterogeneous graph dataset from CSV
  • [ ] [Tutorial] Heterograph node classification
  • [ ] [Tutorial] Heterograph link prediction
  • [ ] [Tutorial] Heterogeneous graph node/link prediction with sampling
  • [ ] [Tutorial] Writing a heterograph NN module
  • [x] [Tutorial] Distributed link prediction @ruisizhang123 #3993
  • [ ] [Tutorial] Inductive Learning with DGL
  • [ ] Developer guide
  • [x] General advice on using edge features
  • [x] [User guide] Implement custom graph sampler
  • [ ] Clean up the “Paper Study with DGL” tutorials. Update the outdated contents.
  • [x] [Blog] Feature attribution with Captum
  • [ ] [Blog] Spatial-temporal GNN models (e.g., for traffic network)
  • [ ] [Blog] GNN models on Discrete-time Dynamic Graphs / Continuous-time Dynamic Graphs. @otaviocx
  • [ ] #4367

GNN models, modules, samplers and others

See the separate roadmap #3849 .

Sampling infrastructure

Besides adding more sampling algorithms, we plan to improve the sampling pipeline in terms of both system efficiency and customizability.

  • [x] [Neighbor sampling] Make MultiLayerNeighborSampler support non-uniform sampling
  • [x] [Efficiency] NodeDataLoader/EdgeDataLoader/etc. interface change proposal to enable async CPU-GPU copy
  • [x] [Efficiency] Support DGLGraph async transfer between CPU and GPU on specified stream with pinned memory
  • [x] [Efficiency] Use unified buffer to accelerate feature fetching
  • [x] Exclude edges in sample_neighbors (https://github.com/dmlc/dgl/pull/2971)
  • [ ] Finalize the interface of GraphStorage and FeatureStorage (#3600 ).
  • [ ] Integrate cuGraph sampling pipeline
  • [ ] Integrate multi-GPU sampling (#3021 )
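As a concrete illustration of the non-uniform sampling item above, here is a minimal pure-Python sketch of weighted neighbor sampling without replacement. This is not DGL's implementation; `adj`, `weights`, and the function name are made up for this example, and it assumes the remaining weights always sum to a positive value:

```python
import random

def sample_neighbors_nonuniform(adj, weights, seed, fanout, rng=None):
    """Sample up to `fanout` in-neighbors of `seed` without replacement,
    with probability proportional to the weight of the connecting edge."""
    rng = rng or random.Random()
    nbrs = list(adj[seed])
    if len(nbrs) <= fanout:
        return sorted(nbrs)          # fewer neighbors than requested: keep all
    w = [weights[(n, seed)] for n in nbrs]
    picked = []
    for _ in range(fanout):          # weighted draw, then remove the winner
        r, acc = rng.random() * sum(w), 0.0
        for i, wi in enumerate(w):
            acc += wi
            if r < acc:              # strict '<' so zero-weight edges never win
                picked.append(nbrs.pop(i))
                w.pop(i)
                break
    return sorted(picked)

# Toy graph: nodes 1-3 all point into node 0 with different edge weights;
# node 2's edge has weight 0, so it can never be sampled.
adj = {0: [1, 2, 3]}
weights = {(1, 0): 1.0, (2, 0): 0.0, (3, 0): 5.0}
```

A `fanout` larger than the in-degree simply returns all neighbors, mirroring how fanout-based samplers usually degrade to full neighborhoods on low-degree nodes.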

Core infrastructure

  • [ ] Move half precision and mixed precision training out of experimental stage. Make it a default feature without the need to build from source.
  • [ ] Integrate the new CUDA gSpMM kernels.
  • [ ] Release dgl.sparse, a new backbone subpackage for sparse (adjacency) matrix and the related operators.
  • [ ] Add type annotations to all core APIs.
  • [ ] Use native PyTorch codepath for PyTorch backend (suggested by @yzh119):
    • [ ] Use PyTorch FFI system to register operators and custom data type (e.g., DGLGraph)
    • [ ] Write autograd in C++.

Distributed training infrastructure

  • [ ] Support distributed graph partitioning for link prediction (e.g., training on one set of edges but testing on others)
  • [x] Change RPC backbone to use tensorpipe
  • [ ] Replace the low-level communication stack with torch.distributed.
  • [x] Allow the graph server to live on after training processes finish. Allow new training process groups to connect to a running graph server.

Ecosystem

We want to see DGL being used by, and making use of, more and more amazing projects in the ecosystem.

  • [ ] cuGraph: https://github.com/rapidsai/cugraph #4166
  • [x] GNNVis: A Visual Analytics Approach for Prediction Error Diagnosis of Graph Neural Networks https://arxiv.org/abs/2011.11048. Released in https://github.com/dmlc/gnnlens2
  • [ ] Tensorboard
  • [x] AArch64 wheels (#3336)

Compiler

See the separate roadmap #3850 .

DGL-Go

See the separate roadmap #3912 .

jermainewang avatar Sep 15 '21 17:09 jermainewang

Some suggestions:

  1. Official support of half precision (users should find it in pip wheels rather than compiling the library themselves).
  2. For PyTorch backends, we should have better compatibility:
     • Rewrite autograd in C++.
     • Make DGLGraph compatible with TorchScript.
     • Integrate with TorchScript and JIT.
     • Use PyTorch's FFI system.
  3. On the source code side, we had better add type hints (and mypy checks if possible) so that users have a better auto-completion experience.
  4. Expose an spspmm interface to users in the form of sparse matrices instead of graphs.
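For context on the spspmm suggestion: spspmm is sparse-sparse matrix multiplication, and the ask is to expose it on sparse-matrix objects rather than on graphs. A minimal sketch of what the operation computes, using plain dicts keyed by `(row, col)` — illustrative only, not a proposed DGL API:

```python
from collections import defaultdict

def spspmm(a, b):
    """Multiply two sparse matrices given as {(row, col): value} dicts,
    returning the (sparse) product in the same format."""
    # Index B by row once, so each nonzero of A only touches matching nonzeros of B.
    b_rows = defaultdict(list)
    for (k, j), v in b.items():
        b_rows[k].append((j, v))
    out = defaultdict(float)
    for (i, k), u in a.items():
        for j, v in b_rows[k]:
            out[(i, j)] += u * v
    return dict(out)

# A = [[1, 0], [0, 2]], B = [[0, 3], [4, 0]] in dense form.
A = {(0, 0): 1.0, (1, 1): 2.0}
B = {(0, 1): 3.0, (1, 0): 4.0}
```

The point of the interface suggestion is exactly this shape: the inputs and output are matrices with explicit sparsity, not graph objects that happen to carry an adjacency.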

yzh119 avatar Sep 15 '21 23:09 yzh119

A couple more suggestions regarding C++:

  • C++ extension support: since we now have more and more dependencies and compiling already takes quite a while, it's time to allow people to extend DGL without compiling it. To do that, we will need to
    • Reorg (refactor?) the C++ code so that FFI-related code and C++ main code are separated.
    • Decide which C++ interfaces to expose.
  • Switch to PyBind11 (just adding to @yzh119's point) since (1) we now have more complex return values from C++ (lists are quite common), (2) people are more familiar with PyBind11, and (3) our FFI came from TVM's FFI, which has since advanced a lot (@VoVAllen could say more on this). This is related to supporting Torchscript and JIT.
  • Tensoradapter support for Tensorflow 2 eager mode.
  • Extend Tensoradapter to include more ops from PyTorch/Tensorflow. This can be a stepping stone for JITting UDFs/samplers.

Regarding temporal graph support:

  • Better partial updates: this is essential for dynamic graph models like (1) JODIE, which updates the hidden states of the incident nodes of a single edge, and (2) TGN, which performs message passing on only the historical graph (before the timestamp of current edge).
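The partial-update pattern described above fits in a few lines: an edge event touches only the hidden states of its two incident nodes, and everything else stays put. A toy illustration of the JODIE-style update, with a stand-in update function rather than a real recurrent cell:

```python
def apply_event(states, src, dst, update_fn):
    """Update only the hidden states of the two nodes incident to an
    edge event; all other node states are left untouched."""
    states[src], states[dst] = update_fn(states[src], states[dst])
    return states

# Stand-in for an RNN/GRU cell: each endpoint absorbs the other's state.
mix = lambda h_u, h_v: (h_u + h_v, h_v + h_u)

states = {0: 1.0, 1: 10.0, 2: 100.0}
states = apply_event(states, 0, 1, mix)   # node 2 is never touched
```

The efficiency question in the bullet is whether the framework can express this as an in-place scatter on two rows of a state tensor, instead of rebuilding or copying the full node-feature array per event.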

Regarding sampling:

  • DistDGL now has a neighbor sampler working entirely in C with multithreading. I would like to see it integrated, but the problem is that it will complicate DGL's codebase: using NeighborSampler goes through the efficient C implementation, while other samplers stick with the current Python implementation. While this is OK for industrial purposes, I think this will strain maintainability quite a bit, since we will (1) maintain both C-multithreading and Python-multiprocessing sampling code, and (2) translate Python-multiprocessing code to C-multithreading code every time we receive a request.
    • I feel that an elegant solution would be JITting samplers, where users write Python code and DGL translates it to some bytecode for lightweight interpretation in C.

BarclayII avatar Sep 21 '21 17:09 BarclayII

Hi, it is very exciting to read all these points. Regarding “Subgraph extraction: k-hop subgraph” and the “[Subgraph sampling] SubgraphDataLoader interface proposal”, I would like to share how I did it for my pipeline. If anyone is interested in applications, this is a nice paper that explores that.

For edge prediction, I implemented my own k-hop subgraph sampling around a node pair in a heterograph. Some features were important to me:

  • n_nodes (input, int): how many nodes in the subgraph;
  • min_node_per_type (input, dict): quotas I would like to meet per node_type (my graph is very unbalanced; without this, I risk not having some desired node_types in the subgraph);
  • dist_from_center (output, tuple): distances of each node from the center node pair.

To do it, I did the following steps:

  1. Convert the graph to homogeneous;
  2. Remove the edge between the node pair (especially important for calculating dist_from_center);
  3. Generate a subgraph for each center node;
  4. Get out_edges() to find neighbors (each iteration is a hop);
  5. If they were already visited, ignore them;
  6. Sample up to n_nodes, trying to meet min_node_per_type, until n_nodes or the k-th hop is reached;
  7. Merge both subgraphs.

Then, I repeat this process in the new merged subgraph, but for each hop iteration, I save the distance from the center node. This is necessary because after the subgraphs are merged, the distances can change.
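The procedure above can be sketched as a level-by-level BFS that records hop distances and, when truncating, visits node types whose quota is still unmet first. This is a plain-Python sketch under assumed data structures (an adjacency dict and an `ntype` map), not the commenter's actual implementation:

```python
from collections import Counter

def khop_subgraph(adj, ntype, centers, k, n_nodes, min_node_per_type):
    """BFS out to k hops from the center pair, keeping at most n_nodes
    nodes and visiting under-quota node types first at each hop."""
    dist = {c: 0 for c in centers}            # node -> hops from a center
    counts = Counter(ntype[c] for c in centers)
    frontier = list(centers)
    for hop in range(1, k + 1):
        candidates = [v for u in frontier for v in adj[u] if v not in dist]
        # Types still below their quota sort first (False < True).
        candidates.sort(
            key=lambda v: counts[ntype[v]] >= min_node_per_type.get(ntype[v], 0))
        frontier = []
        for v in candidates:
            if len(dist) >= n_nodes:
                return dist                   # node budget exhausted
            if v in dist:
                continue                      # duplicate within this hop
            dist[v] = hop
            counts[ntype[v]] += 1
            frontier.append(v)
    return dist

# Toy heterograph flattened to a homogeneous adjacency, centers 0 and 1.
adj = {0: [2, 3], 1: [3, 4], 2: [], 3: [5], 4: [], 5: []}
ntype = {0: "user", 1: "user", 2: "item", 3: "item", 4: "item", 5: "tag"}
dist = khop_subgraph(adj, ntype, [0, 1], k=2, n_nodes=10, min_node_per_type={})
```

Because distances are assigned during a single BFS over the merged neighborhood, the returned `dist` plays the role of `dist_from_center` and does not need the post-merge recomputation pass.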

fmello01 avatar Sep 22 '21 17:09 fmello01

Regarding the wishlist, it might be worth having some examples/utils for heuristics and score functions for link prediction, e.g., common neighbors, resource allocation, etc.
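For reference, two of the mentioned heuristics are only a few lines each when the graph is stored as neighbor sets. A minimal sketch (illustrative code, not DGL utilities):

```python
def common_neighbors(adj, u, v):
    """Common-neighbors score: |N(u) ∩ N(v)|."""
    return len(adj[u] & adj[v])

def resource_allocation(adj, u, v):
    """Resource-allocation index: sum of 1/deg(w) over common neighbors w,
    so low-degree shared neighbors contribute more evidence for a link."""
    return sum(1.0 / len(adj[w]) for w in adj[u] & adj[v])

# Toy undirected graph as neighbor sets.
adj = {
    0: {1, 2},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {1, 2},
}
```

Shipping these as utilities would also give examples a cheap, strong baseline to compare GNN link predictors against.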

mufeili avatar Oct 11 '21 04:10 mufeili

Thank you all for the awesome work on this repo!

Any idea about the release date of DGL v1.0?

DomInvivo avatar Dec 08 '21 15:12 DomInvivo

Hi @jermainewang, thanks for sharing this! I'm currently doing my PhD and working with Dynamic Graphs. I would like to make myself available to help with the documentation related to that. I think these topics are related to my research and I may be helpful with them:

  • [Blog] GNN models on Discrete-time Dynamic Graphs.
  • [Blog] GNN models on Continuous-time Dynamic Graphs.
  • [Tutorial] Heterogeneous graph node/link prediction with sampling

Please let me know how I can contribute. Thanks in advance!

otaviocx avatar Jan 07 '22 00:01 otaviocx

@otaviocx That's awesome! Let us sync on this matter.

jermainewang avatar Jan 10 '22 07:01 jermainewang

Hi @jermainewang, thanks for the awesome roadmap. I'm currently working on distributed graph training and I think I can help with the following features:

  • [Tutorial] Distributed link prediction
  • Fix distributed graph partitioning to train GNNs on large heterogeneous graphs.
  • Support multiple distributed graphs for training and validation in link prediction tasks.

Please let me know if I could help. :)

ruisizhang123 avatar Jan 14 '22 06:01 ruisizhang123

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot] avatar Feb 14 '22 01:02 github-actions[bot]


Can we have #4668 on the 1.0 release roadmap as well?

mfbalin avatar Sep 30 '22 17:09 mfbalin

@mfbalin Sure. Let's move the discussion to the PR.

jermainewang avatar Oct 10 '22 07:10 jermainewang

@zyj-111 is interested in making a contribution for "[Blog] Spatial-temporal GNN models (e.g., for traffic network)", do we still plan to do that? @BarclayII @jermainewang @frozenbugs

mufeili avatar Nov 03 '22 06:11 mufeili

Awesome. Let's follow this up on our slack.

jermainewang avatar Nov 03 '22 08:11 jermainewang

Hi, what is the status of making DGLGraph compatible with Torchscript and the integration with JIT? I came across this thread and am not sure whether this feature is available yet. (It was noted here.) Thanks :)

Sids2k avatar Feb 02 '23 15:02 Sids2k

Hi @Sids2k, it is currently a work in progress. Our first goal is to make the recently released dgl.sparse package jittable. See the discussion in https://github.com/dmlc/dgl/issues/5275

jermainewang avatar Feb 09 '23 14:02 jermainewang

Closed as 1.0 has been delivered. We will open a new thread for collecting feature requests and call for contributions.

jermainewang avatar Mar 02 '23 10:03 jermainewang