DistDGL doesn't create the mapping between etype/ntype and etype/ntype IDs correctly.
🐛 Bug
When DistDGL loads the graph partitions into memory, it loads the mapping between ntypes/etypes and their IDs as dicts from the JSON file. Afterwards, it converts the dicts to Python lists, where the index into the list is the ntype/etype ID. However, the current implementation adds the ntypes/etypes to the list in the order they appear in the dict and ignores the ntype/etype IDs stored in the dict. This wrong mapping causes a problem if users train on one graph and run inference on another, and the training and inference graphs store the ntypes/etypes in their dicts in different orders.
The right fix is to build the ntype/etype list so that it is indexable by the ntype/etype IDs stored in the dict.
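A minimal sketch of the behavior described above (the function and variable names here are illustrative, not DGL's actual implementation): building the list from dict order breaks when two graphs store the same types in different orders, while placing each name at the index given by its ID is order-independent.

```python
def build_list_wrong(type_map):
    # BUG (as described above): relies on dict insertion order and
    # ignores the IDs stored as dict values.
    return list(type_map.keys())

def build_list_right(type_map):
    # Fix: place each ntype/etype at the index given by its ID, so the
    # resulting list is indexable by the IDs from the JSON file.
    names = [None] * len(type_map)
    for name, type_id in type_map.items():
        names[type_id] = name
    return names

# Training graph stored the dict in one order...
train_map = {"user": 0, "item": 1}
# ...but the inference graph stored the same types in another order.
infer_map = {"item": 1, "user": 0}

# ID-indexed construction agrees across both graphs.
assert build_list_right(train_map) == build_list_right(infer_map) == ["user", "item"]
# Order-based construction disagrees between the two graphs.
assert build_list_wrong(train_map) != build_list_wrong(infer_map)
```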
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Environment
- DGL Version (e.g., 1.0):
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
- OS (e.g., Linux):
- How you installed DGL (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version (if applicable):
- GPU models and configuration (e.g. V100):
- Any other relevant information:
Additional context
Hi Da, can you help us clarify the reproduction steps? We are having a hard time guessing which piece went wrong :(
> However, the current implementation adds the ntypes/etypes to the list in the order they appear in the dict and ignores the ntype/etype IDs in the dict.
which code snippet?
In current graph partition book, ntype/etype IDs are used. https://github.com/dmlc/dgl/blob/303b150f8c6b6dbd6ea2b3ca724ded64ae14cb5f/python/dgl/distributed/graph_partition_book.py#L789-L791
do you mean this part from dgl.distributed.load_partition()?
https://github.com/dmlc/dgl/blob/303b150f8c6b6dbd6ea2b3ca724ded64ae14cb5f/python/dgl/distributed/partition.py#L109
@zheng-da with this PR: https://github.com/dmlc/dgl/pull/5872, I think the case you mentioned should be fine as long as the etypes/ntypes are the same in the training and inference graphs, because the ntype/etype lists are used only there. The mapping from etype to etype_id is read from the etypes field of the part_config JSON.
If you could share a minimum repro, I will look into it.
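To make the comment above concrete, here is a small sketch of reading an etype → etype_id mapping out of a part_config-style JSON and turning it into an ID-indexable list. The field layout shown here is an assumption for illustration, not the exact schema of DGL's real part_config file.

```python
import json

# Assumed, simplified part_config layout for illustration only.
config_text = '{"etypes": {"follows": 0, "likes": 1}, "ntypes": {"user": 0}}'
part_config = json.loads(config_text)

# Build the list from the stored IDs rather than from dict order,
# so it stays correct regardless of how the JSON orders its keys.
etype_to_id = part_config["etypes"]
etypes = [None] * len(etype_to_id)
for name, eid in etype_to_id.items():
    etypes[eid] = name
```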