DistDGL doesn't create the mapping between etype/ntype and etype/ntype IDs correctly.
🐛 Bug
When DistDGL loads the graph partitions into memory, it loads the mapping between ntypes/etypes and their IDs as dicts from the JSON file. Afterwards, it converts the dicts to Python lists, where the index into the list is the ntype/etype ID. However, the current implementation adds the ntypes/etypes to the list in the order they appear in the dict and ignores the ntype/etype IDs stored in the dict. This wrong mapping causes a problem if users train on one graph and run inference on another, and the training and inference graphs store the ntypes/etypes in their dicts in different orders.
The right fix is to build the ntype/etype list so that it is indexable by the ntype/etype IDs stored in the dict.
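A minimal sketch of the behavior described above (the function and variable names here are illustrative, not DGL's actual implementation): building the list from dict order breaks when two graphs store the same types in different orders, while placing each name at the index given by its ID is order-independent.

```python
def build_list_wrong(type_map):
    # BUG (as described above): relies on dict insertion order and
    # ignores the IDs stored as dict values.
    return list(type_map.keys())

def build_list_right(type_map):
    # Fix: place each ntype/etype at the index given by its ID, so the
    # resulting list is indexable by the IDs from the JSON file.
    names = [None] * len(type_map)
    for name, type_id in type_map.items():
        names[type_id] = name
    return names

# Training graph stored the dict in one order...
train_map = {"user": 0, "item": 1}
# ...but the inference graph stored the same types in another order.
infer_map = {"item": 1, "user": 0}

# ID-indexed construction agrees across both graphs.
assert build_list_right(train_map) == build_list_right(infer_map) == ["user", "item"]
# Order-based construction disagrees between the two graphs.
assert build_list_wrong(train_map) != build_list_wrong(infer_map)
```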
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Environment
- DGL Version (e.g., 1.0):
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
- OS (e.g., Linux):
- How you installed DGL (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version (if applicable):
- GPU models and configuration (e.g. V100):
- Any other relevant information:
Additional context
Hi Da, can you help us clarify the reproduction steps? We are having a hard time guessing which piece went wrong :(
> However, the current implementation adds the ntypes/etypes to the list in the order they appear in the dict and ignores the ntype/etype IDs in the dict.
which code snippet?
In current graph partition book, ntype/etype IDs are used. https://github.com/dmlc/dgl/blob/303b150f8c6b6dbd6ea2b3ca724ded64ae14cb5f/python/dgl/distributed/graph_partition_book.py#L789-L791
do you mean this part from dgl.distributed.load_partition()?
https://github.com/dmlc/dgl/blob/303b150f8c6b6dbd6ea2b3ca724ded64ae14cb5f/python/dgl/distributed/partition.py#L109
@zheng-da with this PR: https://github.com/dmlc/dgl/pull/5872, I think the case you mentioned should be fine as long as the etypes/ntypes are the same in the training and inference graphs, because the ntype/etype lists are used only there. The mapping from etype to etype_id is read from the etypes field of the part_config JSON.
If you could share a minimum repro, I will look into it.
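To make the comment above concrete, here is a small sketch of reading an etype → etype_id mapping out of a part_config-style JSON and turning it into an ID-indexable list. The field layout shown here is an assumption for illustration, not the exact schema of DGL's real part_config file.

```python
import json

# Assumed, simplified part_config layout for illustration only.
config_text = '{"etypes": {"follows": 0, "likes": 1}, "ntypes": {"user": 0}}'
part_config = json.loads(config_text)

# Build the list from the stored IDs rather than from dict order,
# so it stays correct regardless of how the JSON orders its keys.
etype_to_id = part_config["etypes"]
etypes = [None] * len(etype_to_id)
for name, eid in etype_to_id.items():
    etypes[eid] = name
```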