dgl
Test disorder
https://github.com/dmlc/dgl/blob/28b09047791e1ad25bf2a890902369454d5070fc/examples/pytorch/ogb/ogbn-mag/hetero_rgcn.py#L442
Hey folks,
It seems that the dataloader here changes the order of nodes, in which case y_true won't match the correct nodes. Below is what I got.
Run: 01, Epoch: 01, Loss: 2.3240, Train: 1.77%, Valid: 1.78%, Test: 2.31%
Epoch 01: 100%|████████████████████████████████████████████████████████████████████████████████████| 629571/629571 [04:09<00:00, 2527.83it/s]
Full Inference: 100%|██████████████████████████████████████████████████████████████████████████████| 736389/736389 [01:13<00:00, 9987.59it/s]
Run: 01, Epoch: 02, Loss: 1.5152, Train: 1.73%, Valid: 1.76%, Test: 2.18%
Epoch 02: 100%|████████████████████████████████████████████████████████████████████████████████████| 629571/629571 [04:13<00:00, 2487.80it/s]
Full Inference: 100%|█████████████████████████████████████████████████████████████████████████████| 736389/736389 [01:08<00:00, 10736.15it/s]
Run: 01, Epoch: 03, Loss: 1.0953, Train: 1.69%, Valid: 1.68%, Test: 2.07%
Please take a look, thanks!
I saw that shuffle=False is provided. Do you mean the node order is still random?
Yes. After I reordered y and x, the results looked normal, so I guess the problem is that the order is still being shuffled.
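For anyone hitting the same thing, here is a minimal sketch of the kind of reordering described above. It is not the code from the example script: plain Python lists stand in for tensors, and the helper name gather_in_node_order and the (node_ids, preds) batch layout are assumptions made for illustration. The idea is to write each batch's predictions back to the positions given by its node IDs rather than concatenating batches in arrival order.

```python
def gather_in_node_order(batches, num_nodes):
    """Place each batch's predictions at the index of its node ID.

    `batches` is an iterable of (node_ids, preds) pairs (a hypothetical
    layout); the batches may arrive in any order.
    """
    y_pred = [None] * num_nodes
    for node_ids, preds in batches:
        for nid, pred in zip(node_ids, preds):
            y_pred[nid] = pred
    # Fail loudly if some node never received a prediction.
    missing = [i for i, p in enumerate(y_pred) if p is None]
    if missing:
        raise ValueError(f"no prediction for nodes: {missing}")
    return y_pred


# Toy data: the batch for nodes 2 and 3 arrives before the batch for 0 and 1.
batches = [([2, 3], [12, 13]), ([0, 1], [10, 11])]
y_true = [10, 11, 12, 13]

# Concatenating in arrival order misaligns predictions and labels.
naive = [p for _, preds in batches for p in preds]

# Scattering by node ID restores the 0..|V|-1 order.
aligned = gather_in_node_order(batches, num_nodes=4)
```

With tensors, the same idea would typically be an indexed assignment into a preallocated output, keyed by the node IDs each batch reports.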
Does https://github.com/dmlc/dgl/pull/4147 fix your problem?
Right, exactly the same problem. I think you guys have fixed it in the nightly version?
Yes, it should have been fixed in the nightly version. Could you please verify it?
Hey guys, I tried the nightly build of DGL and I can fully replicate the results from @zjost as below
https://github.com/dmlc/dgl/blob/28b09047791e1ad25bf2a890902369454d5070fc/examples/pytorch/ogb/ogbn-mag/README.md?plain=1#L19-L26
However, with dgl==0.8.2 I cannot replicate them.
When I run the code directly, I get the results shown above:
Run: 01, Epoch: 01, Loss: 2.3240, Train: 1.77%, Valid: 1.78%, Test: 2.31%
Epoch 01: 100%|████████████████████████████████████████████████████████████████████████████████████| 629571/629571 [04:09<00:00, 2527.83it/s]
Full Inference: 100%|██████████████████████████████████████████████████████████████████████████████| 736389/736389 [01:13<00:00, 9987.59it/s]
Run: 01, Epoch: 02, Loss: 1.5152, Train: 1.73%, Valid: 1.76%, Test: 2.18%
Epoch 02: 100%|████████████████████████████████████████████████████████████████████████████████████| 629571/629571 [04:13<00:00, 2487.80it/s]
Full Inference: 100%|█████████████████████████████████████████████████████████████████████████████| 736389/736389 [01:08<00:00, 10736.15it/s]
Run: 01, Epoch: 03, Loss: 1.0953, Train: 1.69%, Valid: 1.68%, Test: 2.07%
After I tried to fix the ordering issue myself, the results look normal but just too good to be true:
Highest Train: 83.99
Highest Valid: 53.31
Final Train: 83.99
Final Test: 51.12
I posted my implementation here; the only modification is in the test function, as below: https://github.com/devnkong/dgl-rgcn-mag/blob/8713e21cfe971c97ce019bb90d2c0e8e469a169d/train_rgcn.py#L468-L501
Also I think my modification is ok because when I run it with nightly dgl the result looks normal again, FYI.
In short, the nightly version looks good to me, and dgl==0.8.2 yields weird results. Thanks.
DGL team: can you please describe the impact of this bug? Has this always been a problem? If not, which versions of DGL/PyTorch are impacted?
The referenced HeteroRGCN example evaluates node embeddings in batches. It is supposed to make predictions in node order, 0 to |V|-1, so that they can be compared with the ground-truth y_truth array. However, the bug causes the dataloader to return batches in random order, so the y_pred and y_truth arrays do not line up, which makes the following accuracy calculation wrong:
https://github.com/dmlc/dgl/blob/28b09047791e1ad25bf2a890902369454d5070fc/examples/pytorch/ogb/ogbn-mag/hetero_rgcn.py#L475-L488
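To illustrate why a misaligned order produces accuracies this low, here is a small self-contained sketch on synthetic data. It is not taken from the example script; the node count, class count, and seed are arbitrary assumptions. It scores a "perfect" set of predictions twice, once aligned with the labels and once after a random permutation.

```python
import random


def accuracy(y_pred, y_true):
    """Fraction of positions where prediction and label agree."""
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)


rng = random.Random(0)             # fixed seed, arbitrary choice
num_nodes, num_classes = 10_000, 50  # synthetic sizes, not ogbn-mag's

y_true = [rng.randrange(num_classes) for _ in range(num_nodes)]
y_pred = list(y_true)              # pretend the model is perfect

# Simulate predictions coming back in a shuffled node order.
order = list(range(num_nodes))
rng.shuffle(order)
y_pred_shuffled = [y_pred[i] for i in order]

acc_aligned = accuracy(y_pred, y_true)
acc_shuffled = accuracy(y_pred_shuffled, y_true)
print(f"aligned: {acc_aligned:.2%}, shuffled: {acc_shuffled:.2%}")
```

Even with perfect predictions, the shuffled score drops to roughly chance level for the label distribution, which is consistent with a model whose loss keeps decreasing while the reported accuracy stays at a few percent.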
Yes, thank you. I'm wondering which versions of DGL/PyTorch are affected by this bug.
The bug was fixed in DGL 0.9. Previous versions may be affected by this.