Test disorder

Open devnkong opened this issue 2 years ago • 11 comments

https://github.com/dmlc/dgl/blob/28b09047791e1ad25bf2a890902369454d5070fc/examples/pytorch/ogb/ogbn-mag/hetero_rgcn.py#L442

Hey folks,

It seems that the dataloader here will change the order of nodes, and in that case the y_true won't match the correct nodes. Below is what I got.

Run: 01, Epoch: 01, Loss: 2.3240, Train: 1.77%, Valid: 1.78%, Test: 2.31%
Epoch 01: 100%|████████████████████████████████████████████████████████████████████████████████████| 629571/629571 [04:09<00:00, 2527.83it/s]
Full Inference: 100%|██████████████████████████████████████████████████████████████████████████████| 736389/736389 [01:13<00:00, 9987.59it/s]
Run: 01, Epoch: 02, Loss: 1.5152, Train: 1.73%, Valid: 1.76%, Test: 2.18%
Epoch 02: 100%|████████████████████████████████████████████████████████████████████████████████████| 629571/629571 [04:13<00:00, 2487.80it/s]
Full Inference: 100%|█████████████████████████████████████████████████████████████████████████████| 736389/736389 [01:08<00:00, 10736.15it/s]
Run: 01, Epoch: 03, Loss: 1.0953, Train: 1.69%, Valid: 1.68%, Test: 2.07%

Please take a look, thanks!

devnkong avatar Jul 07 '22 21:07 devnkong

I saw shuffle=False is provided. Do you mean the node order is still random?

jermainewang avatar Jul 11 '22 02:07 jermainewang

Yes, after I tried to reorder the y and x the results look normal, so I guess the problem is the order is still shuffled.

devnkong avatar Jul 11 '22 04:07 devnkong

Does https://github.com/dmlc/dgl/pull/4147 fix your problem?

BarclayII avatar Jul 11 '22 06:07 BarclayII

Does #4147 fix your problem?

Right, exactly the same problem. I think you guys have fixed it in the nightly version?

devnkong avatar Jul 11 '22 08:07 devnkong

Yes, it should have been fixed in nightly version. Could you please verify it?

jermainewang avatar Jul 11 '22 09:07 jermainewang

Hey guys, I tried to use nightly build DGL and I can fully replicate the results from @zjost as below

https://github.com/dmlc/dgl/blob/28b09047791e1ad25bf2a890902369454d5070fc/examples/pytorch/ogb/ogbn-mag/README.md?plain=1#L19-L26

However, when I use dgl==0.8.2 I cannot replicate that.

When I run the code directly, I get the same results as shown above:

Run: 01, Epoch: 01, Loss: 2.3240, Train: 1.77%, Valid: 1.78%, Test: 2.31%
Epoch 01: 100%|████████████████████████████████████████████████████████████████████████████████████| 629571/629571 [04:09<00:00, 2527.83it/s]
Full Inference: 100%|██████████████████████████████████████████████████████████████████████████████| 736389/736389 [01:13<00:00, 9987.59it/s]
Run: 01, Epoch: 02, Loss: 1.5152, Train: 1.73%, Valid: 1.76%, Test: 2.18%
Epoch 02: 100%|████████████████████████████████████████████████████████████████████████████████████| 629571/629571 [04:13<00:00, 2487.80it/s]
Full Inference: 100%|█████████████████████████████████████████████████████████████████████████████| 736389/736389 [01:08<00:00, 10736.15it/s]
Run: 01, Epoch: 03, Loss: 1.0953, Train: 1.69%, Valid: 1.68%, Test: 2.07%

After I tried to fix the ordering issue myself, the results look normal but just too good to be true:

Highest Train: 83.99
Highest Valid: 53.31
Final Train: 83.99
Final Test: 51.12

I posted my implementation here, the only modification happens in the test function as below: https://github.com/devnkong/dgl-rgcn-mag/blob/8713e21cfe971c97ce019bb90d2c0e8e469a169d/train_rgcn.py#L468-L501

Also I think my modification is ok because when I run it with nightly dgl the result looks normal again, FYI.

devnkong avatar Jul 11 '22 21:07 devnkong

In short, the nightly version looks good to me, and dgl==0.8.2 yields weird results. Thanks.

devnkong avatar Jul 11 '22 21:07 devnkong

DGL team: can you please describe the impact of this bug? Has this always been a problem? If not, which versions of DGL/PyTorch are impacted?

zjost avatar Jul 12 '22 13:07 zjost

The referenced HeteroRGCN example evaluates node embeddings in batches. It is supposed to make predictions in node order, from 0 to |V|-1, so that they can be compared with the ground-truth y_truth array. However, the bug causes the dataloader to return batches in random order, so the y_pred and y_truth arrays no longer line up, which makes the accuracy calculation in the following lines wrong:

https://github.com/dmlc/dgl/blob/28b09047791e1ad25bf2a890902369454d5070fc/examples/pytorch/ogb/ogbn-mag/hetero_rgcn.py#L475-L488
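
To illustrate the mismatch, here is a minimal plain-Python sketch. It is not the code from the example or from the fix; the batch list, `toy_model`, and the label formula are all made up for the demonstration. It compares concatenating per-batch predictions (which assumes batches arrive in node order) with writing each prediction into the slot given by its node ID.

```python
def accuracy(y_pred, y_true):
    """Fraction of positions where prediction and label agree."""
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)


def toy_model(node_id):
    """Hypothetical perfect classifier: returns the true label of the node."""
    return (7 * node_id) % 5


num_nodes = 12
batch_size = 4

# Ground truth is stored in node order 0 .. num_nodes - 1.
y_true = [(7 * n) % 5 for n in range(num_nodes)]

# Stand-in for a dataloader that does NOT preserve node order:
# the batches are deliberately reversed to mimic the bug.
batches = [
    list(range(start, min(start + batch_size, num_nodes)))
    for start in range(0, num_nodes, batch_size)
]
batches.reverse()

# Fragile: concatenation silently assumes batches come back in node order.
y_pred_concat = [toy_model(n) for batch in batches for n in batch]

# Robust: place every prediction at the index of the node it belongs to.
y_pred_by_id = [None] * num_nodes
for batch in batches:
    for n in batch:
        y_pred_by_id[n] = toy_model(n)

acc_concat = accuracy(y_pred_concat, y_true)
acc_by_id = accuracy(y_pred_by_id, y_true)
print(f"concatenated: {acc_concat:.2%}, indexed by node ID: {acc_by_id:.2%}")
```

Even though the toy model is always right, the concatenated predictions score poorly because they are compared against labels of different nodes. Indexing by node ID gives the correct accuracy regardless of batch order.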

jermainewang avatar Jul 13 '22 09:07 jermainewang

Yes, thank you. I'm wondering which versions of DGL/PyTorch are affected by this bug.

zjost avatar Jul 20 '22 15:07 zjost

Yes, thank you. I'm wondering which versions of DGL/PyTorch are affected by this bug.

The bug was fixed in DGL 0.9. Previous versions may be affected by this.
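
A script that relies on ordered evaluation could guard against older installs with a check like the sketch below. The helper names are hypothetical, and the 0.9 threshold comes only from the statement above; the thread does not say which earlier versions first introduced the bug.

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) from strings such as '0.8.2' or '0.9a220711'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def may_have_ordering_bug(dgl_version):
    """True if the version is older than 0.9, where the fix is reported to land."""
    return parse_major_minor(dgl_version) < (0, 9)


# In a real script the argument would be dgl.__version__.
for v in ("0.8.2", "0.9.0"):
    print(v, "-> possibly affected" if may_have_ordering_bug(v) else "-> has the fix")
```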

jermainewang avatar Jul 25 '22 06:07 jermainewang

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot] avatar Aug 25 '22 01:08 github-actions[bot]