dgl
[Dist][CI] Unit test for the new distributed partitioning pipeline
Description
I've successfully tested test_chunk_graph and test_partition locally. The test for test_dispatch raised an error, and I'm not sure whether this is due to an inappropriate IP config file.
Checklist
Please feel free to remove inapplicable items for your PR.
- [x] The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
- [ ] Changes are complete (i.e. I finished coding on this PR)
- [ ] All changes have test coverage
- [ ] Code is well-documented
- [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
- [x] Related issue is referred in this PR
- [ ] If the PR is for a new model/paper, I've updated the example index here.
> The test for test_dispatch raised an error and I'm not sure if this is due to an inappropriate ip config file.
Could you post the error here? And also your test setup.
Your way of testing chunk_graph is ok, but for partitioning and data dispatching, could you use os.system to test the command lines directly?
To trigger regression tests:
- @dgl-bot run [instance-type] [which tests] [compare-with-branch]; for example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master
> > The test for test_dispatch raised an error and I'm not sure if this is due to an inappropriate ip config file.
>
> Could you post the error here? And also your test setup.
dispatch_data.py attempts to connect to a cluster via ssh based on the IP configuration file provided. I did not realize this previously and used 127.0.0.1, which is incorrect. Locally I can start some machines and list their IPs in the IP configuration file. The question is then: how should we test this on CI?
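One possible direction, offered only as a sketch: the test could generate its IP configuration file in a temporary directory, so that the addresses come from whatever machines the CI environment actually exposes. The helper names `write_ip_config` and `read_ip_config` are hypothetical, the one-address-per-line layout is an assumption about the file format, and the addresses are documentation placeholders rather than real hosts.

```python
import os
import tempfile


def write_ip_config(directory, ips):
    """Write an IP config file with one address per line (assumed format)."""
    path = os.path.join(directory, "ip_config.txt")
    with open(path, "w") as f:
        for ip in ips:
            f.write(ip + "\n")
    return path


def read_ip_config(path):
    """Read the addresses back, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


# Placeholder addresses; a real test would inject reachable hosts here.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_ip_config(tmp, ["192.0.2.1", "192.0.2.2"])
    hosts = read_ip_config(config_path)
    print(hosts)
```

This only covers producing the file. Whether the listed hosts accept ssh connections on CI is a separate question that the sketch does not answer.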
> Your way of testing chunk_graph is ok, but for partitioning and data dispatching, could you use os.system to directly test commandlines?
Done. I slightly modified random_partition.py so that os.system can work.
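For readers following along, here is a minimal sketch of what testing a command line through os.system can look like. It is not the test code from this PR: `python_cmd` and `run_command` are hypothetical helpers, and the inline Python snippets stand in for the real pipeline scripts, whose arguments are not shown in this thread. It assumes a POSIX shell.

```python
import os
import shlex
import sys


def python_cmd(code):
    """Build a shell command that runs `code` with the current interpreter."""
    return "{} -c {}".format(shlex.quote(sys.executable), shlex.quote(code))


def run_command(cmd):
    """Run a shell command via os.system and return its raw status."""
    return os.system(cmd)


def test_commandline_succeeds():
    # Stand-in for invoking a pipeline script from the command line.
    assert run_command(python_cmd("print('ok')")) == 0


def test_commandline_failure_is_detected():
    # A non-zero exit code must surface as a non-zero status.
    assert run_command(python_cmd("import sys; sys.exit(3)")) != 0


test_commandline_succeeds()
test_commandline_failure_is_detected()
```

Asserting on the returned status matters because os.system does not raise when the command fails, so a test that ignores the status would pass even if the script crashed.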
Commit ID: 9c8926aeb5580164f50d5385743258bfd8c9f604
Build ID: 1
Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].
Report path: link
Full logs path: link
Commit ID: b6bb0dfe9091827e9a4a3f8cd1b977370d3f4014
Build ID: 2
Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].
Report path: link
Full logs path: link
Commit ID: 62b63dffeefbf9963c31ac01ce6150d380835676
Build ID: 3
Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].
Report path: link
Full logs path: link
Commit ID: e138040fbfe4b8821e7940a99d2466d6ebe405d8
Build ID: 4
Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].
Report path: link
Full logs path: link
Commit ID: ae3731347645461c818359b5102b3e3090949cd8
Build ID: 5
Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].
Report path: link
Full logs path: link
Commit ID: 80257f751b159cf48dd7343299989914fac95b16
Build ID: 6
Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].
Report path: link
Full logs path: link
It seems that load_tensors('.../edge_feat.dgl') returns an empty dict at the end of test_dispatch. Is this expected behavior? @jermainewang @kylasa
Commit ID: c33b7a6bd6d2ac319f34c05f89705cb4794d26b5
Build ID: 7
Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].
Report path: link
Full logs path: link
> It seems that load_tensors('.../edge_feat.dgl') gives an empty dict at the end of test_dispatch. Is this an expected behavior? @jermainewang @kylasa
Confirmed with @jermainewang that this is expected.
Commit ID: ab07030eaa6ca1ddadf59aaa9c0042f893718048
Build ID: 8
Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].
Report path: link
Full logs path: link
Commit ID: 735fc0bb5259973042c51ee201517e60d346c811
Build ID: 9
Status: ❌ CI test failed in Stage [Torch CPU].
Report path: link
Full logs path: link
Commit ID: 50bd5b13c26ed4d72c35786c25e7885a464112ac
Build ID: 10
Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].
Report path: link
Full logs path: link
Commit ID: 965a94d9913efc6fe1948d83a2efd3c63f93f33f
Build ID: 11
Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].
Report path: link
Full logs path: link
Commit ID: a4e58c47389ceff27303444710836a2b121e76a8
Build ID: 12
Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].
Report path: link
Full logs path: link
Commit ID: 79d53ae6335bd8b29f5a1209acfe815457431794
Build ID: 13
Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].
Report path: link
Full logs path: link
Commit ID: 2946db6e798a458ac6d45fe184ca26cccc3f2ca7
Build ID: 14
Status: ✅ CI test succeeded
Report path: link
Full logs path: link