[Dist][CI] Unit test for the new distributed partitioning pipeline

Open mufeili opened this issue 2 years ago • 6 comments

Description

I've successfully tested test_chunk_graph and test_partition locally. The test for test_dispatch raised an error and I'm not sure if this is due to an inappropriate ip config file.

Checklist

Please feel free to remove inapplicable items for your PR.

  • [x] The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • [ ] Changes are complete (i.e. I finished coding on this PR)
  • [ ] All changes have test coverage
  • [ ] Code is well-documented
  • [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • [x] Related issue is referred in this PR
  • [ ] If the PR is for a new model/paper, I've updated the example index here.

mufeili avatar Aug 12 '22 09:08 mufeili

The test for test_dispatch raised an error and I'm not sure if this is due to an inappropriate ip config file.

Could you post the error here? And also your test setup.

jermainewang avatar Aug 13 '22 09:08 jermainewang

Your way of testing chunk_graph is ok, but for partitioning and data dispatching, could you use os.system to directly test command lines?

jermainewang avatar Aug 13 '22 09:08 jermainewang

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch]; For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

dgl-bot avatar Aug 15 '22 08:08 dgl-bot

The test for test_dispatch raised an error and I'm not sure if this is due to an inappropriate ip config file.

Could you post the error here? And also your test setup.

dispatch_data.py attempts to connect to a cluster via ssh based on the IP configuration file provided. I did not realize this previously and used 127.0.0.1, which is incorrect. Locally I can start some machines and list their IPs in the IP configuration file. The question then is: how should we test this on CI?
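
For illustration only, here is a minimal sketch of how a test could generate an IP configuration file and flag a loopback-only setup like the 127.0.0.1 case above. The one-address-per-line format and the helper names (write_ip_config, read_ip_config, all_loopback) are assumptions made for this sketch, not the format or API that dispatch_data.py actually uses.

```python
import ipaddress
import os
import tempfile


def write_ip_config(path, hosts):
    """Write one host address per line (assumed format, for illustration)."""
    with open(path, "w") as f:
        for host in hosts:
            f.write(host + "\n")


def read_ip_config(path):
    """Return the first whitespace-separated field of each non-empty line."""
    with open(path) as f:
        return [line.split()[0] for line in f if line.strip()]


def all_loopback(hosts):
    """True if every address is a loopback address such as 127.0.0.1."""
    return all(ipaddress.ip_address(h).is_loopback for h in hosts)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = os.path.join(tmp, "ip_config.txt")
        write_ip_config(cfg, ["127.0.0.1"])
        hosts = read_ip_config(cfg)
        if all_loopback(hosts):
            print("IP config only lists loopback addresses:", hosts)
```

A check like this does not answer the CI question itself; it only makes the misconfiguration visible before a test tries to reach the listed machines.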

mufeili avatar Aug 15 '22 08:08 mufeili

Your way of testing chunk_graph is ok, but for partitioning and data dispatching, could you use os.system to directly test command lines?

Done. I slightly modified random_partition.py so that os.system can work.
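
A rough sketch of this style of test, for readers unfamiliar with it: the test builds a command line, runs it with os.system, and checks the exit status and the files the command produced. The stand-in script, the run_cli helper, and the output file name below are hypothetical and exist only for this example; the real arguments of random_partition.py are not shown in this thread and are not reproduced here.

```python
import os
import shlex
import sys
import tempfile

# Hypothetical stand-in for a command-line tool: it takes an output
# directory as its only argument and writes a marker file into it.
STAND_IN_SCRIPT = """\
import pathlib
import sys

out_dir = pathlib.Path(sys.argv[1])
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "done.txt").write_text("ok")
"""


def run_cli(script_path, *args):
    """Run a Python script through the shell and return os.system's status.

    A status of 0 means the command succeeded. shlex.quote assumes a
    POSIX shell, which matches a typical Linux CI machine.
    """
    parts = [sys.executable, script_path, *args]
    cmd = " ".join(shlex.quote(p) for p in parts)
    return os.system(cmd)


def test_cli_stand_in():
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "stand_in.py")
        with open(script, "w") as f:
            f.write(STAND_IN_SCRIPT)
        out_dir = os.path.join(tmp, "out")
        # The command must exit cleanly and leave its output behind.
        assert run_cli(script, out_dir) == 0
        assert os.path.exists(os.path.join(out_dir, "done.txt"))


if __name__ == "__main__":
    test_cli_stand_in()
```

Checking the status matters because os.system does not raise when the command fails; without the assertion a broken script would still let the test pass.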

mufeili avatar Aug 15 '22 08:08 mufeili

Commit ID: 9c8926aeb5580164f50d5385743258bfd8c9f604

Build ID: 1

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Aug 15 '22 09:08 dgl-bot

Commit ID: b6bb0dfe9091827e9a4a3f8cd1b977370d3f4014

Build ID: 2

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Aug 16 '22 05:08 dgl-bot

Commit ID: 62b63dffeefbf9963c31ac01ce6150d380835676

Build ID: 3

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Aug 16 '22 09:08 dgl-bot

Commit ID: e138040fbfe4b8821e7940a99d2466d6ebe405d8

Build ID: 4

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Aug 16 '22 09:08 dgl-bot

Commit ID: ae3731347645461c818359b5102b3e3090949cd8

Build ID: 5

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Aug 16 '22 10:08 dgl-bot

Commit ID: 80257f751b159cf48dd7343299989914fac95b16

Build ID: 6

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Aug 16 '22 11:08 dgl-bot

It seems that load_tensors('.../edge_feat.dgl') gives an empty dict at the end of test_dispatch. Is this an expected behavior? @jermainewang @kylasa

mufeili avatar Aug 17 '22 04:08 mufeili

Commit ID: c33b7a6bd6d2ac319f34c05f89705cb4794d26b5

Build ID: 7

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Aug 17 '22 04:08 dgl-bot

It seems that load_tensors('.../edge_feat.dgl') gives an empty dict at the end of test_dispatch. Is this an expected behavior? @jermainewang @kylasa

Confirmed with @jermainewang that this is expected.

mufeili avatar Aug 17 '22 05:08 mufeili

Commit ID: ab07030eaa6ca1ddadf59aaa9c0042f893718048

Build ID: 8

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Aug 17 '22 08:08 dgl-bot

Commit ID: 735fc0bb5259973042c51ee201517e60d346c811

Build ID: 9

Status: ❌ CI test failed in Stage [Torch CPU].

Report path: link

Full logs path: link

dgl-bot avatar Aug 17 '22 09:08 dgl-bot

Commit ID: 50bd5b13c26ed4d72c35786c25e7885a464112ac

Build ID: 10

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Aug 17 '22 11:08 dgl-bot

Commit ID: 965a94d9913efc6fe1948d83a2efd3c63f93f33f

Build ID: 11

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Aug 17 '22 12:08 dgl-bot

Commit ID: a4e58c47389ceff27303444710836a2b121e76a8

Build ID: 12

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Aug 18 '22 02:08 dgl-bot

Commit ID: 79d53ae6335bd8b29f5a1209acfe815457431794

Build ID: 13

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Aug 18 '22 03:08 dgl-bot

Commit ID: 2946db6e798a458ac6d45fe184ca26cccc3f2ca7

Build ID: 14

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 18 '22 09:08 dgl-bot

Commit ID: ee8fe6b35abf44ddaa799af63b24e3b59b17c3f0

Build ID: 15

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Aug 18 '22 14:08 dgl-bot

Commit ID: e7c03eb55ad3d2830f4c907f159aac99384db703

Build ID: 16

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 19 '22 06:08 dgl-bot