[Draft][Feature] Cooperative minibatching
Description
Hi, during my internship at NVIDIA Devtech AI division, I worked on exploring ways to accelarate minibatch training on GNNs. During this time, we noticed that if the work of a batch is given by W(B), then the following inequality holds: W(PB) <= PW(B). This PR implements the idea of multiple GPUs cooperatively process a minibatch of size PB instead of processing minibatches of size B separately. On machines with NVLink, this way of training is faster in most cases we have encountered.
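The inequality above can be seen empirically with a small simulation (pure Python, hypothetical graph and sizes, not the actual PR code): when P seed sets are expanded together, shared neighbors are deduplicated, so the cooperative batch never touches more unique nodes than P independent batches do.

```python
import random

random.seed(0)

# Random directed graph: node -> list of 10 in-neighbors (illustrative sizes).
num_nodes = 1000
neighbors = {v: random.sample(range(num_nodes), 10) for v in range(num_nodes)}

def sampling_work(seeds):
    """Unique neighbors touched when expanding the seed set by one hop --
    a proxy for the sampling/gather work W(B) of a minibatch."""
    return len({u for v in seeds for u in neighbors[v]})

P = 4          # number of GPUs
B = 64         # per-GPU batch size
seeds = random.sample(range(num_nodes), P * B)
parts = [seeds[i * B:(i + 1) * B] for i in range(P)]

W_PB = sampling_work(seeds)                  # one cooperative batch of size P*B
PW_B = sum(sampling_work(p) for p in parts)  # P independent batches of size B

# The union of seed sets deduplicates shared neighbors: W(PB) <= P*W(B).
assert W_PB <= PW_B
print(W_PB, PW_B)
```

The assertion holds for any partition of the seeds, since the unique-neighbor count of a union is at most the sum of the per-part counts.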
@nv-dlasalle mentored me on this project. A technical report explaining the work can be found here: https://arxiv.org/abs/2310.12403. Note that this branch is currently not up to date with the report.
In January 2023, this work was submitted to ICML. The January 2023 version of the paper can be found here: Cooperative Minibatching ICML submission. Subsequent work by @sandeep06011991 in March 2023 explored a similar but more limited version.
Example run command:
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=25555
torchrun --nnodes=1:64 --nproc_per_node=1 --rdzv_id=123123123 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} train_dist_coop.py --num-epochs=1024 --dataset=ogbn-products --batch-size=3200 --train --replication=8
Checklist
Please feel free to remove inapplicable items for your PR.
- [ ] The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
- [ ] Changes are complete (i.e. I finished coding on this PR)
- [ ] All changes have test coverage
- [ ] Code is well-documented
- [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
- [ ] If the PR is for a new model/paper, I've updated the example index here.
Changes
To trigger regression tests:
@dgl-bot run [instance-type] [which tests] [compare-with-branch]; for example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master
Commit ID: 815b8c1a48605de5d528d65ba1699027349ede56
Build ID: 1
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
Commit ID: 02e96900869f86346069ceb9445169fbc17ee10a
Build ID: 2
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
Commit ID: 5cb2c02a77d8541a21334c9e69a172014254c6a0
Build ID: 3
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
@nv-dlasalle Could you provide more context on this, e.g., a technical report or RFC?
@mufeili This is a proof-of-concept showing a single minibatch being run across multiple GPUs (and the advantages of doing so). A technical report still needs to be written.
Before it could be suitable for merging, I think we would need a good way to wrap GCN modules, such that the wrapper handles message passing between GPUs during aggregation operations; it would look similar to PyTorch's DDP.
Right now, I am attaching some extra attributes to the sampled DGLBlock objects to keep a reference to the DistGraph object, so that the DistConvLayer can perform the communication. But as @nv-dlasalle mentioned, a more principled way of achieving this is needed before this PR can be merged. I am open to suggestions on how to achieve that.
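To make the aggregation step concrete, here is a minimal single-process simulation (NumPy, all names and the partition layout are hypothetical, not the PR's actual DistConvLayer) of what the cooperative aggregation computes: each "GPU" produces the neighbor mean for the destination nodes it owns, pulling remote source features where needed, and the result matches a single-machine aggregation over the same edges.

```python
import numpy as np

rng = np.random.default_rng(0)

P = 2                  # simulated number of GPUs
feat_dim = 4
num_nodes = 8
# Global edge list (src, dst); nodes are split evenly across partitions.
edges = [(0, 4), (1, 4), (5, 2), (6, 1), (3, 7), (2, 6)]

def part_of(v):
    """Partition owning node v (hypothetical contiguous layout)."""
    return v * P // num_nodes

feats = rng.standard_normal((num_nodes, feat_dim))

def coop_mean_aggregate(feats):
    """Each 'GPU' computes neighbor means only for destinations it owns.
    Reading feats[src] for a remote src is where a DistConvLayer-style
    wrapper would insert the inter-GPU exchange."""
    out = np.zeros_like(feats)
    deg = np.zeros(num_nodes)
    for rank in range(P):
        for src, dst in edges:
            if part_of(dst) == rank:    # this GPU owns the destination
                out[dst] += feats[src]  # may require pulling a remote feature
                deg[dst] += 1
    deg[deg == 0] = 1                   # avoid dividing isolated nodes by 0
    return out / deg[:, None]

# Reference: plain single-machine mean aggregation over the same edges.
ref = np.zeros_like(feats)
deg = np.zeros(num_nodes)
for src, dst in edges:
    ref[dst] += feats[src]
    deg[dst] += 1
deg[deg == 0] = 1
ref = ref / deg[:, None]

assert np.allclose(coop_mean_aggregate(feats), ref)
```

Since every destination has exactly one owner, each edge is processed exactly once across ranks, so the cooperative result is bitwise-equivalent to the single-machine aggregation; the open design question is only where the feature exchange hook should live.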
We will start on the technical report soon and it will hopefully be available in the next month or two.
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: b4f7ab20b7630b76305ba145a9ea21367e55e254
Build ID: 6
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: 7c25dd12eb9cb59311302a9c6e35ec68da811b5d
Build ID: 7
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: 20eb2f63c6b612793ea4accd7ed1520dfe632ae1
Build ID: 8
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: 0609f6ee2fd7201d6fda3ec09b23eb45895099b5
Build ID: 9
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: fca60ff4ab2a4552c22387f1071ea35f0df25e13
Build ID: 10
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: 08f862624acc53b176cf28c78aed3843897c1e99
Build ID: 11
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: 3772de4d763a5c3e09c07f52bb9bee94d51aa35f
Build ID: 12
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: 73d7495ed06b20a14e312992c7e8e06905ef3dc3
Build ID: 13
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: 75209b57d106d39f2d4d67ff31fff799bcc48e73
Build ID: 14
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot