[Draft][Feature] Cooperative minibatching
Description
Hi, during my internship at NVIDIA Devtech AI division, I worked on exploring ways to accelarate minibatch training on GNNs. During this time, we noticed that if the work of a batch is given by W(B), then the following inequality holds: W(PB) <= PW(B). This PR implements the idea of multiple GPUs cooperatively process a minibatch of size PB instead of processing minibatches of size B separately. On machines with NVLink, this way of training is faster in most cases we have encountered.
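The inequality above can be seen empirically with a small simulation (pure Python, hypothetical graph and sizes, not the actual PR code): when P seed sets are expanded together, shared neighbors are deduplicated, so the cooperative batch never touches more unique nodes than P independent batches do.

```python
import random

random.seed(0)

# Random directed graph: node -> list of 10 in-neighbors (illustrative sizes).
num_nodes = 1000
neighbors = {v: random.sample(range(num_nodes), 10) for v in range(num_nodes)}

def sampling_work(seeds):
    """Unique neighbors touched when expanding the seed set by one hop --
    a proxy for the sampling/gather work W(B) of a minibatch."""
    return len({u for v in seeds for u in neighbors[v]})

P = 4          # number of GPUs
B = 64         # per-GPU batch size
seeds = random.sample(range(num_nodes), P * B)
parts = [seeds[i * B:(i + 1) * B] for i in range(P)]

W_PB = sampling_work(seeds)                  # one cooperative batch of size P*B
PW_B = sum(sampling_work(p) for p in parts)  # P independent batches of size B

# The union of seed sets deduplicates shared neighbors: W(PB) <= P*W(B).
assert W_PB <= PW_B
print(W_PB, PW_B)
```

The assertion holds for any partition of the seeds, since the unique-neighbor count of a union is at most the sum of the per-part counts.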
@nv-dlasalle mentored me on this project. A technical report explaining the work can be found here: https://arxiv.org/abs/2310.12403. Note that this branch is currently not up to date with the report.
In January 2023, this work was submitted to ICML. The January 2023 version of the paper can be found here: Cooperative Minibatching ICML submission. Subsequent work by @sandeep06011991 in March 2023 explored a similar but more limited version.
Example run command:
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=25555
torchrun --nnodes=1:64 --nproc_per_node=1 --rdzv_id=123123123 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} train_dist_coop.py --num-epochs=1024 --dataset=ogbn-products --batch-size=3200 --train --replication=8
Checklist
Please feel free to remove inapplicable items for your PR.
- [ ] The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
- [ ] Changes are complete (i.e. I finished coding on this PR)
- [ ] All changes have test coverage
- [ ] Code is well-documented
- [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
- [ ] If the PR is for a new model/paper, I've updated the example index here.
Changes
To trigger regression tests:
@dgl-bot run [instance-type] [which tests] [compare-with-branch]; for example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master
Commit ID: 815b8c1a48605de5d528d65ba1699027349ede56
Build ID: 1
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
Commit ID: 02e96900869f86346069ceb9445169fbc17ee10a
Build ID: 2
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
Commit ID: 5cb2c02a77d8541a21334c9e69a172014254c6a0
Build ID: 3
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
@nv-dlasalle Could you provide more context on this, e.g., a technical report or RFC?
@mufeili This is a proof-of-concept showing a single minibatch being run across multiple GPUs (and the advantages of doing so). A technical report still needs to be written.
Before it could be suitable for merging, I think we would need a good way to wrap GCN modules, such that the wrapper handles message passing between GPUs during aggregation operations; it would look similar to PyTorch's DDP.
Right now, I am attaching some extra attributes to the sampled DGLBlock objects to keep a reference to the DistGraph object, so that the DistConvLayer can perform the communication. But as @nv-dlasalle mentioned, a more principled way of achieving this is needed before this PR can be merged. I am open to suggestions on how to achieve that.
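To make the aggregation step concrete, here is a minimal single-process simulation (NumPy, all names and the partition layout are hypothetical, not the PR's actual DistConvLayer) of what the cooperative aggregation computes: each "GPU" produces the neighbor mean for the destination nodes it owns, pulling remote source features where needed, and the result matches a single-machine aggregation over the same edges.

```python
import numpy as np

rng = np.random.default_rng(0)

P = 2                  # simulated number of GPUs
feat_dim = 4
num_nodes = 8
# Global edge list (src, dst); nodes are split evenly across partitions.
edges = [(0, 4), (1, 4), (5, 2), (6, 1), (3, 7), (2, 6)]

def part_of(v):
    """Partition owning node v (hypothetical contiguous layout)."""
    return v * P // num_nodes

feats = rng.standard_normal((num_nodes, feat_dim))

def coop_mean_aggregate(feats):
    """Each 'GPU' computes neighbor means only for destinations it owns.
    Reading feats[src] for a remote src is where a DistConvLayer-style
    wrapper would insert the inter-GPU exchange."""
    out = np.zeros_like(feats)
    deg = np.zeros(num_nodes)
    for rank in range(P):
        for src, dst in edges:
            if part_of(dst) == rank:    # this GPU owns the destination
                out[dst] += feats[src]  # may require pulling a remote feature
                deg[dst] += 1
    deg[deg == 0] = 1                   # avoid dividing isolated nodes by 0
    return out / deg[:, None]

# Reference: plain single-machine mean aggregation over the same edges.
ref = np.zeros_like(feats)
deg = np.zeros(num_nodes)
for src, dst in edges:
    ref[dst] += feats[src]
    deg[dst] += 1
deg[deg == 0] = 1
ref = ref / deg[:, None]

assert np.allclose(coop_mean_aggregate(feats), ref)
```

Since every destination has exactly one owner, each edge is processed exactly once across ranks, so the cooperative result is bitwise-equivalent to the single-machine aggregation; the open design question is only where the feature exchange hook should live.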
We will start on the technical report soon and it will hopefully be available in the next month or two.
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: b4f7ab20b7630b76305ba145a9ea21367e55e254
Build ID: 6
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: 7c25dd12eb9cb59311302a9c6e35ec68da811b5d
Build ID: 7
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: 20eb2f63c6b612793ea4accd7ed1520dfe632ae1
Build ID: 8
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: 0609f6ee2fd7201d6fda3ec09b23eb45895099b5
Build ID: 9
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: fca60ff4ab2a4552c22387f1071ea35f0df25e13
Build ID: 10
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: 08f862624acc53b176cf28c78aed3843897c1e99
Build ID: 11
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: 3772de4d763a5c3e09c07f52bb9bee94d51aa35f
Build ID: 12
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: 73d7495ed06b20a14e312992c7e8e06905ef3dc3
Build ID: 13
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: 75209b57d106d39f2d4d67ff31fff799bcc48e73
Build ID: 14
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot