[gemini] async grad chunk reduce (all-reduce&reduce-scatter)
📌 Checklist before creating the PR
- [ ] I have created an issue for this PR for traceability
- [x] The title follows the standard format: `[doc/gemini/tensor/...]: A concise description`
- [ ] I have added relevant tags if possible for us to better distinguish different PRs
- [x] I have installed pre-commit: `pip install pre-commit && pre-commit install`
🚨 Issue number
Link this PR to your issue with words like "fixed" to automatically close the linked issue upon merge, e.g. `fixed #1234`, `closed #1234`, `resolved #1234`.
📝 What does this PR do?
This PR makes Gemini's gradient chunk reduction (all-reduce / reduce-scatter) asynchronous, so reducing a finished gradient chunk no longer blocks the remaining backward computation. The trace and benchmark below compare the previous synchronous behavior with the new asynchronous path.
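For context, a minimal sketch of the idea, assuming plain `torch.distributed` collectives; it is illustrative only and does not reproduce Gemini's internal chunk manager, and the names `reduce_chunk_async`, `chunks`, and `dp_group` are hypothetical. Each gradient chunk that becomes ready is reduced with `async_op=True`, and the returned work handles are only waited on before the optimizer step, so communication overlaps with the rest of the backward pass.

```python
# Illustrative sketch only (not Gemini's actual implementation):
# launch non-blocking reductions per gradient chunk, wait before the optimizer step.
import torch
import torch.distributed as dist


def reduce_chunk_async(chunk: torch.Tensor, dp_group, use_reduce_scatter: bool = True):
    """Start a non-blocking reduction for one flattened gradient chunk; return (work, result)."""
    if use_reduce_scatter:
        # assumes the chunk is padded to a multiple of the world size
        world_size = dist.get_world_size(dp_group)
        shard = torch.empty(chunk.numel() // world_size, dtype=chunk.dtype, device=chunk.device)
        work = dist.reduce_scatter_tensor(shard, chunk, group=dp_group, async_op=True)
        return work, shard
    work = dist.all_reduce(chunk, group=dp_group, async_op=True)
    return work, chunk


def reduce_ready_chunks_async(chunks, dp_group):
    """Reduce every ready chunk asynchronously so communication overlaps with backward."""
    pending = [reduce_chunk_async(c, dp_group) for c in chunks]
    # ... backward computation of the remaining layers keeps running here ...
    for work, _ in pending:
        work.wait()  # ensure all reductions have finished before the optimizer step
```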
💥 Checklist before requesting a review
- [ ] I have linked my PR to an issue (instruction)
- [ ] My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
- [ ] I have performed a self-review of my code
- [x] I have added thorough tests.
- [x] I have added docstrings for all the functions/methods I implemented
⭐️ Do you enjoy contributing to Colossal-AI?
- [x] 🌝 Yes, I do.
- [ ] 🌚 No, I don't.
Tell us more if you don't enjoy contributing to Colossal-AI.
trace
previous: (trace screenshot)
now: (trace screenshot)
benchmark

before
`colossalai run --nproc_per_node 8 --hostfile hosts.txt benchmark.py -g -x -b 4 -s 100`
num_samples: 392, dp_world_size: 8, flop_megatron: 9.7808637396779e+16, flop: 86555325938794496, avg_duration: 567.2686767578125, avg_throughput: 5.528244601700203
Throughput: 5.53 samples/sec, TFLOPS per GPU by Megatron: 172.42, TFLOPS per GPU: 152.58
Max CUDA memory usage: 53503.48 MB

now
`colossalai run --nproc_per_node 8 --hostfile hosts.txt benchmark.py -g -x -b 4 -s 100 --async-reduce`
num_samples: 392, dp_world_size: 8, flop_megatron: 9.7808637396779e+16, flop: 86555325938794496, avg_duration: 562.5335083007812, avg_throughput: 5.574779019782774
Throughput: 5.57 samples/sec, TFLOPS per GPU by Megatron: 173.87, TFLOPS per GPU: 153.87
Max CUDA memory usage: 53504.40 MB
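The numbers above show a small throughput gain (5.53 → 5.57 samples/sec) with essentially unchanged peak memory. Below is a hedged usage sketch of how the benchmark's `--async-reduce` switch presumably maps onto the Gemini plugin; the `enable_async_reduce` argument name is an assumption and should be checked against the installed `GeminiPlugin` signature. The script is expected to be launched through `colossalai run` or `torchrun`.

```python
# Hedged sketch: enabling async grad chunk reduction via the Gemini plugin.
# `enable_async_reduce` is an assumed parameter name behind the benchmark's
# `--async-reduce` flag; verify against the current ColossalAI release.
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch()  # start this script with `colossalai run` / `torchrun`

plugin = GeminiPlugin(
    precision="bf16",
    enable_async_reduce=True,  # assumed switch: overlap grad chunk reduction with backward
)
booster = Booster(plugin=plugin)

model = torch.nn.Linear(4096, 4096)
optimizer = HybridAdam(model.parameters(), lr=1e-3)
model, optimizer, *_ = booster.boost(model, optimizer)
```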