
[gemini] async grad chunk reduce (all-reduce&reduce-scatter)

Open botbw opened this issue 1 year ago • 1 comment

📌 Checklist before creating the PR

  • [ ] I have created an issue for this PR for traceability
  • [x] The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • [ ] I have added relevant tags if possible for us to better distinguish different PRs
  • [x] I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here. If you have any plots/diagrams/screenshots/tables, please attach them here.

💥 Checklist before requesting a review

  • [ ] I have linked my PR to an issue (instruction)
  • [ ] My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • [ ] I have performed a self-review of my code
  • [x] I have added thorough tests
  • [x] I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • [x] 🌝 Yes, I do.
  • [ ] 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

botbw avatar May 13 '24 10:05 botbw

trace

(screenshots: trace before this PR vs. after)
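The traces compare gradient reduction before and after this change: previously each chunk's all-reduce/reduce-scatter blocked the backward pass, while now the collective is launched asynchronously and only synchronized before the optimizer step. A conceptual sketch of that pattern (plain Python threads standing in for CUDA streams and `torch.distributed` collectives; not the actual Gemini code):

```python
# Conceptual sketch (not ColossalAI code): overlap gradient-chunk reduction
# with the rest of the backward pass using a thread pool. In the real PR the
# reduction is an async all-reduce/reduce-scatter on a communication stream.
from concurrent.futures import ThreadPoolExecutor

def reduce_chunk(chunk):
    # Stand-in for all-reduce: average each gradient across (simulated) ranks.
    world_size = len(chunk)
    return [sum(g) / world_size for g in zip(*chunk)]

def backward_with_async_reduce(grad_chunks):
    """grad_chunks: one entry per chunk, each a list of per-rank gradients."""
    pending = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for chunk in grad_chunks:
            # Launch the reduction as soon as the chunk is full; backward for
            # later chunks proceeds without waiting for communication.
            pending.append(pool.submit(reduce_chunk, chunk))
        # Synchronize all outstanding reductions once, before the optimizer step.
        return [f.result() for f in pending]

reduced = backward_with_async_reduce(
    [[[1.0, 3.0], [3.0, 5.0]],   # chunk 0: grads from rank 0 and rank 1
     [[2.0, 2.0], [4.0, 6.0]]]   # chunk 1
)
```

The key point is that the synchronization happens once, at the end, rather than per chunk inside the backward loop.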

benchmark

before

colossalai run --nproc_per_node 8 --hostfile hosts.txt benchmark.py -g -x -b 4 -s 100

num_samples: 392, dp_world_size: 8, flop_megatron: 9.7808637396779e+16, flop: 86555325938794496, avg_duration: 567.2686767578125, avg_throughput: 5.528244601700203
Throughput: 5.53 samples/sec, TFLOPS per GPU by Megatron: 172.42, TFLOPS per GPU: 152.58
Max CUDA memory usage: 53503.48 MB

now

colossalai run --nproc_per_node 8 --hostfile hosts.txt benchmark.py -g -x -b 4 -s 100 --async-reduce

num_samples: 392, dp_world_size: 8, flop_megatron: 9.7808637396779e+16, flop: 86555325938794496, avg_duration: 562.5335083007812, avg_throughput: 5.574779019782774
Throughput: 5.57 samples/sec, TFLOPS per GPU by Megatron: 173.87, TFLOPS per GPU: 153.87
Max CUDA memory usage: 53504.40 MB
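For reference, the logged metrics are self-consistent: throughput is total samples over wall time, and TFLOPS per GPU is per-sample FLOPs times throughput divided by the world size. A quick check using the baseline run's numbers (formulas inferred from the logs above, not taken from benchmark.py):

```python
# Sanity-check the logged benchmark metrics (numbers copied from the baseline
# run above; the formulas are inferred from the logs, not from benchmark.py).
num_samples, world_size = 392, 8
flop_megatron, flop = 9.7808637396779e+16, 86555325938794496
avg_duration = 567.2686767578125  # baseline run, seconds

throughput = num_samples * world_size / avg_duration             # samples/sec
tflops = flop / num_samples * throughput / world_size / 1e12
tflops_megatron = flop_megatron / num_samples * throughput / world_size / 1e12

print(f"{throughput:.2f} samples/sec, "
      f"{tflops_megatron:.2f} / {tflops:.2f} TFLOPS/GPU")
# → 5.53 samples/sec, 172.42 / 152.58 TFLOPS/GPU
```

Plugging in the `--async-reduce` run's avg_duration (562.53 s) reproduces its 5.57 samples/sec and 153.87 TFLOPS the same way: roughly a 0.8% throughput gain at essentially unchanged peak memory.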

botbw avatar May 14 '24 01:05 botbw