Scaling of multinode training
❓ The question
Hi
I am trying to optimise the scaling behaviour of multi-node training with OLMo, and I would like to know what is expected. Running on a Slurm cluster with A100 40GB GPUs, I get ~12000 tokens/sec/device with 1 node (4 GPUs) but only ~9000 tokens/sec/device when running with 2 nodes.
I have tested various settings, and discussed with the cluster admins, but I haven't managed to improve the scaling. This is with FSDP.
So my question is, is this the expected scaling performance of OLMo?
best Barry
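For reference, the figures in the question can be turned into a scaling efficiency. The sketch below is only arithmetic on the numbers quoted above; it assumes the 2-node run uses 8 GPUs in total (4 per node), and the function name `scaling_report` is made up for illustration.

```python
def scaling_report(base_tps_per_gpu, base_gpus, scaled_tps_per_gpu, scaled_gpus):
    """Compare aggregate throughput against ideal linear scaling."""
    base_total = base_tps_per_gpu * base_gpus        # tokens/sec, small run
    scaled_total = scaled_tps_per_gpu * scaled_gpus  # tokens/sec, large run
    ideal_total = base_tps_per_gpu * scaled_gpus     # perfect linear scaling
    return {
        "base_total": base_total,
        "scaled_total": scaled_total,
        "speedup": scaled_total / base_total,
        "efficiency": scaled_total / ideal_total,
    }


# Numbers from the question; 8 GPUs for 2 nodes is an assumption (4 per node).
report = scaling_report(12_000, 4, 9_000, 8)
print(report)
```

Under that assumption, doubling the hardware gives about 1.5x the aggregate throughput, i.e. roughly 75% scaling efficiency.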
I think this is more a question of the interconnect you have. How are the A100s connected to each other?
You can probably get a fair bit of extra performance out of it by running hybrid sharding, where the model is sharded inside a single node, but across nodes it uses data parallelism.
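To make the hybrid layout concrete, here is a minimal sketch of how ranks would be grouped for the 2-node, 4-GPU-per-node case. It is plain Python, not OLMo or PyTorch code: the helper `hybrid_shard_groups` is hypothetical, and it assumes global ranks are numbered contiguously within each node. In PyTorch FSDP this layout corresponds to `ShardingStrategy.HYBRID_SHARD`.

```python
def hybrid_shard_groups(num_nodes, gpus_per_node):
    """Return (shard_groups, replica_groups) as lists of global ranks.

    Assumes ranks are numbered contiguously within each node.
    """
    # Ranks on the same node hold different shards of one model copy.
    shard_groups = [
        list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        for node in range(num_nodes)
    ]
    # Ranks with the same local index on different nodes hold the same shard
    # and only need to synchronise gradients with each other.
    replica_groups = [
        [node * gpus_per_node + local for node in range(num_nodes)]
        for local in range(gpus_per_node)
    ]
    return shard_groups, replica_groups


shard_groups, replica_groups = hybrid_shard_groups(num_nodes=2, gpus_per_node=4)
print("shard groups (within a node):", shard_groups)
print("replica groups (across nodes):", replica_groups)
```

The point of this layout is that the parameter all-gathers and gradient reduce-scatters stay inside a node, and only a gradient all-reduce within each replica group has to cross the inter-node link.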
Hi Dirk
> I think this is more a question of the interconnect you have. How are the A100s connected to each other?
They are connected by InfiniBand.
> You can probably get a fair bit of extra performance out of it by running hybrid sharding, where the model is sharded inside a single node, but across nodes it uses data parallelism.
If I set `fsdp.sharding_strategy: "HYBRID_SHARD"`, is that sufficient to enable hybrid sharding?
best
Barry
In reply to the comment by Dirk Groeneveld (dirkgr) of 13/06/2025 18:38 on allenai/OLMo#845: https://github.com/allenai/OLMo/issues/845#issuecomment-2971083165