Scaling of multinode training

Open bhaddow opened this issue 7 months ago • 2 comments

❓ The question

Hi

I am trying to optimise the scaling behaviour of multi-node training with OLMo, and I would like to know what is expected. Running on a Slurm cluster with A100 40GB GPUs, I get ~12000 tokens/sec/device with 1 node (4 GPUs) but only ~9000 tokens/sec/device when running with 2 nodes.
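
For reference, the figures quoted above can be turned into a per-device scaling efficiency and an aggregate speedup. This is only a small arithmetic sketch; the function names are hypothetical and the inputs are the approximate numbers from this post, not new measurements.

```python
def aggregate_throughput(per_device, nodes, gpus_per_node):
    """Total tokens/sec across all devices."""
    return per_device * nodes * gpus_per_node


def scaling_efficiency(base_per_device, scaled_per_device):
    """Fraction of single-node per-device throughput retained after scaling out."""
    return scaled_per_device / base_per_device


# Approximate figures reported in the post (4 GPUs per node).
one_node = aggregate_throughput(12_000, nodes=1, gpus_per_node=4)
two_nodes = aggregate_throughput(9_000, nodes=2, gpus_per_node=4)

efficiency = scaling_efficiency(12_000, 9_000)
speedup = two_nodes / one_node

print(f"1 node:  {one_node} tokens/sec")
print(f"2 nodes: {two_nodes} tokens/sec")
print(f"per-device efficiency: {efficiency:.0%}, aggregate speedup: {speedup:.2f}x")
```

In other words, doubling the node count gives roughly 1.5x the total throughput instead of the ideal 2x.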

I have tested various settings, and discussed with the cluster admins, but I haven't managed to improve the scaling. This is with FSDP.

So my question is, is this the expected scaling performance of OLMo?

best Barry

bhaddow, May 29 '25 08:05

I think this is more a question of the interconnect you have. How are the A100s connected to each other?

You can probably get a fair bit of extra performance out of it by running hybrid sharding, where the model is sharded inside a single node, but across nodes it uses data parallelism.
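
To illustrate the idea, here is a minimal sketch of how ranks would be grouped under hybrid sharding (in PyTorch FSDP this corresponds to `ShardingStrategy.HYBRID_SHARD`). It assumes global ranks are assigned contiguously per node (rank = node index × GPUs per node + local rank), which is common but depends on the launcher. The function name is hypothetical and this is not OLMo's actual code.

```python
def hybrid_shard_groups(num_nodes, gpus_per_node):
    """Return (shard_groups, replicate_groups) as lists of global ranks.

    Assumes ranks are laid out contiguously per node.
    """
    # Parameters are sharded across the GPUs of one node, so the heavy
    # all-gather / reduce-scatter traffic stays on the intra-node links.
    shard_groups = [
        [node * gpus_per_node + local for local in range(gpus_per_node)]
        for node in range(num_nodes)
    ]
    # GPUs with the same local rank on different nodes hold the same shard;
    # they only all-reduce gradients for that shard across nodes.
    replicate_groups = [
        [node * gpus_per_node + local for node in range(num_nodes)]
        for local in range(gpus_per_node)
    ]
    return shard_groups, replicate_groups


# The setup from this thread: 2 nodes with 4 GPUs each.
shard_groups, replicate_groups = hybrid_shard_groups(num_nodes=2, gpus_per_node=4)
print("shard groups (intra-node):    ", shard_groups)
print("replicate groups (inter-node):", replicate_groups)
```

The trade-off is memory: each node holds a full copy of the model state split over its own GPUs, so the per-GPU footprint is higher than with full sharding across all 8 devices.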

dirkgr, Jun 13 '25 17:06

Hi Dirk

> I think this is more a question of the interconnect you have. How are the A100s connected to each other?

They are connected by InfiniBand.

> You can probably get a fair bit of extra performance out of it by running hybrid sharding, where the model is sharded inside a single node, but across nodes it uses data parallelism.

If I set fsdp.sharding_strategy: "HYBRID_SHARD", is that sufficient to enable hybrid sharding?

best

Barry

bhaddow, Jun 14 '25 10:06