Parth Mannan

Results: 22 comments of Parth Mannan

Just tried this branch, and while I do see some small overlap in the first few layers, the majority is still not overlapped and the AllGathers are launched much ahead of...
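For context, a minimal sketch (not the setup used in this thread) of how one might capture the trace showing this kind of overlap: profile a few steps with the PyTorch profiler and inspect where the NCCL AllGather kernels land relative to the compute kernels. The `train_step()` call is a hypothetical placeholder.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Record CPU ops and CUDA kernels for a few steps, then export a Chrome trace.
# In the trace viewer, AllGather (NCCL) kernels and compute kernels appear on
# separate CUDA streams, making it easy to see how much they overlap.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(3):
        train_step()  # hypothetical training-step function
        torch.cuda.synchronize()
prof.export_chrome_trace("trace.json")
```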

Another alternative option, which doesn't involve nvFuser magically having the context on what RoPE is and which group of ops is best to send to another executor - ###...

> Thank you!
>
> @parthmannan how do you feel about this from perf perspective? one concern might be the `_make_cudnn_sdpa_*_graph` expense, but maybe that's 1) cheaper than i worry...

@kshitij12345 - From our discussion, I remember you were looking into this. I believe this is what is causing the memory operations, but it needs further investigation.

@eqy Yes, it does. Either of the two env variables gives the same performance benefit. Is it fair to call this a memory fragmentation issue, or is this something else...

cc - @IvanYashchuk @mruberry Can we enable this env var by default in Thunder, or should we rely on NVIDIA containers to enable it?
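For illustration only, a minimal sketch of what enabling such a variable by default could look like; the specific variable and value (`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, which targets caching-allocator fragmentation) are assumptions here, since the two variables discussed above aren't named in this excerpt.

```python
import os

# Assumed variable/value for illustration: opt into expandable segments to
# reduce CUDA caching-allocator fragmentation, but only if the user hasn't
# already set their own allocator configuration.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # the allocator picks up the setting when CUDA is initialized
```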

So far I have seen cuDNN SDPA give a performance improvement and haven't seen any failures across a few models. I will test a few more models/configs and can confirm, but I...
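As a rough sketch of the kind of comparison behind this observation, PyTorch's own backend-selection API (not Thunder's cuDNN executor) can force the cuDNN SDPA backend so its output can be checked against the default selection; the shapes and tolerances below are arbitrary, and it assumes a recent PyTorch/cuDNN build with a CUDA device available.

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from torch.nn.functional import scaled_dot_product_attention

# Toy attention inputs: (batch, heads, seq_len, head_dim) in bf16 on GPU.
q = torch.randn(1, 16, 2048, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Force cuDNN SDPA so it can be timed/checked against the default backend.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out_cudnn = scaled_dot_product_attention(q, k, v, is_causal=True)

out_default = scaled_dot_product_attention(q, k, v, is_causal=True)
torch.testing.assert_close(out_cudnn, out_default, atol=2e-2, rtol=2e-2)
```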

Update: ZeRO2 AllGather overlap issues were fixed in #383 and the performance is looking much better now.

There may already be a plan for this since this is still a draft, and apologies if I am jumping ahead, but given the big increase in memory consumption with...

Just so I understand the snapshot above, the blue markers are memory allocations during the training step, right? Do we know the reason why `fsdp(jit(model))` has higher consumption? Is it...
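For reference, a minimal sketch of how a snapshot like the one above can be produced, assuming it came from PyTorch's built-in memory-history tooling (which this excerpt doesn't confirm); the resulting pickle can be inspected at https://pytorch.org/memory_viz.

```python
import torch

# Record allocator events (allocations/frees plus stack traces).
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run one or a few training steps of fsdp(jit(model)) here ...

# Dump the snapshot for the memory_viz viewer, then stop recording.
torch.cuda.memory._dump_snapshot("fsdp_jit_model_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
```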