Parth Mannan

Results: 22 comments of Parth Mannan

Just tried this branch, and while I do see some small overlap in the first few layers, the majority is still not overlapped and the AllGathers are launched much ahead of...
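For context, a minimal sketch (not the setup used in this thread) of how one might capture the trace showing this kind of overlap: profile a few steps with the PyTorch profiler and inspect where the NCCL AllGather kernels land relative to the compute kernels. The `train_step()` call is a hypothetical placeholder.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Record CPU ops and CUDA kernels for a few steps, then export a Chrome trace.
# In the trace viewer, AllGather (NCCL) kernels and compute kernels appear on
# separate CUDA streams, making it easy to see how much they overlap.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(3):
        train_step()  # hypothetical training-step function
        torch.cuda.synchronize()
prof.export_chrome_trace("trace.json")
```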

Another alternative option, which doesn't involve nvFuser magically having the context on what RoPE is and which group of ops is best to send to another executor - ###...

> Thank you!
>
> @parthmannan how do you feel about this from perf perspective? one concern might be the `_make_cudnn_sdpa_*_graph` expense, but maybe that's 1) cheaper than i worry...

@kshitij12345 - From our discussion, I remember you were looking into this. I believe this is what is causing the memory operations, but it needs further investigation.

@eqy Yes, it does. Either of the two env variables gives the same performance benefit. Is it fair to call this a memory fragmentation issue, or is this something else...

cc - @IvanYashchuk @mruberry Can we enable this env var by default in Thunder, or should we rely on NVIDIA containers to enable it?
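For illustration only, a minimal sketch of what enabling such a variable by default could look like; the specific variable and value (`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, which targets caching-allocator fragmentation) are assumptions here, since the two variables discussed above aren't named in this excerpt.

```python
import os

# Assumed variable/value for illustration: opt into expandable segments to
# reduce CUDA caching-allocator fragmentation, but only if the user hasn't
# already set their own allocator configuration.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # the allocator picks up the setting when CUDA is initialized
```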

So far I have seen cuDNN SDPA give a performance improvement and haven't seen any failures across a few models. I will test a few more models/configs and can confirm, but I...
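As a rough sketch of the kind of comparison behind this observation, PyTorch's own backend-selection API (not Thunder's cuDNN executor) can force the cuDNN SDPA backend so its output can be checked against the default selection; the shapes and tolerances below are arbitrary, and it assumes a recent PyTorch/cuDNN build with a CUDA device available.

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from torch.nn.functional import scaled_dot_product_attention

# Toy attention inputs: (batch, heads, seq_len, head_dim) in bf16 on GPU.
q = torch.randn(1, 16, 2048, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Force cuDNN SDPA so it can be timed/checked against the default backend.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out_cudnn = scaled_dot_product_attention(q, k, v, is_causal=True)

out_default = scaled_dot_product_attention(q, k, v, is_causal=True)
torch.testing.assert_close(out_cudnn, out_default, atol=2e-2, rtol=2e-2)
```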

Update: ZeRO2 AllGather overlap issues were fixed in #383 and the performance is looking much better now.

There may already be a plan for this since this is still a draft, and apologies if I am jumping ahead, but given the big increase in memory consumption with...

Just so I understand the snapshot above, the blue markers are memory allocations during the training step, right? Do we know the reason why `fsdp(jit(model))` has higher consumption? Is it...
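For reference, a minimal sketch of how a snapshot like the one above can be produced, assuming it came from PyTorch's built-in memory-history tooling (which this excerpt doesn't confirm); the resulting pickle can be inspected at https://pytorch.org/memory_viz.

```python
import torch

# Record allocator events (allocations/frees plus stack traces).
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run one or a few training steps of fsdp(jit(model)) here ...

# Dump the snapshot for the memory_viz viewer, then stop recording.
torch.cuda.memory._dump_snapshot("fsdp_jit_model_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
```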