Masaki Kozuki
I don't have any off the top of my head.
This is for ddp. We could reuse this for the fsdp backward as well, but the semantics would be different from what we have today.
Setting: 20240729 nightly image & 8 A100-SXM4-80GB devices

### Platypus-30B

If I tweak the number of layers to 36, it works. Without the tweak, it fails due to out of...
I took a memory snapshot of each to see whether the memory increase comes from the training step: one on main and one on this PR. It seems that the difference comes from outside...
> Just so I understand the snapshot above, the blue markers are memory allocations during the training step, right? Do we know the reason why `fsdp(jit(model))` has higher consumption? Is...
https://github.com/Lightning-AI/lightning-thunder/issues/564 could be related
@jjsjann123 would you have any idea about the comment of https://github.com/Lightning-AI/lightning-thunder/pull/936#issuecomment-2274967390?
`thunder.jit` of the following function with the nvfuser executor fails with the message below. By moving the copy for `a += b` to the end of the trace and replacing `a += b`...
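The rewrite described above can be sketched in plain PyTorch (a minimal illustration with hypothetical function names; the actual transform operates on thunder traces, not Python source):

```python
import torch


def f(a, b):
    # Original form: the in-place add mutates the argument mid-trace.
    a += b
    return a.sin()


def f_reordered(a, b):
    # Replace `a += b` with its out-of-place equivalent and defer the
    # copy back into `a` to the end of the trace.
    t = a + b
    out = t.sin()
    a.copy_(t)
    return out
```

Both versions leave `a` holding `a + b` and return `sin(a + b)`; deferring the `copy_` keeps the mutation of the input out of the middle of the fused region.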
Multiple in-place ops whose operand is the function's argument are not handled appropriately.

```python
import torch
import thunder

def f(a):
    return a.exp_().sin_()

if __name__ == "__main__":
    x = torch.randn(4, device="cuda", requires_grad=False)
    ...
```
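For the chained in-place ops above, the expected functionalized form would apply the out-of-place equivalents and copy back into the argument once, at the end. A sketch (the name `f_functionalized` is hypothetical, shown on CPU tensors):

```python
import torch


def f_inplace(a):
    # Original: two chained in-place ops, both mutating the argument.
    return a.exp_().sin_()


def f_functionalized(a):
    # Out-of-place equivalents, with a single deferred copy back into
    # the argument at the end of the trace.
    t0 = a.exp()
    t1 = t0.sin()
    a.copy_(t1)
    return a
```

Both leave the argument holding `sin(exp(a))`; handling only the last in-place op (or only the first) would produce a trace that diverges from eager PyTorch.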
> Is there anything wrong with that?

No, but it would look better if the variable names were like `params_0`, `grads_1`, `exp_avgs_2`.