Masaki Kozuki

Results: 167 comments of Masaki Kozuki

This is for ddp. We could reuse this for the fsdp backward as well, but the semantics would differ from what we have today.

Setting: 20240729 nightly image & 8 A100-SXM4-80GB devices

### Platypus-30B

If I tweak the number of layers to 36, it works. W/o the tweak, it fails due to out of...

I took memory snapshots of both to see whether the memory increase comes from the training step.

main
![image](https://github.com/user-attachments/assets/686bdae4-40ca-4332-90b3-6b394e5e328f)

pr
![image](https://github.com/user-attachments/assets/6e7fc815-9791-4483-bd46-c4764a804000)

It seems that the difference comes from outside...

> Just so I understand the snapshot above, the blue markers are memory allocation during the training step right? Do we know the reason why `fsdp(jit(model))` has higher consumption? Is...

https://github.com/Lightning-AI/lightning-thunder/issues/564 could be related

@jjsjann123 would you have any idea about the comment of https://github.com/Lightning-AI/lightning-thunder/pull/936#issuecomment-2274967390?

`thunder.jit`-ing the following function with nvfuserex fails with the message below. By moving the copy for `a += b` to the end of a trace and replacing `a += b`...
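The reordering described above can be illustrated with a toy stand-in (the `Tensor` class and `f_functionalized` below are hypothetical, not Thunder's API): the in-place `a += b` is traced as an out-of-place add, and the copy back into the argument is deferred to the end of the trace.

```python
class Tensor:
    """Minimal stand-in for a mutable tensor (illustrative only)."""

    def __init__(self, data):
        self.data = list(data)

    def __add__(self, other):
        # Out-of-place add: produces a fresh value, mutates nothing.
        return Tensor(x + y for x, y in zip(self.data, other.data))

    def copy_(self, src):
        # In-place writeback, analogous to torch.Tensor.copy_.
        self.data[:] = src.data
        return self


def f_functionalized(a, b):
    # "a += b" is traced as an out-of-place add...
    t0 = a + b
    result = Tensor(x * 2 for x in t0.data)
    # ...and the copy back into the argument is moved to the trace's end.
    a.copy_(t0)
    return result
```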

Multiple in-place ops whose operand is the function's argument are not handled appropriately.

```python
import torch
import thunder


def f(a):
    return a.exp_().sin_()


if __name__ == "__main__":
    x = torch.randn(4, device="cuda", requires_grad=False)...
```
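A minimal pure-Python sketch of what handling this would mean (`f_functionalized` is illustrative, not Thunder's actual transform): each chained in-place op on the argument is computed out-of-place, leaving a single writeback into the argument at the end.

```python
import math


def f_functionalized(a):
    # Out-of-place equivalent of a.exp_().sin_():
    t0 = [math.exp(x) for x in a]    # a.exp_()
    t1 = [math.sin(x) for x in t0]   # .sin_()
    a[:] = t1                        # one writeback into the caller's buffer
    return a
```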

> Is there anything wrong with that?

No, but it would look better if the variable names were like `params_0`, `grads_1`, `exp_avgs_2`.