_githubsgi
How do I map the position to a layer in the model? Also, what is the code that decides the split?
@soulitzer, thanks. Added a [PyTorch PR](https://github.com/pytorch/pytorch/pull/153021) that adds layer identification to checkpoint discrepancies.
A few questions. 1. Are there any more design/RFC docs on activation checkpointing other than [this](https://pytorch.org/blog/activation-checkpointing-techniques/)? 2. Is the AC metadata stored on the CPU? I guess the saved activations...
@tianyu-l, @soulitzer > I do see differences in the input to layers (e.g. x) between forward and recompute. Where could that come from? Could the RNG state...
@tianyu-l and @soulitzer, I was thinking about trying preserve_rng_state=True. The recompute difference shows up only for larger networks with MoE. It also appears that distributed is not necessary for this...
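Roughly what I had in mind (a minimal sketch with a made-up block, not the actual model or TorchTitan wiring):

```python
# Sketch: explicitly stashing/restoring RNG state around a checkpointed block
# so RNG-dependent ops (dropout, noise in a router, etc.) are replayed
# identically during recompute. `ToyBlock` and the sizes are placeholders.
import torch
from torch.utils.checkpoint import checkpoint

class ToyBlock(torch.nn.Module):
    def __init__(self, dim=5120, hidden=8192):
        super().__init__()
        self.fc1 = torch.nn.Linear(dim, hidden)
        self.drop = torch.nn.Dropout(0.1)   # RNG-dependent op inside the block
        self.fc2 = torch.nn.Linear(hidden, dim)

    def forward(self, x):
        return self.fc2(self.drop(torch.nn.functional.silu(self.fc1(x))))

block = ToyBlock()
x = torch.randn(128, 5120, requires_grad=True)

# preserve_rng_state=True (the default, as far as I know) saves and restores
# the CPU/CUDA RNG state so the recompute sees the same random numbers as the
# original forward pass.
out = checkpoint(block, x, use_reentrant=False, preserve_rng_state=True)
out.sum().backward()
```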
It is the MoE router, which sprays tokens across the different experts, that exposes this issue readily.
@tianyu-l, it is hard to say from the debug log where the divergence started. The difference shows up in the `.split_with_sizes.default` output: `[rank4]: ['$182: bf16[63, 5120]', '$183: bf16[65, 5120]']`...
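To illustrate what I mean (a toy sketch, not the actual router code): even a tiny numeric difference between the forward and the recompute can flip a routing decision, which changes the per-expert counts that `split_with_sizes` receives and therefore the shapes in the log.

```python
# Hypothetical router: sigmoid scores + top-k pick the expert per token, and
# the resulting per-expert counts become the sizes for split_with_sizes.
# Flipping one routing decision shifts a token between splits (e.g. 63 vs 64).
import torch

def route(logits, num_experts, top_k=1):
    scores = torch.sigmoid(logits)                  # router scores
    _, expert_idx = scores.topk(top_k, dim=-1)      # chosen expert per token
    counts = torch.bincount(expert_idx.flatten(), minlength=num_experts)
    return expert_idx, counts.tolist()

tokens = torch.randn(128, 5120)
w_router = torch.randn(5120, 2)

logits_fwd = tokens @ w_router
# Simulate a small bf16-level discrepancy in the recomputed activations.
logits_rec = (tokens + 1e-3 * torch.randn_like(tokens)) @ w_router

_, sizes_fwd = route(logits_fwd, num_experts=2)
_, sizes_rec = route(logits_rec, num_experts=2)
print(sizes_fwd, sizes_rec)  # per-expert token counts; may differ

# Downstream, the split sizes (and hence tensor shapes) diverge as well.
chunks_fwd = torch.split(tokens, sizes_fwd, dim=0)
chunks_rec = torch.split(tokens, sizes_rec, dim=0)
```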
Added the following to the save list: `torch.ops.aten.topk.default` and `torch.ops.aten.sigmoid.default`. Needed to use full recompute, though. Still see a recompute diff. @tianyu-l, @soulitzer, a couple of questions on the...
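For context, this is roughly how I am wiring the save list through the selective-checkpoint API; the other entries and the wrapper below are placeholders, not TorchTitan's actual `_save_list`:

```python
# Sketch: mark aten.topk and aten.sigmoid as MUST_SAVE so the router's
# decisions are stored during forward instead of being recomputed.
from functools import partial

import torch
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)

_save_list = {
    torch.ops.aten.mm.default,       # placeholder for the existing entries
    torch.ops.aten.topk.default,     # newly added: router top-k
    torch.ops.aten.sigmoid.default,  # newly added: router scores
}

def _policy(ctx, op, *args, **kwargs):
    # Save outputs of the listed ops; recompute everything else.
    return (
        CheckpointPolicy.MUST_SAVE
        if op in _save_list
        else CheckpointPolicy.PREFER_RECOMPUTE
    )

def apply_sac(module, *inputs):
    context_fn = partial(create_selective_checkpoint_contexts, _policy)
    return checkpoint(module, *inputs, use_reentrant=False, context_fn=context_fn)
```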
@tianyu-l, the source is HuggingFace, as mentioned above. I am seeing TorchTitan output as follows.
1B: INFO - Model llama3 1B size: 1,397,819,392 total parameters
3B: INFO - Model...
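For comparison, a quick way to get the HuggingFace-side count (the model id below is only a placeholder for the checkpoint I am actually using):

```python
# Cross-check TorchTitan's "total parameters" INFO line against the
# HuggingFace checkpoint of the same config.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")  # placeholder id
total = sum(p.numel() for p in model.parameters())
print(f"{total:,} total parameters")
```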
Thanks, I am familiar with that, where PP is set to 1. All my attempts at setting PP > 1 failed. Does the automatic slicing of layers work wit...