_githubsgi
How do I map the position to a layer in the model? Also, what is the code that decides the split?
@soulitzer, thanks. Added a [PyTorch PR](https://github.com/pytorch/pytorch/pull/153021) that adds layer identification to checkpoint discrepancies.
A few questions. 1. Are there any more design/RFC docs on activation checkpointing other than [this](https://pytorch.org/blog/activation-checkpointing-techniques/)? 2. Is the AC metadata stored on the CPU? I guess the saved activations...
@tianyu-l, @soulitzer > I do see differences in the input to layers (e.g. x) between forward and recompute. Where could that come from? Could the RNG state...
@tianyu-l and @soulitzer, I was thinking about trying preserve_rng_state=True. The recompute difference shows up only for larger networks with MoE. It also appears that distributed is not necessary for this...
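Roughly what I had in mind (a minimal sketch with a made-up block, not the actual model or TorchTitan wiring):

```python
# Sketch: explicitly stashing/restoring RNG state around a checkpointed block
# so RNG-dependent ops (dropout, noise in a router, etc.) are replayed
# identically during recompute. `ToyBlock` and the sizes are placeholders.
import torch
from torch.utils.checkpoint import checkpoint

class ToyBlock(torch.nn.Module):
    def __init__(self, dim=5120, hidden=8192):
        super().__init__()
        self.fc1 = torch.nn.Linear(dim, hidden)
        self.drop = torch.nn.Dropout(0.1)   # RNG-dependent op inside the block
        self.fc2 = torch.nn.Linear(hidden, dim)

    def forward(self, x):
        return self.fc2(self.drop(torch.nn.functional.silu(self.fc1(x))))

block = ToyBlock()
x = torch.randn(128, 5120, requires_grad=True)

# preserve_rng_state=True (the default, as far as I know) saves and restores
# the CPU/CUDA RNG state so the recompute sees the same random numbers as the
# original forward pass.
out = checkpoint(block, x, use_reentrant=False, preserve_rng_state=True)
out.sum().backward()
```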
It is the MoE router, which sprays tokens across the different experts, that exposes this issue readily.
@tianyu-l, it is hard to say from the debug log where the divergence started. The difference shows up in the `.split_with_sizes.default` output: `[rank4]: ['$182: bf16[63, 5120]', '$183: bf16[65, 5120]']`...
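To illustrate what I mean (a toy sketch, not the actual router code): even a tiny numeric difference between the forward and the recompute can flip a routing decision, which changes the per-expert counts that `split_with_sizes` receives and therefore the shapes in the log.

```python
# Hypothetical router: sigmoid scores + top-k pick the expert per token, and
# the resulting per-expert counts become the sizes for split_with_sizes.
# Flipping one routing decision shifts a token between splits (e.g. 63 vs 64).
import torch

def route(logits, num_experts, top_k=1):
    scores = torch.sigmoid(logits)                  # router scores
    _, expert_idx = scores.topk(top_k, dim=-1)      # chosen expert per token
    counts = torch.bincount(expert_idx.flatten(), minlength=num_experts)
    return expert_idx, counts.tolist()

tokens = torch.randn(128, 5120)
w_router = torch.randn(5120, 2)

logits_fwd = tokens @ w_router
# Simulate a small bf16-level discrepancy in the recomputed activations.
logits_rec = (tokens + 1e-3 * torch.randn_like(tokens)) @ w_router

_, sizes_fwd = route(logits_fwd, num_experts=2)
_, sizes_rec = route(logits_rec, num_experts=2)
print(sizes_fwd, sizes_rec)  # per-expert token counts; may differ

# Downstream, the split sizes (and hence tensor shapes) diverge as well.
chunks_fwd = torch.split(tokens, sizes_fwd, dim=0)
chunks_rec = torch.split(tokens, sizes_rec, dim=0)
```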
Added the following to the save list: `torch.ops.aten.topk.default` and `torch.ops.aten.sigmoid.default`. Needed to use full recompute, though. Still see a recompute diff. @tianyu-l, @soulitzer, a couple of questions on the...
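For context, this is roughly how I am wiring the save list through the selective-checkpoint API; the other entries and the wrapper below are placeholders, not TorchTitan's actual `_save_list`:

```python
# Sketch: mark aten.topk and aten.sigmoid as MUST_SAVE so the router's
# decisions are stored during forward instead of being recomputed.
from functools import partial

import torch
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)

_save_list = {
    torch.ops.aten.mm.default,       # placeholder for the existing entries
    torch.ops.aten.topk.default,     # newly added: router top-k
    torch.ops.aten.sigmoid.default,  # newly added: router scores
}

def _policy(ctx, op, *args, **kwargs):
    # Save outputs of the listed ops; recompute everything else.
    return (
        CheckpointPolicy.MUST_SAVE
        if op in _save_list
        else CheckpointPolicy.PREFER_RECOMPUTE
    )

def apply_sac(module, *inputs):
    context_fn = partial(create_selective_checkpoint_contexts, _policy)
    return checkpoint(module, *inputs, use_reentrant=False, context_fn=context_fn)
```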
@tianyu-l, the source is HuggingFace, as mentioned above. I am seeing TorchTitan output as follows.
1B: INFO - Model llama3 1B size: 1,397,819,392 total parameters
3B: INFO - Model...
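For comparison, a quick way to get the HuggingFace-side count (the model id below is only a placeholder for the checkpoint I am actually using):

```python
# Cross-check TorchTitan's "total parameters" INFO line against the
# HuggingFace checkpoint of the same config.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")  # placeholder id
total = sum(p.numel() for p in model.parameters())
print(f"{total:,} total parameters")
```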
Thanks, I am familiar with that, where PP is set to 1. All my attempts at setting PP > 1 failed. Does the automatic slicing of layers work wit...