Chien-Chin Huang
We do not plan to support DDP + TP as we have not identified any major use cases for this combination. When working with large models, it is more common...
We didn't verify the accuracy, only the composability, so there may be accuracy issues.
I think we should disable grad clipping for now when optimizer-in-backward is used. This is a known issue with optimizer in backward, as @apaz-cli mentioned. I also heard some...
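For context, a minimal sketch of the optimizer-in-backward pattern (assuming the per-parameter `register_post_accumulate_grad_hook` approach; the model and hyperparameters are made up for illustration), which shows why a global grad-norm clip has nothing left to work on after backward:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 128)

# one tiny optimizer per parameter, stepped inside backward
optimizers = {p: torch.optim.SGD([p], lr=1e-2) for p in model.parameters()}

def optimizer_hook(param):
    # called as soon as this parameter's grad has been accumulated
    optimizers[param].step()
    optimizers[param].zero_grad()  # grad is freed here, during backward

for p in model.parameters():
    p.register_post_accumulate_grad_hook(optimizer_hook)

loss = model(torch.randn(4, 128)).sum()
loss.backward()

# At this point every parameter has already been updated and its grad cleared,
# so torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) would see no
# gradients -- the global norm cannot be computed after the fact.
```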
> I don't think it works to use the norm of the previous step. First, because I tried it and it didn't work very well. But also because I believe...
autocast is the right way to do mixed precision with DDP.
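A minimal sketch of what that looks like (assuming a single-node CUDA setup with the process group already initialized; the model, shapes, and bf16 dtype are illustrative, not the exact TorchTitan configuration):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes torch.distributed.init_process_group(...) has already been called
# and the current CUDA device has been set for this rank
model = nn.Linear(1024, 1024).cuda()
ddp_model = DDP(model)
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

inputs = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

# run the forward pass and loss under autocast; parameters and gradients
# stay in fp32, so the optimizer step is unchanged
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(ddp_model(inputs), target)

loss.backward()
optimizer.step()
optimizer.zero_grad()
```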
https://github.com/pytorch/torchtitan/pull/647 should fix the issue.
> on 8 GPUs, DP2 TP4
> compile + selective op AC + TP:
> got failure

@tianyu-l What's your command? I could not reproduce the same issue on A100...
@tianyu-l
> You also need selective_ac_option = 'op' instead of the default 2

That's already the default value; 2 is not the default. And I verified that even if I specify `--activation_checkpoint.selective_ac_option="op"`,...
@danielvegamyhre Very interesting finding. I thought inductor would prefix the cache folder name with a timestamp, but it looks like I was wrong.
I did some tests with the latest PyTorch and TorchTitan. The results contradict some of the observations above. For llama3 8b with full AC and TP8, the performance is quite bad w/ or...