Chien-Chin Huang
We do not plan to support DDP + TP as we have not identified any major use cases for this combination. When working with large models, it is more common...
We didn't verify the accuracy, only the composability, so there may be accuracy issues.
I think we should disable grad clipping for now when optimizer-in-backward is used. This is a known issue with optimizer in backward, as @apaz-cli mentioned. I also heard some...
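For context, a minimal sketch of the optimizer-in-backward pattern (assuming the per-parameter `register_post_accumulate_grad_hook` approach; the model and hyperparameters are made up for illustration), which shows why a global grad-norm clip has nothing left to work on after backward:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 128)

# one tiny optimizer per parameter, stepped inside backward
optimizers = {p: torch.optim.SGD([p], lr=1e-2) for p in model.parameters()}

def optimizer_hook(param):
    # called as soon as this parameter's grad has been accumulated
    optimizers[param].step()
    optimizers[param].zero_grad()  # grad is freed here, during backward

for p in model.parameters():
    p.register_post_accumulate_grad_hook(optimizer_hook)

loss = model(torch.randn(4, 128)).sum()
loss.backward()

# At this point every parameter has already been updated and its grad cleared,
# so torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) would see no
# gradients -- the global norm cannot be computed after the fact.
```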
> I don't think it works to use the norm of the previous step. First, because I tried it and it didn't work very well. But also because I believe...
autocast is the right way to do mixed precision with DDP.
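A minimal sketch of what that looks like (assuming a single-node CUDA setup with the process group already initialized; the model, shapes, and bf16 dtype are illustrative, not the exact TorchTitan configuration):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes torch.distributed.init_process_group(...) has already been called
# and the current CUDA device has been set for this rank
model = nn.Linear(1024, 1024).cuda()
ddp_model = DDP(model)
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

inputs = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

# run the forward pass and loss under autocast; parameters and gradients
# stay in fp32, so the optimizer step is unchanged
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(ddp_model(inputs), target)

loss.backward()
optimizer.step()
optimizer.zero_grad()
```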
https://github.com/pytorch/torchtitan/pull/647 should fix the issue.
> on 8 GPUs, DP2 TP4
> compile + selective op AC + TP:
> got failure

@tianyu-l What's your command? I could not reproduce the same issue on A100...
@tianyu-l
> You also need selective_ac_option = 'op' instead of the default 2

That's already the default value; 2 is not the default. And I verified that even if I specify `--activation_checkpoint.selective_ac_option="op"`,...
@danielvegamyhre Very interesting finding. I thought inductor would prefix the cache folder name with a timestamp, but it looks like I was wrong.
I did some tests with the latest PyTorch and TorchTitan. The results contradict some of the observations above. For llama3 8b with full AC and TP8, the performance is quite bad w/ or...