Walker comments

Results: 5 comments of Walker

Problem saving the pretrained model

Maybe the experiment name in your yml file is the same as this one, and `auto_resume` is also specified?

[torchtitan][replicate] experimenting new replicate integration with torchtitan

In this case, mixed precision for `replicate_with_fsdp` should be handled by `fully_shard` instead of AMP. This means we need to modify `maybe_enable_amp()` in `torchtitan/distributed/utils.py` to accommodate `replicate_with_fsdp`. By the way,...
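To illustrate the point, here is a minimal sketch of the kind of branch such a change might add. It is not torchtitan's actual `maybe_enable_amp()`; the function name `pick_mixed_precision_context`, the mode strings, and the `make_amp_context` callback are all hypothetical and exist only for this example.

```python
from contextlib import nullcontext

# Hypothetical mode names, for illustration only. The assumption is that
# these modes wrap the model with fully_shard, which applies its own
# mixed-precision policy.
MODES_HANDLED_BY_FULLY_SHARD = {"fsdp", "hsdp", "replicate_with_fsdp"}


def pick_mixed_precision_context(mode, make_amp_context):
    """Return a no-op context when fully_shard owns mixed precision.

    `make_amp_context` is a zero-argument callable that builds the AMP
    context, e.g. `lambda: torch.autocast("cuda", dtype=torch.bfloat16)`.
    It is only called when AMP is actually needed.
    """
    if mode in MODES_HANDLED_BY_FULLY_SHARD:
        # fully_shard already casts parameters and gradients, so AMP is skipped.
        return nullcontext()
    return make_amp_context()
```

With this shape, `replicate_with_fsdp` takes the same no-op path as the other `fully_shard`-based modes, and only the remaining modes fall back to AMP.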

[torchtitan][replicate] experimenting new replicate integration with torchtitan

> my request changes is mainly on 2d mesh. we should target 1d mesh for landing. it's a user contract in public facing api I think the use of 2D...

why not using set_requires_all_reduce for hsdp

By the way, unlike `set_requires_gradient_sync`, `set_requires_all_reduce` does not incur an additional memory burden. @tianyu-l
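As a rough sketch of the idea, gradient accumulation under HSDP could toggle `set_requires_all_reduce` per microbatch: the reduce-scatter still runs every step, so accumulated gradients stay sharded, and only the cross-replica all-reduce is deferred to the last microbatch. The helper name `accumulate_gradients` and the `loss_fn(model, batch)` signature are assumptions made for this example; `model` is assumed to be a `fully_shard`-wrapped module exposing `set_requires_all_reduce`.

```python
def accumulate_gradients(model, microbatches, loss_fn):
    """Run backward over microbatches, all-reducing only on the last one.

    Assumes `model` exposes `set_requires_all_reduce(bool)` (as an
    FSDP2-wrapped module does) and that `loss_fn(model, batch)` returns
    an object with a `.backward()` method. Both are illustration-only
    assumptions.
    """
    last = len(microbatches) - 1
    for i, batch in enumerate(microbatches):
        # Reduce-scatter still happens on every microbatch, so gradients
        # stay sharded; only the all-reduce across replicas is skipped
        # until the final microbatch.
        model.set_requires_all_reduce(i == last)
        loss = loss_fn(model, batch)
        loss.backward()
```

By contrast, disabling `set_requires_gradient_sync` skips the reduce-scatter as well, so unsharded gradients must be kept in memory between microbatches.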

Empty training_states