torchtitan
A PyTorch native library for large-scale model training
Would like to discuss adding a `dry_run` flag as a general option. What it does: if the user adds `--dry_run`, then the config file specified is still run with everything the...
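A minimal sketch of how such a flag could be wired in, assuming an argparse-based launcher; the parser layout and the `build_parser`/`main` names are hypothetical, not torchtitan's actual implementation:

```python
# Hypothetical sketch of a --dry_run option for a training launcher.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="training launcher (sketch)")
    parser.add_argument("--config", type=str, help="path to the job config file")
    parser.add_argument(
        "--dry_run",
        action="store_true",
        help="validate the config and build the model, but skip training",
    )
    return parser

def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    if args.dry_run:
        # Config parsing / model construction would run here, then we exit
        # before entering the training loop.
        print("dry run: config validated, skipping training")
        return
    # ... normal training path ...
```

With `action="store_true"`, the flag defaults to `False`, so existing invocations are unaffected.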
In parallelism/__init__.py, `build_mesh` uses `zip` with the kwarg `strict=True`:
```
for d, name in zip(
    [self.dp, self.sp, self.pp], ["dp", "sp", "pp"], strict=True
):
```
This is apparently a Python 3.10+ keyword, ...
Currently, torch compiling the default llama model will generate a warning about being unable to lower complex numbers:
```
torch/_inductor/lowering.py:1639: UserWarning: Torchinductor does not support code generation for complex operators....
```
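The complex ops come from the rotary embedding (`freqs_cis` multiplication). One common workaround, shown here as a sketch rather than torchtitan's code, is to express the rotation with real-valued cos/sin arithmetic, which Inductor can lower:

```python
# Sketch: rotary position embedding using real arithmetic instead of
# complex multiplication (which Inductor cannot codegen).
import torch

def apply_rope_real(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (..., dim) with dim even; cos/sin: (..., dim // 2).
    # Adjacent pairs (x1, x2) play the role of the real/imaginary parts.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin   # real part of (x1 + i*x2) * e^{i*theta}
    out[..., 1::2] = x1 * sin + x2 * cos   # imaginary part
    return out
```

This computes the same rotation as `torch.view_as_complex(...) * freqs_cis` but never materializes a complex tensor, so the warning does not fire.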
The tentative tests we could add:
1. test that the llama debug model init and forward/backward work
2. test that checkpoint save/load works
3. metrics logging test (metrics to be added)
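The first test above could look roughly like this; `build_debug_model` is a hypothetical stand-in for the real llama debug model constructor:

```python
# Sketch of a forward/backward smoke test for a tiny debug model.
import torch
import torch.nn as nn

def build_debug_model(vocab_size: int = 32, dim: int = 8) -> nn.Module:
    # Stand-in for the llama debug model: tiny embedding + linear head.
    return nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))

def test_debug_model_forward_backward():
    model = build_debug_model()
    tokens = torch.randint(0, 32, (2, 4))   # (batch, seq)
    logits = model(tokens)                  # forward pass
    loss = logits.float().mean()
    loss.backward()                         # backward pass
    # Every parameter should have received a gradient.
    assert all(p.grad is not None for p in model.parameters())
```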
Showcase context parallelism here once that feature is ready.