Iris Z

Results: 34 comments by Iris Z

> `CONFIG_FILE=./train_configs/llama_1b.toml ./run_llama_train.sh`

@chauhang Are you using llama_7b.toml, or do you have a llama_1b.toml that is not checked in to main? Just want to make sure I have the exact...

> train_timeout_seconds

@wconstab Thanks for looking into the issue. If it's what you suspected, I think changing `dcp.save` to `dcp.async_save` would potentially help with this, as we would de-stage the state_dict...

> @awgu Thanks for your review. I'll add some unit tests.

Thanks! The unit test can go inside https://github.com/pytorch/pytorch/blob/main/test/distributed/test_device_mesh.py#L53

The RFC looks good to me. The only thing I can think of is that when both save_folder and load_folder exist and they don't have the same parent folder,...