Scott Hoang

21 comments by Scott Hoang

So, in the case of a small model (< 13B), when we want to scale it across multiple nodes and increase throughput, it is better to use HSDP with device_mesh...
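
A minimal sketch of that kind of setup, assuming a 2-node × 8-GPU cluster launched with `torchrun` (the mesh shape and the `torch.nn.Linear` stand-in model are placeholders, not the actual recipe):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Assumes a torchrun launch across 2 nodes x 8 GPUs (16 processes total).
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 2D mesh: replicate across the 2 nodes, shard across the 8 GPUs within each node.
mesh_2d = init_device_mesh("cuda", (2, 8), mesh_dim_names=("replicate", "shard"))

model = torch.nn.Linear(1024, 1024).cuda()  # stand-in for the real model

hsdp_model = FSDP(
    model,
    device_mesh=mesh_2d,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # shard intra-node, replicate inter-node
)
```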

@awgu Actually, one last question: in hybrid shard mode with multiple nodes, does "sync_module_states" still broadcast rank 0's params to ranks on different nodes?
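
For context, a minimal sketch of the usage being asked about (the seed trick just makes the per-rank divergence visible; it is not from the original thread):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Each rank builds the module with a different random init, so the weights deliberately disagree.
torch.manual_seed(dist.get_rank())
model = torch.nn.Linear(1024, 1024).cuda()

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    sync_module_states=True,  # requests a broadcast of rank 0's params/buffers before sharding
)
```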

Sure! I would love to contribute back. Let me see what I can do

Awesome!

Hi @ebsmothers, I am running a custom recipe inherited from lora_finetune_distributed. Everything else is kept the same except for _setup_model(...).
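
A minimal sketch of that pattern, with the parent class name, import path, and `_setup_model` signature all assumed (in practice the recipe is copied locally, e.g. via `tune cp`, and subclassed there):

```python
# Sketch only: the class name and the pass-through *args/**kwargs are assumptions,
# since the real _setup_model signature lives in the copied recipe file.
from lora_finetune_distributed import LoRAFinetuneRecipeDistributed  # assumed local copy of the recipe


class CustomAdapterRecipe(LoRAFinetuneRecipeDistributed):
    def _setup_model(self, *args, **kwargs):
        # Reuse the parent setup, then attach the custom adapters configured in the YAML.
        model = super()._setup_model(*args, **kwargs)
        # ... hypothetical hook: patch `model` with the custom adapters here ...
        return model
```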

The save function looks like this:

```python
def save_checkpoint(
    self,
    epoch: int,
) -> None:
    """
    Checkpoint the state of the recipe. The constructed checkpoint state dict contains the following...
```

@ebsmothers Indeed, `self.adapter_settings` contains all the configs specific to my adapters from the YAML config. But based on this, I assume it isn't saved in "recipe_state.pt" itself? https://github.com/pytorch/torchtune/blob/069b12bef0b9cf735d5fb7cdc4192bfbf9abd764/torchtune/utils/_checkpointing/_checkpointer.py#L593 I also...
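
A hypothetical illustration of persisting such adapter settings alongside the rest of the recipe state (key names and values are made up, not torchtune's actual checkpoint schema):

```python
import torch

# Made-up adapter settings standing in for whatever self.adapter_settings holds.
adapter_settings = {"rank": 8, "alpha": 16, "target_modules": ["q_proj", "v_proj"]}

# Bundle them into the dict that gets written to recipe_state.pt so they can be restored.
recipe_state = {
    "epochs_run": 1,
    "adapter_settings": adapter_settings,
}
torch.save(recipe_state, "recipe_state.pt")

restored = torch.load("recipe_state.pt")
assert restored["adapter_settings"]["rank"] == 8
```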

I resolved it. Will post a PR by EOD.

Hi @NathanHB, thanks for the response! I have reformatted the file back to its original state. Can you review? 🙏