Scott Hoang

21 comments by Scott Hoang

So, in the case of a small model (< 13B), when we want to scale it across multiple nodes and increase throughput, it is better to use HSDP with device_mesh...
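
A minimal sketch of that kind of setup, assuming a 2-node × 8-GPU cluster launched with `torchrun` (the mesh shape and the `torch.nn.Linear` stand-in model are placeholders, not the actual recipe):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Assumes a torchrun launch across 2 nodes x 8 GPUs (16 processes total).
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 2D mesh: replicate across the 2 nodes, shard across the 8 GPUs within each node.
mesh_2d = init_device_mesh("cuda", (2, 8), mesh_dim_names=("replicate", "shard"))

model = torch.nn.Linear(1024, 1024).cuda()  # stand-in for the real model

hsdp_model = FSDP(
    model,
    device_mesh=mesh_2d,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # shard intra-node, replicate inter-node
)
```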

@awgu Actually, one last question: in hybrid shard mode with multiple nodes, does "sync_module_states" still broadcast rank 0's params to ranks on different nodes?
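
For context, a minimal sketch of the usage being asked about (the seed trick just makes the per-rank divergence visible; it is not from the original thread):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Each rank builds the module with a different random init, so the weights deliberately disagree.
torch.manual_seed(dist.get_rank())
model = torch.nn.Linear(1024, 1024).cuda()

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    sync_module_states=True,  # requests a broadcast of rank 0's params/buffers before sharding
)
```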

Sure! I would love to contribute back. Let me see what I can do

Awesome!

Hi @ebsmothers, I am running a custom recipe inherited from lora_finetune_distributed. Everything else is kept the same except for _setup_model(...).
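
A minimal sketch of that pattern, with the parent class name, import path, and `_setup_model` signature all assumed (in practice the recipe is copied locally, e.g. via `tune cp`, and subclassed there):

```python
# Sketch only: the class name and the pass-through *args/**kwargs are assumptions,
# since the real _setup_model signature lives in the copied recipe file.
from lora_finetune_distributed import LoRAFinetuneRecipeDistributed  # assumed local copy of the recipe


class CustomAdapterRecipe(LoRAFinetuneRecipeDistributed):
    def _setup_model(self, *args, **kwargs):
        # Reuse the parent setup, then attach the custom adapters configured in the YAML.
        model = super()._setup_model(*args, **kwargs)
        # ... hypothetical hook: patch `model` with the custom adapters here ...
        return model
```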

The save function looks like this:

```python
def save_checkpoint(
    self,
    epoch: int,
) -> None:
    """
    Checkpoint the state of the recipe. The constructed checkpoint state dict contains the following...
```

@ebsmothers Indeed, `self.adapter_settings` contains all the configs specific to my adapters from the YAML config. But based on this, I assume it isn't saved in "recipe_state.pt" itself? https://github.com/pytorch/torchtune/blob/069b12bef0b9cf735d5fb7cdc4192bfbf9abd764/torchtune/utils/_checkpointing/_checkpointer.py#L593 I also...
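
A hypothetical illustration of persisting such adapter settings alongside the rest of the recipe state (key names and values are made up, not torchtune's actual checkpoint schema):

```python
import torch

# Made-up adapter settings standing in for whatever self.adapter_settings holds.
adapter_settings = {"rank": 8, "alpha": 16, "target_modules": ["q_proj", "v_proj"]}

# Bundle them into the dict that gets written to recipe_state.pt so they can be restored.
recipe_state = {
    "epochs_run": 1,
    "adapter_settings": adapter_settings,
}
torch.save(recipe_state, "recipe_state.pt")

restored = torch.load("recipe_state.pt")
assert restored["adapter_settings"]["rank"] == 8
```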

I resolved it. Will post a PR by EOD.

Hi @NathanHB, thanks for the response! I have reformatted the file back to its original state. Can you review? 🙏