Kaiyang Guo
Hi @awgu, thanks for doing this. Since FSDP + generate is very slow, I wonder whether this patch also improves efficiency?
> @kygguo Are you using `reshard_after_forward=True` / `FULL_SHARD`? I am new to FSDP; it would be nice if you could hint where I can check this. Basically, I am running the official DPO...
Just found it's `FULL_SHARD`, but I think I can change it to another strategy if there's room for speedup.
Sure, will report back later.
Hi @awgu, passing `sharding_strategy=ShardingStrategy.SHARD_GRAD_OP` helps! When I previously used `FULL_SHARD`, running the code got stuck in `model.generate()` and never returned. Changing to `SHARD_GRAD_OP` avoids this, even if I use...
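For anyone else hitting this: a minimal sketch of what the change above looks like when wrapping a model (not the exact script from this thread; the model and wrapping point are placeholders):

```python
# Sketch: wrap a model with FSDP using SHARD_GRAD_OP instead of the
# default FULL_SHARD. Assumes torch.distributed is already initialized.
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy


def wrap_model(model: nn.Module) -> FSDP:
    # SHARD_GRAD_OP shards only gradients and optimizer states, keeping
    # full parameters resident after forward, so the repeated forward
    # passes inside generate() avoid the per-step all-gathers that
    # FULL_SHARD (reshard_after_forward) performs.
    return FSDP(model, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)
```

The trade-off is higher memory usage than `FULL_SHARD`, since parameters are not resharded between forward and backward.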
Thanks for all the above!
I have the same problem, and increasing the NCCL timeout threshold works for me:

```python
import torch.distributed as dist
from datetime import timedelta

dist.init_process_group(backend='nccl', init_method='env://', timeout=timedelta(hours=2))
```
Hi, is there any update regarding this issue? It has bothered me for quite a few days.