properly extend the collective timeout #2093
What is the purpose of this PR? Is it to
- [x] fix a bug https://github.com/pytorch/torchtune/issues/2093
https://github.com/pengyanai/torchtune/blob/2d2323dfa72e83c1a15a9f6fc690e6a86dceaea7/recipes/full_finetune_distributed.py#L152-L153
I've done a similar thing on my personal branch. I'd recommend reading a new config option `collective_timeout`, defaulting to the standard 600 seconds, with users having the option to increase it.
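For reference, a rough sketch of what I mean (the `collective_timeout` key name and the `cfg.get` default are my assumptions, not necessarily what this PR ships):

```python
from datetime import timedelta

import torch.distributed as dist


def init_distributed(cfg) -> None:
    # Read the timeout from the recipe config; 600 s matches the
    # standard NCCL collective timeout. (Config key name is hypothetical.)
    timeout_s = cfg.get("collective_timeout", 600)
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(seconds=timeout_s),
    )
```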
I also added a trivially small collective after the init (summing a random tensor on each rank) because I came across this comment. I'm not certain that I'm interpreting it correctly or that it's required, but it's trivially cheap. (Specific issue)
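The warm-up collective I mentioned is just something like this (again a sketch; the helper name and its placement right after `init_process_group` are assumptions on my part):

```python
import torch
import torch.distributed as dist


def _warmup_collective(device: torch.device) -> None:
    # Run one trivially small all_reduce (sum of a random scalar per rank)
    # right after init so the process group is exercised before the first
    # real collective, which may only happen much later in the recipe.
    t = torch.rand(1, device=device)
    dist.all_reduce(t)
```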
@nathan-az Thanks for sharing your approach, Nathan. Adding a configurable `collective_timeout` option makes sense, and I agree that giving users flexibility is useful. Including a small collective operation after init also seems like a good safeguard, even if it's not strictly required. I'll look into the linked issues and comments; appreciate the reference!