properly extend the collective timeout #2093
What is the purpose of this PR? Is it to
- [x] fix a bug https://github.com/pytorch/torchtune/issues/2093
https://github.com/pengyanai/torchtune/blob/2d2323dfa72e83c1a15a9f6fc690e6a86dceaea7/recipes/full_finetune_distributed.py#L152-L153
I've done a similar thing on my personal branch. I'd recommend reading a new config option `collective_timeout`, defaulting to the standard 600 seconds, with users having the option to increase it.
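For reference, a rough sketch of what I mean (the `collective_timeout` key name and the `cfg.get` default are my assumptions, not necessarily what this PR ships):

```python
from datetime import timedelta

import torch.distributed as dist


def init_distributed(cfg) -> None:
    # Read the timeout from the recipe config; 600 s matches the
    # standard NCCL collective timeout. (Config key name is hypothetical.)
    timeout_s = cfg.get("collective_timeout", 600)
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(seconds=timeout_s),
    )
```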
I also added a trivially small collective after the init (summing a random tensor on each rank) because I came across this comment. I'm not certain that I'm interpreting it correctly or that it's required, but it's trivially cheap. (Specific issue)
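The warm-up collective I mentioned is just something like this (again a sketch; the helper name and its placement right after `init_process_group` are assumptions on my part):

```python
import torch
import torch.distributed as dist


def _warmup_collective(device: torch.device) -> None:
    # Run one trivially small all_reduce (sum of a random scalar per rank)
    # right after init so the process group is exercised before the first
    # real collective, which may only happen much later in the recipe.
    t = torch.rand(1, device=device)
    dist.all_reduce(t)
```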
@nathan-az Thanks for sharing your approach, Nathan. Adding a configurable `collective_timeout` option makes sense, and I agree that giving users flexibility is useful. Including a small collective operation after init also seems like a good safeguard, even if it's not strictly required. I'll look into the linked issues and comments; appreciate the reference!