torchft

Pass timeout on python futures to collective libraries

Open tushar00jain opened this issue 6 months ago • 0 comments

If the timeout on a Python future differs from the timeout of the underlying collective, e.g. in Gloo, the Python code is allowed to continue because, from its perspective, the future has completed. The underlying future in C++, however, is not yet done. This is particularly problematic when we try to reconfigure process groups: if the underlying futures aren't done yet, we have to wait for them to complete before we can destroy the current process group.
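To illustrate the mismatch, here is a minimal sketch that uses only the Python standard library. It is not torchft or Gloo code: `fake_collective`, `underlying_still_running`, and the timeout values are hypothetical stand-ins, with a worker thread playing the role of the C++ collective. It assumes the collective honors whatever timeout it is given.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def fake_collective(timeout_s: float, work_s: float, done: threading.Event) -> str:
    """Hypothetical stand-in for a collective that enforces its own timeout."""
    start = time.monotonic()
    try:
        while time.monotonic() - start < work_s:
            if time.monotonic() - start >= timeout_s:
                raise TimeoutError("collective timed out")
            time.sleep(0.01)
        return "ok"
    finally:
        # Signal that the underlying operation has really finished.
        done.set()


def underlying_still_running(
    python_timeout_s: float,
    collective_timeout_s: float,
    work_s: float = 1.0,
    grace_s: float = 0.2,
) -> bool:
    """Return True if the collective is still in flight after the Python-side
    wait has given up (plus a short grace period)."""
    done = threading.Event()
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(fake_collective, collective_timeout_s, work_s, done)
        try:
            fut.result(timeout=python_timeout_s)
        except (FutureTimeout, TimeoutError):
            pass  # Python moves on; the worker may or may not have stopped.
        # This is the point where a process group would be reconfigured.
        still_running = not done.wait(grace_s)
    return still_running


if __name__ == "__main__":
    # Mismatched timeouts: Python gives up early, the collective keeps going.
    print(underlying_still_running(python_timeout_s=0.05, collective_timeout_s=10.0))
    # Same timeout passed down: the collective aborts at about the same time.
    print(underlying_still_running(python_timeout_s=0.05, collective_timeout_s=0.05))
```

With mismatched timeouts the Python caller continues while the simulated collective is still running, which is the state that blocks destroying the process group. When the same timeout is passed down, the underlying operation stops shortly after the Python-side wait does.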

tushar00jain, Jun 03 '25 19:06