torchft
Pass timeout on python futures to collective libraries
If the timeout on a Python future differs from the timeout used by the collective library (e.g. gloo), the Python code is allowed to continue, because from its perspective the future has completed, while the underlying C++ future is not yet done. This is particularly problematic when we try to reconfigure process groups: if the underlying futures aren't done yet, we have to wait for them to complete before we can destroy the current process group.
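The mismatch can be reproduced without any distributed setup. The following is a minimal sketch using only the Python standard library, not torchft or gloo code: `fake_collective` and `in_flight_after_wait` are hypothetical names, a worker thread stands in for the backend collective, and the small grace period on the Python side in the second call is an assumption made so the backend reports its own timeout first.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


def fake_collective(duration: float, backend_timeout: float) -> str:
    """Hypothetical stand-in for a backend collective (e.g. a gloo op).

    Runs for `duration` seconds, but aborts once `backend_timeout` elapses.
    """
    start = time.monotonic()
    while time.monotonic() - start < duration:
        if time.monotonic() - start >= backend_timeout:
            raise TimeoutError("backend collective timed out")
        time.sleep(0.01)
    return "done"


def in_flight_after_wait(python_wait: float, backend_timeout: float,
                         duration: float = 0.3) -> bool:
    """Return True if backend work is still running when Python stops waiting."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(fake_collective, duration, backend_timeout)
        try:
            fut.result(timeout=python_wait)
        except (FutureTimeoutError, TimeoutError):
            # Either the Python-side wait expired or the backend raised.
            pass
        in_flight = not fut.done()
        # Leaving the `with` block joins the worker thread, which mirrors
        # having to wait for outstanding work before tearing things down.
    return in_flight


# Python gives up after 0.05 s, but the backend was told it has 1.0 s.
mismatched = in_flight_after_wait(python_wait=0.05, backend_timeout=1.0)

# The same 0.05 s budget is passed to the backend; Python waits slightly
# longer (assumed grace period) so the backend can fail first.
matched = in_flight_after_wait(python_wait=0.05 + 0.2, backend_timeout=0.05)

print("work still in flight (mismatched timeouts):", mismatched)
print("work still in flight (timeout passed down):", matched)
```

In the first call the caller moves on while the worker is still running, so teardown has to block until it finishes. In the second call the backend hits the same deadline and stops on its own, so nothing is left in flight when the caller proceeds.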