Add retry to shuffle broadcast
One of our P2P stress tests is failing fairly regularly: distributed.tests.test_stress.test_close_connections
This failure is somewhat expected because the broadcast is connecting to all workers and the connection attempts may time out if the workers are too busy.
This is not an ideal fix; removing the broadcast altogether would be the better solution.
I opted not to implement a retry in broadcast itself since this would have more serious implications, and the broadcast API provides everything users need to implement this themselves.
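For illustration, a caller-side retry could look roughly like the sketch below. This is not the code in this PR: `retry_async`, its parameters, and the backoff values are hypothetical, and it assumes that a timed-out connection attempt surfaces as an `OSError` in the caller.

```python
import asyncio


async def retry_async(func, *args, attempts=3, base_delay=0.1,
                      retry_on=(OSError,), **kwargs):
    """Await ``func(*args, **kwargs)``, retrying on the given exceptions.

    Hypothetical helper: retries up to ``attempts`` times in total and
    sleeps with exponential backoff between attempts.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await func(*args, **kwargs)
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Back off before trying again: base_delay, 2x, 4x, ...
            await asyncio.sleep(base_delay * 2**attempt)
```

The shuffle code would then wrap its broadcast call in something like `await retry_async(do_broadcast, msg)`, where `do_broadcast` stands in for whatever coroutine performs the broadcast. Keeping the retry on the caller's side leaves the behavior of broadcast unchanged for everyone else.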
I'm still missing a test but have confirmed the fix with some manual patches. @hendrikmakait, if you have an idea for how to put together an easy test, that would be helpful.