Add retry to shuffle broadcast

Open fjetter opened this issue 4 months ago • 3 comments

One of our P2P stress tests, distributed.tests.test_stress.test_close_connections, is failing pretty regularly.

This failure is somewhat expected because the broadcast is connecting to all workers and the connection attempts may time out if the workers are too busy.

Adding a retry is not an ideal fix (removing the broadcast altogether would be terrific instead).

I opted not to implement a retry in broadcast itself, since this would have more serious implications, and the broadcast API provides everything users need to implement this themselves.
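For illustration, a caller-side retry could look roughly like the sketch below. This is not the actual patch: `retry_async`, its parameters, and the `FakeBroadcast` stand-in are hypothetical names, and it assumes that a failed connection attempt surfaces as an `OSError` (or a subclass such as `TimeoutError`).

```python
import asyncio
import random


async def retry_async(func, *, attempts=3, base_delay=0.1, max_delay=2.0,
                      retry_on=(OSError,)):
    """Await the coroutine function `func` until it succeeds or attempts run out.

    Hypothetical helper: retries only on the exception types in `retry_on`
    and sleeps with capped exponential backoff plus jitter between attempts.
    """
    for attempt in range(attempts):
        try:
            return await func()
        except retry_on:
            if attempt == attempts - 1:
                # Out of attempts: let the last error propagate to the caller.
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter avoids all callers reconnecting at the same instant.
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))


class FakeBroadcast:
    """Hypothetical stand-in for a broadcast that times out a few times."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    async def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            # Assumption: busy workers show up as a connection timeout.
            raise OSError("Timed out trying to connect")
        return {"worker-1": "ok", "worker-2": "ok"}
```

In the shuffle code, the callable passed to `retry_async` would wrap the real broadcast call instead of `FakeBroadcast`. Keeping the retry at the call site leaves the behavior of broadcast itself unchanged for every other caller.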

I'm still missing a test, but I confirmed the fix with some manual patches. @hendrikmakait, if you have an idea for how to put together an easy test, that would be helpful.

Oct 18 '24 10:10 fjetter
