pytorch-operator Why worker has init container wait for master ready?

Open jiaqianjing opened this issue 4 years ago • 3 comments

Why not just set a large timeout in torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(0, 1800), world_size=-1, rank=-1, store=None, group_name='')? What is the purpose of adding the init container?

Jun 11 '20 09:06 jiaqianjing

Issue-Label Bot is automatically applying the labels:

Label	Probability
kind/question	0.69

Jun 11 '20 09:06 issue-label-bot[bot]

torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(0, 1800), world_size=-1, rank=-1, store=None, group_name='')

I think it is a user-level config. We cannot rely on it at the system level.
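
To illustrate the distinction, here is a minimal sketch of what a system-level wait could look like, as opposed to a timeout chosen inside the user's training script. It is not the operator's actual init container: it assumes the wait amounts to polling until the master's hostname resolves, and the function name `wait_for_master` and all of its parameters are hypothetical.

```python
import socket
import time
from typing import Callable


def wait_for_master(
    host: str,
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    resolve: Callable[[str], object] = socket.gethostbyname,  # injectable for testing
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until `host` resolves or `timeout_s` elapses (hypothetical helper).

    Returns True once the hostname resolves, False if the deadline passes first.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            resolve(host)
            return True
        except OSError:  # socket.gaierror is a subclass of OSError
            if clock() >= deadline:
                return False
            sleep(interval_s)
```

Because a check like this runs before the training container starts, it does not depend on what timeout the user passes to init_process_group.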

Jun 11 '20 10:06 gaocegege

I think so, but that reason seems a little weak. Are there any other considerations?

Jun 11 '20 11:06 jiaqianjing
