pytorch-operator

Why does the worker have an init container that waits for the master to be ready?

Open jiaqianjing opened this issue 4 years ago • 3 comments

[image] Why not just set a large timeout in torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(0, 1800), world_size=-1, rank=-1, store=None, group_name='')? What is the purpose of adding the init container?
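For context, the user-level alternative described in the question could look roughly like the sketch below. It assumes the env:// init method (rank, world size, and master address read from environment variables), the gloo backend, and a two-hour timeout; all three are illustrative choices, and `build_init_kwargs` and `init_distributed` are hypothetical helper names.

```python
import datetime


def build_init_kwargs(timeout_hours: float = 2.0, backend: str = "gloo") -> dict:
    """Build keyword arguments for torch.distributed.init_process_group.

    The two-hour default and the gloo backend are assumptions made for
    illustration; the library default timeout shown in the signature
    above is 1800 seconds.
    """
    return {
        "backend": backend,
        # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
        # from the process environment.
        "init_method": "env://",
        "timeout": datetime.timedelta(hours=timeout_hours),
    }


def init_distributed() -> None:
    """Join the process group with the enlarged timeout."""
    # Imported lazily so build_init_kwargs works without PyTorch installed.
    import torch.distributed as dist

    dist.init_process_group(**build_init_kwargs())
```

A training script would call `init_distributed()` once at startup. Note that this lives entirely in user code, so it only helps if every training script opts in.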

jiaqianjing avatar Jun 11 '20 09:06 jiaqianjing

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/question 0.69

issue-label-bot[bot] avatar Jun 11 '20 09:06 issue-label-bot[bot]

torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(0, 1800), world_size=-1, rank=-1, store=None, group_name='')

I think that timeout is a user-level setting, configured in the training script. The operator cannot rely on it at the system level.
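To illustrate the system-level point: a wait step that runs before the training process starts works regardless of what timeout the training script passes. The following is a minimal Python sketch of such a wait loop, assuming the readiness check is DNS resolution of the master's hostname. The function name, the injectable `resolve` and `sleep` parameters, and the retry interval are hypothetical, and this is not the operator's actual init container command.

```python
import socket
import time
from typing import Callable, Optional


def wait_for_master(
    host: str,
    interval_s: float = 2.0,
    max_attempts: Optional[int] = None,
    resolve: Callable[[str], object] = socket.gethostbyname,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until `host` resolves and return the number of attempts made.

    With max_attempts=None the loop retries forever, which is how a
    wait step would avoid depending on any user-chosen timeout.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            resolve(host)
            return attempts
        except OSError:  # socket.gaierror is a subclass of OSError
            if max_attempts is not None and attempts >= max_attempts:
                raise TimeoutError(
                    f"{host} did not resolve after {attempts} attempts"
                )
            sleep(interval_s)
```

A wait step could call `wait_for_master(os.environ["MASTER_ADDR"])` before handing over to the training container, assuming the master address is exposed through that environment variable.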

gaocegege avatar Jun 11 '20 10:06 gaocegege

I agree, but that reason seems a little weak. Are there any other considerations?

jiaqianjing avatar Jun 11 '20 11:06 jiaqianjing