pytorch-operator
Why does the worker have an init container that waits for the master to be ready?
Why not just set a large timeout in

torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(0, 1800), world_size=-1, rank=-1, store=None, group_name='')

instead? What is the purpose of adding the init container?
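For reference, a minimal sketch of what the suggestion would look like, assuming the env:// rendezvous the operator normally configures (the backend choice and the two-hour value here are arbitrary, and at the time the timeout argument was documented as applying only to the gloo backend):

```python
import datetime
import os

import torch.distributed as dist

# Sketch of the suggestion: raise the rendezvous timeout well above the
# 30-minute default so workers simply out-wait a slow master.
dist.init_process_group(
    backend="gloo",
    init_method="env://",
    timeout=datetime.timedelta(hours=2),  # instead of timedelta(0, 1800)
    world_size=int(os.environ.get("WORLD_SIZE", "1")),
    rank=int(os.environ.get("RANK", "0")),
)
```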
> torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(0, 1800), world_size=-1, rank=-1, store=None, group_name='')
I think it is a user-level config. We cannot rely on it at the system level.
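To illustrate the system-level alternative: the operator gates each worker on the master being reachable before the training container even starts, so init_process_group never races against a missing master. A rough Python equivalent of that readiness gate (the actual init container, as far as I know, is a shell loop doing a DNS lookup; wait_for_master and its parameters are hypothetical):

```python
import socket
import time

def wait_for_master(host: str, port: int, interval: float = 2.0) -> None:
    # Block until the master's DNS name resolves and its port accepts a
    # TCP connection, roughly what the worker's init container waits for.
    while True:
        try:
            with socket.create_connection((host, port), timeout=5):
                return
        except OSError:
            print(f"waiting for master at {host}:{port} ...")
            time.sleep(interval)
```

A worker would run this with MASTER_ADDR and MASTER_PORT before calling init_process_group; the point is that the wait happens outside user code, where the operator controls it.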
I think so, but that argument looks a little weak. Are there any other considerations?