transformers
Include `timeout` attribute (related to DDP) in `TrainingArguments`
Feature request
Would it be possible to include a `timeout` attribute in the `TrainingArguments` dataclass, such that it is passed as an argument to the `torch.distributed.init_process_group` call?

Reference: https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
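For reference, a minimal illustration of the PyTorch call in question; the backend choice and rendezvous setup are only assumptions for the sketch:

```python
from datetime import timedelta

import torch.distributed as dist

# init_process_group accepts a `timeout` argument; PyTorch's default is 30 minutes.
# Assumes the usual env:// rendezvous variables are set (e.g. by torchrun).
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),
)
```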
Motivation
Essentially, if a process uses DDP and performs a set of operations prior to using the GPUs, such as tokenization/mapping, for more than `timeout` seconds (which defaults to 30 minutes), the process stops and gets killed due to a Socket Timeout (see issue #17106).
By adding a `timeout` argument to the `TrainingArguments` class, we would let users override the default timeout defined by PyTorch and hopefully prevent Socket Timeouts when mapping large datasets.
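A minimal sketch of how such an attribute could be wired through; the field name `ddp_timeout` and the helper `init_distributed` are hypothetical, not the actual `TrainingArguments` implementation:

```python
from dataclasses import dataclass, field
from datetime import timedelta

import torch.distributed as dist


@dataclass
class MyTrainingArguments:
    # Hypothetical field name and default; the real TrainingArguments has many more fields.
    ddp_timeout: int = field(
        default=1800,  # seconds; matches PyTorch's 30-minute default
        metadata={"help": "Timeout (in seconds) forwarded to torch.distributed.init_process_group."},
    )


def init_distributed(args: MyTrainingArguments) -> None:
    # Forward the user-configurable timeout instead of relying on PyTorch's default.
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(seconds=args.ddp_timeout),
    )
```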
Your contribution
I could definitely submit a PR; it seems pretty straightforward to add a new attribute to the `TrainingArguments` class.
Hi @gugarosa
I have started work on it and will create a PR for it by tomorrow.
That's amazing @dvlshah! Thank you so much for doing it!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.