transformers
Include `timeout` attribute (related to DDP) in `TrainingArguments`
Feature request
Would it be possible to include a `timeout` attribute in the `TrainingArguments` dataclass, such that it is passed as an argument to the `torch.distributed.init_process_group` call?

Reference: https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
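For reference, a minimal illustration of the PyTorch call in question; the backend choice and rendezvous setup are only assumptions for the sketch:

```python
from datetime import timedelta

import torch.distributed as dist

# init_process_group accepts a `timeout` argument; PyTorch's default is 30 minutes.
# Assumes the usual env:// rendezvous variables are set (e.g. by torchrun).
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),
)
```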
Motivation
Essentially, if a process uses DDP and performs a set of operations prior to using the GPUs, such as tokenization/mapping, for more than `timeout` seconds (which defaults to 30 minutes), the process stops and gets killed due to a Socket Timeout (see issue #17106).
By adding a `timeout` argument to the `TrainingArguments` class, we would let users override the default timeout defined by PyTorch and hopefully prevent Socket Timeouts when mapping large datasets.
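A minimal sketch of how such an attribute could be wired through; the field name `ddp_timeout` and the helper `init_distributed` are hypothetical, not the actual `TrainingArguments` implementation:

```python
from dataclasses import dataclass, field
from datetime import timedelta

import torch.distributed as dist


@dataclass
class MyTrainingArguments:
    # Hypothetical field name and default; the real TrainingArguments has many more fields.
    ddp_timeout: int = field(
        default=1800,  # seconds; matches PyTorch's 30-minute default
        metadata={"help": "Timeout (in seconds) forwarded to torch.distributed.init_process_group."},
    )


def init_distributed(args: MyTrainingArguments) -> None:
    # Forward the user-configurable timeout instead of relying on PyTorch's default.
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(seconds=args.ddp_timeout),
    )
```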
Your contribution
I could definitely submit a PR; it seems pretty straightforward to add a new attribute to the `TrainingArguments` class.
Hi @gugarosa
I have started work on it and will create a PR for it by tomorrow.
That's amazing @dvlshah! Thank you so much for doing it!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.