
Include a `timeout` attribute (related to DDP) in `TrainingArguments`

gugarosa opened this issue 1 year ago · 3 comments

Feature request

Would it be possible to include a `timeout` attribute in the `TrainingArguments` dataclass, such that it is passed as an argument to the `torch.distributed.init_process_group` call?

Reference: https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
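For illustration, here is a minimal sketch of what this could look like. The field name `ddp_timeout` and the `setup_distributed` helper are assumptions for this sketch, not an existing API; the only grounded part is that `torch.distributed.init_process_group` accepts a `timeout` argument as a `datetime.timedelta`:

```python
from dataclasses import dataclass, field
from datetime import timedelta

import torch.distributed as dist


@dataclass
class TrainingArguments:
    # ... existing fields elided ...
    # Proposed (hypothetical name): a timeout in seconds, forwarded to
    # torch.distributed.init_process_group. PyTorch's default is 30 minutes.
    ddp_timeout: int = field(
        default=1800,
        metadata={"help": "Timeout (seconds) passed to torch.distributed.init_process_group."},
    )


def setup_distributed(args: TrainingArguments, backend: str = "nccl") -> None:
    # Forward the user-configured timeout instead of relying on PyTorch's default.
    dist.init_process_group(backend=backend, timeout=timedelta(seconds=args.ddp_timeout))
```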

Motivation

Essentially, if a process uses DDP and performs a set of operations prior to using the GPUs, such as tokenization/mapping, for longer than `timeout` seconds (which defaults to 30 minutes), the process stops and gets killed due to a socket timeout (see issue #17106).

Adding a `timeout` argument to the `TrainingArguments` class would let users override the default timeout defined by PyTorch and would hopefully provide a way to prevent socket timeouts when mapping large datasets.
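As a hypothetical usage example (again assuming the `ddp_timeout` field name from the sketch above), a user preprocessing a large dataset could raise the limit to, say, two hours:

```python
from transformers import TrainingArguments

# Hypothetical usage: raise the DDP timeout so a long tokenization/mapping
# phase on one rank does not trip PyTorch's default 30-minute socket timeout.
# `ddp_timeout` is the assumed field name from the sketch above.
args = TrainingArguments(
    output_dir="output",
    ddp_timeout=2 * 60 * 60,  # seconds
)
```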

Your contribution

I could definitely submit a PR; it seems pretty straightforward to add a new attribute to the `TrainingArguments` class.

gugarosa · Jul 07 '22 12:07

Hi @gugarosa

I have started work on it and will create a PR by tomorrow.

dvlshah · Jul 07 '22 17:07

That's amazing @dvlshah! Thank you so much for doing it!

gugarosa · Jul 07 '22 20:07

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Aug 06 '22 15:08