Usage of composer.utils.dist.get_node_signal_file_name()

Open Andrew-Wyn opened this issue 7 months ago • 3 comments

I encountered a bug when using composer.utils.dist.get_node_signal_file_name.

Setup

  • llm-foundry==release/v0.17.1

When I run the training script on a single node, training starts without problems. With a multi-node configuration, the following error is raised:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/leonardo/home/userexternal/lmoroni0/__Work/llm-foundry/scripts/train/train.py", line 9, in <module>
[rank0]:     train_from_yaml(yaml_path, args_list)
[rank0]:   File "/leonardo/home/userexternal/lmoroni0/__Work/llm-foundry/llmfoundry/command_utils/train.py", line 662, in train_from_yaml
[rank0]:     return train(yaml_cfg)
[rank0]:            ^^^^^^^^^^^^^^^
[rank0]:   File "/leonardo/home/userexternal/lmoroni0/__Work/llm-foundry/llmfoundry/command_utils/train.py", line 366, in train
[rank0]:     tokenizer = build_tokenizer(tokenizer_name, tokenizer_kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/leonardo/home/userexternal/lmoroni0/__Work/llm-foundry/llmfoundry/utils/builders.py", line 545, in build_tokenizer
[rank0]:     os.remove(signal_file_path)
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: '._signal_file_node0_KfmwNg'

After some debugging, I noticed that get_node_signal_file_name returns the same name on every node. This results in a race condition, since all nodes use the same file to coordinate: once one node has removed it, os.remove on another node fails with the FileNotFoundError shown above.
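The collision described above can be reproduced in a simplified, single-process sketch. It assumes that both nodes see the same working directory (for example, a shared filesystem); the function names `cleanup_signal_file` and `simulate` are hypothetical and are not part of composer or llm-foundry.

```python
import os
import tempfile


def cleanup_signal_file(path: str) -> bool:
    """Remove the signal file; return False if it was already gone."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False


def simulate(names_by_node: dict, shared_dir: str) -> dict:
    """Simulate each node's local rank 0 writing, then removing, its signal file.

    names_by_node maps a node rank to the signal file name that node uses.
    Returns, per node, whether the removal found the file.
    """
    # Step 1: every node writes its signal file into the shared directory.
    for name in names_by_node.values():
        with open(os.path.join(shared_dir, name), "wb") as f:
            f.write(b"local_rank0_completed")

    # Step 2: every node removes its signal file, one after another.
    return {
        node: cleanup_signal_file(os.path.join(shared_dir, name))
        for node, name in names_by_node.items()
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:
        # Same name on both nodes: the second removal does not find the file.
        print(simulate({0: "._signal_file_node0_x", 1: "._signal_file_node0_x"}, shared))
    with tempfile.TemporaryDirectory() as shared:
        # Distinct names per node: both removals succeed.
        print(simulate({0: ".node_0_done", 1: ".node_1_done"}, shared))
```

In the real training run the removal is not guarded, so the node that loses the race raises instead of returning False.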

I worked around the error by going back to the earlier naming scheme, which includes the node rank:

llmfoundry/utils/builders.py, line 497:

 f'.node_{dist.get_node_rank()}_local_rank0_completed_tokenizer_setup'
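As a minimal sketch of this naming scheme, the helper below builds the file name from an explicitly passed node rank so that it runs without composer installed. The helper name `node_signal_file_name` and its `tag` parameter are hypothetical; in llm-foundry the rank would come from `dist.get_node_rank()`, as in the line above.

```python
def node_signal_file_name(
    node_rank: int,
    tag: str = "completed_tokenizer_setup",
) -> str:
    """Return a signal file name that is unique per node.

    The node rank is computed locally on each node, so no value shared
    across nodes ends up in the name.
    """
    return f".node_{node_rank}_local_rank0_{tag}"


if __name__ == "__main__":
    # Two nodes get two different file names, so neither removes the other's file.
    for rank in (0, 1):
        print(node_signal_file_name(rank))
```

Because each node derives the name from its own rank, the names differ across nodes even when they share a working directory.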

I think the problem originates in the composer library. If this workaround is sound, I can open a pull request.

Andrew-Wyn, Mar 07 '25 09:03