Problems running multi-gpu punctuation capitalization training
Describe the bug
Running multi-GPU training without a pre-prepared cache crashes during initialization with the following trace:
Traceback (most recent call last):
File "examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py", line 155, in <module>
main()
File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
_run_hydra(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
run_and_report(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
raise ex
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
lambda: hydra.run(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
_ = ret.return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py", line 116, in main
model = PunctuationCapitalizationModel(cfg.model, trainer=trainer)
File "/workspace/nemo/nemo/collections/nlp/models/token_classification/punctuation_capitalization_model.py", line 101, in __init__
super().__init__(cfg=cfg, trainer=trainer)
File "/workspace/nemo/nemo/collections/nlp/models/nlp_model.py", line 98, in __init__
super().__init__(cfg, trainer)
File "/workspace/nemo/nemo/core/classes/modelPT.py", line 138, in __init__
self.setup_training_data(self._cfg.train_ds)
File "/workspace/nemo/nemo/collections/nlp/models/token_classification/punctuation_capitalization_model.py", line 469, in setup_training_data
self._train_dl = self._setup_dataloader_from_config(cfg=train_data_config, train=True)
File "/workspace/nemo/nemo/collections/nlp/models/token_classification/punctuation_capitalization_model.py", line 773, in _setup_dataloader_from_config
dataset = BertPunctuationCapitalizationDataset(
File "/workspace/nemo/nemo/collections/nlp/data/token_classification/punctuation_capitalization_dataset.py", line 993, in __init__
features = pickle.load(self.features_pkl.open('rb'))
File "/opt/conda/lib/python3.8/pathlib.py", line 1222, in open
return io.open(self, mode, buffering, encoding, errors, newline,
File "/opt/conda/lib/python3.8/pathlib.py", line 1078, in _opener
return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/punct_v2/cached.text_train.BertTokenizer.max_seq_length512.vocab119547.all_samples.punctuation_capitalization.pkl'
The issue arises because the order in which the individual processes start does not guarantee that the process with global_rank()=0 starts first. There is a guard
https://github.com/NVIDIA/NeMo/blob/f9d45db36afae8d75aecc27175a854d289bffd84/nemo/collections/nlp/data/token_classification/punctuation_capitalization_dataset.py#L984-L985
that should block the other processes, but interestingly torch.distributed.is_initialized() returns False, so the guard is skipped.
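For illustration, the intended flow looks roughly like the minimal sketch below (load_or_build_features, build_features, and is_global_rank_zero are assumed names, not the actual NeMo symbols). Because the dataset is constructed before the process group exists, the barrier branch is never taken and non-zero ranks race straight to the pickle load:

```python
import pickle
import torch.distributed as dist

def load_or_build_features(features_pkl, build_features, is_global_rank_zero):
    # Intended flow: only global rank 0 tokenizes the data and writes the cache.
    if is_global_rank_zero:
        features = build_features()
        with features_pkl.open('wb') as f:
            pickle.dump(features, f)
    # The guard: non-zero ranks are meant to block here until rank 0 is done.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
    # In the reported crash the dataset is constructed before the process group
    # exists, so is_initialized() is False, the barrier is skipped, and non-zero
    # ranks reach this load before the file is written -> FileNotFoundError.
    with features_pkl.open('rb') as f:
        return pickle.load(f)
```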
Adding the parameter model.train_ds.use_cache=false does not help, since the cache is loaded irrespective of the value of this parameter: https://github.com/NVIDIA/NeMo/blob/f9d45db36afae8d75aecc27175a854d289bffd84/nemo/collections/nlp/data/token_classification/punctuation_capitalization_dataset.py#L987-L988
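For comparison, one would expect the flag to gate the load roughly like this (a hedged sketch; get_features and build_features are hypothetical helpers, and the real code paths differ):

```python
import pickle
from pathlib import Path

def get_features(features_pkl: Path, use_cache: bool, build_features):
    # Expected behaviour: only read the pickle when caching is enabled and the file exists.
    if use_cache and features_pkl.exists():
        with features_pkl.open('rb') as f:
            return pickle.load(f)
    # Otherwise re-tokenize from scratch (build_features is a hypothetical helper).
    return build_features()
```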
Steps/Code to reproduce bug
Train using multiple GPUs without a pre-prepared cache.
Expected behavior
All processes except the one with global_rank()=0 should wait for the master to finish preparing the cache, and training should start afterwards.
Environment overview (please complete the following information)
- Environment location: SLURM
- Method of NeMo install: pytorch:22.04-py3 container with nemo:1.8.2 (./reinstall.sh without numba update).
@ekmb could you look at this or assign it to someone who can?
@PeganovAnton could you please take a look?
Related to https://nvbugswb.nvidia.com/NvBugs5/SWBug.aspx?bugid=3570701&cmtNo=
@yzhang123, sorry for being late, I missed the notification. I failed to reproduce this in multi-GPU mode, yet it is a known issue in multi-node mode: https://nvbugswb.nvidia.com/NvBugs5/SWBug.aspx?bugid=3570701&cmtNo= . @yzhang123 could you please provide the exact settings for reproducing the bug?
There is a workaround https://github.com/NVIDIA/NeMo/tree/workaround_p_and_c_no_caching in which all processes create features. It is not really efficient, but Mayank Jain confirmed that it works. I have just merged main into the workaround branch.
I failed to fix the multi-node problem (torch.distributed.is_initialized() == False), and @ekmb decided that she will research it.
https://github.com/NVIDIA/NeMo/pull/4410
Based on @PeganovAnton's suggestion, I'm copying from https://github.com/NVIDIA/NeMo/pull/4410#discussion_r904058505, just to have it as a future reference, especially since supposedly "multiprocessing in P&C is prone to hangups".
In my experience there is no guarantee that master_device (i.e. global_rank==0) will be the first to start. In case it starts last, all processes will basically already have built their own cache, and the master_device will store it even though no one actually needs it.
Just a thought: could a solution be to distribute the preparation of the cache over all processes? E.g. individual processes working on global_rank-related chunks (e.g. each process handling every world_size-th line, offset by its global_rank). When all files are ready (i.e. world_size of them), join them into a single cache, then reload. This obviously assumes shared storage.
Reviewing the current code, I don't see any real issues; however, the global_rank and world_size will need to be retrieved from environment variables, as torch.distributed is not yet initialised.
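A rough sketch of that idea, under the stated assumptions: shared storage, ranks taken from environment variables because torch.distributed is not yet initialized (the variable names depend on the launcher, e.g. RANK/WORLD_SIZE under torchrun or SLURM_PROCID/SLURM_NTASKS under plain SLURM), and encode_line as a hypothetical per-line tokenizer. This is not the actual NeMo code.

```python
import os
import pickle
import time
from pathlib import Path

def build_cache_distributed(text_file: Path, cache_dir: Path, encode_line, poll_s: float = 5.0):
    # The process group is not up yet, so ranks come from the environment.
    rank = int(os.environ.get("RANK", os.environ.get("SLURM_PROCID", "0")))
    world_size = int(os.environ.get("WORLD_SIZE", os.environ.get("SLURM_NTASKS", "1")))

    # Each process encodes only its own stripe of lines (index % world_size == rank)
    # and writes a partial cache to shared storage.
    with text_file.open() as f:
        my_part = [encode_line(line) for i, line in enumerate(f) if i % world_size == rank]
    part_path = cache_dir / f"cache.part{rank}.pkl"
    tmp_path = part_path.with_suffix(".tmp")
    with tmp_path.open("wb") as f:
        pickle.dump(my_part, f)
    tmp_path.rename(part_path)  # rename last so a partial file is never read half-written

    # Wait until all world_size partial files exist, then join them.
    part_paths = [cache_dir / f"cache.part{r}.pkl" for r in range(world_size)]
    while not all(p.exists() for p in part_paths):
        time.sleep(poll_s)
    parts = []
    for p in part_paths:
        with p.open("rb") as f:
            parts.append(pickle.load(f))

    # Re-interleave so the joined features follow the original line order.
    features = []
    for i in range(max(len(part) for part in parts)):
        for part in parts:
            if i < len(part):
                features.append(part[i])
    return features
```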
Sorry, but we can't do busy waiting as your PR requires. It has a measurable impact on training across all domains. Please think of another solution.
@titu1994, sorry, but can you be more specific? Do you refer a) to this issue and my last comment regarding distributed cache building, b) to PR #4544, which uses busy waiting in exp_manager.check_resume, or c) to PR #4410 (which, btw., has already been merged into main and does busy waiting instead of torch.distributed.barrier, see https://github.com/NVIDIA/NeMo/blob/6442e339a47d30a106d869d1ef29cc1294753b75/nemo/collections/nlp/data/token_classification/punctuation_capitalization_dataset.py#L993-L995)?
They solve different things, but they are related through the issue that torch.distributed is not yet initialized, which is why torch.distributed.barrier cannot be used.
I do not have access to https://nvbugswb.nvidia.com/NvBugs5/SWBug.aspx?bugid=3570701&cmtNo=, so I cannot comment on that. One of the previous comments mentioned that @ekmb will research why torch.distributed is not initialised.
Note also that, to my knowledge, all of these are called only once per training (but before torch.distributed is initialized): the distributed cache build suggested in this issue and #4410 run only on data loader initialisation, and #4544 only on exp_manager initialisation when resuming training. Hence I believe none of them should have a measurable impact on training.
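For reference, the kind of one-shot file wait being discussed here (in place of torch.distributed.barrier) is roughly the following sketch; the function name, poll interval, and timeout are assumptions, not the code actually merged in #4410:

```python
import time
from pathlib import Path

def wait_for_cache(features_pkl: Path, poll_s: float = 5.0, timeout_s: float = 3600.0) -> None:
    """Block until another process (rank 0) has finished writing the cached-features pickle."""
    waited = 0.0
    while not features_pkl.exists():
        if waited >= timeout_s:
            raise TimeoutError(f"Gave up waiting for {features_pkl} after {timeout_s} s")
        time.sleep(poll_s)
        waited += poll_s
```

On the writer side, a temp-file-plus-rename convention would avoid the reader seeing a half-written pickle when the file first appears.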
#4410 is localized to one model, and it's up to that model's owner to choose busy waiting. However, #4544 affects core, effectively all NeMo models, and we will not accept busy waiting there.
OK; understood. I may have found a better solution. I'll comment under the appropriate PR.
Closing due to inactivity. Please reopen if not resolved.