Problems running multi-gpu punctuation capitalization training

Open itzsimpl opened this issue 2 years ago • 10 comments

Describe the bug

Running multi-GPU training without a pre-prepared cache crashes upon initialization with the following trace:

Traceback (most recent call last):
  File "examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py", line 155, in <module>
    main()
  File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py", line 116, in main
    model = PunctuationCapitalizationModel(cfg.model, trainer=trainer)
  File "/workspace/nemo/nemo/collections/nlp/models/token_classification/punctuation_capitalization_model.py", line 101, in __init__
    super().__init__(cfg=cfg, trainer=trainer)
  File "/workspace/nemo/nemo/collections/nlp/models/nlp_model.py", line 98, in __init__
    super().__init__(cfg, trainer)
  File "/workspace/nemo/nemo/core/classes/modelPT.py", line 138, in __init__
    self.setup_training_data(self._cfg.train_ds)
  File "/workspace/nemo/nemo/collections/nlp/models/token_classification/punctuation_capitalization_model.py", line 469, in setup_training_data
    self._train_dl = self._setup_dataloader_from_config(cfg=train_data_config, train=True)
  File "/workspace/nemo/nemo/collections/nlp/models/token_classification/punctuation_capitalization_model.py", line 773, in _setup_dataloader_from_config
    dataset = BertPunctuationCapitalizationDataset(
  File "/workspace/nemo/nemo/collections/nlp/data/token_classification/punctuation_capitalization_dataset.py", line 993, in __init__
    features = pickle.load(self.features_pkl.open('rb'))
  File "/opt/conda/lib/python3.8/pathlib.py", line 1222, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/opt/conda/lib/python3.8/pathlib.py", line 1078, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/punct_v2/cached.text_train.BertTokenizer.max_seq_length512.vocab119547.all_samples.punctuation_capitalization.pkl'

The issue arises because the order in which the individual processes start does not guarantee that the process with global_rank()=0 will start first. There seems to be a guard https://github.com/NVIDIA/NeMo/blob/f9d45db36afae8d75aecc27175a854d289bffd84/nemo/collections/nlp/data/token_classification/punctuation_capitalization_dataset.py#L984-L985 that should block the other processes, but interestingly torch.distributed.is_initialized() returns False, so the guard is skipped.

Adding the parameter model.train_ds.use_cache=false does not help, since the cache is loaded irrespective of the value of this parameter: https://github.com/NVIDIA/NeMo/blob/f9d45db36afae8d75aecc27175a854d289bffd84/nemo/collections/nlp/data/token_classification/punctuation_capitalization_dataset.py#L987-L988
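
For illustration, a simplified sketch of the flow described above (function and argument names are hypothetical, not the actual NeMo code): non-master ranks are supposed to wait at a barrier, but the barrier is only issued when torch.distributed is already initialized, so when it is not, they fall straight through to loading a cache file that rank 0 may not have written yet.

import pickle
import torch
from pathlib import Path

def load_or_build_features(features_pkl: Path, is_master: bool, build_features):
    if is_master and not features_pkl.is_file():
        # Rank 0 builds the features and writes the cache.
        features = build_features()
        with features_pkl.open('wb') as f:
            pickle.dump(features, f)
    else:
        # Guard that should make the other ranks wait for rank 0.
        # When torch.distributed is not yet initialized, is_initialized()
        # returns False, the barrier is skipped, and the load below can run
        # before rank 0 has written the cache -> FileNotFoundError.
        if torch.distributed.is_initialized():
            torch.distributed.barrier()
        with features_pkl.open('rb') as f:
            features = pickle.load(f)
    return features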

Steps/Code to reproduce bug

Train using multiple GPUs without a pre-prepared cache.

Expected behavior

All processes except the one with global_rank()=0 should wait for the master to finish preparing the cache, and training should start afterwards.

Environment overview (please complete the following information)

  • Environment location: SLURM
  • Method of NeMo install: pytorch:22.04-py3 container with nemo:1.8.2 (./reinstall.sh without numba update).

itzsimpl avatar Jun 01 '22 12:06 itzsimpl

@ekmb could you look at this or assign it to someone who can?

ericharper avatar Jun 01 '22 17:06 ericharper

@PeganovAnton could you please take a look?

yzhang123 avatar Jun 01 '22 18:06 yzhang123

Related to https://nvbugswb.nvidia.com/NvBugs5/SWBug.aspx?bugid=3570701&cmtNo=

PeganovAnton avatar Jun 07 '22 07:06 PeganovAnton

@yzhang123, sorry for being late, I missed the notification. I failed to reproduce this in multi-GPU mode, yet it is a known issue in multi-node mode: https://nvbugswb.nvidia.com/NvBugs5/SWBug.aspx?bugid=3570701&cmtNo= . @yzhang123, could you please provide the exact settings for reproducing the bug?

There is a workaround https://github.com/NVIDIA/NeMo/tree/workaround_p_and_c_no_caching in which all processes create the features. It is not really efficient, but Mayank Jain confirmed that it works. I have just merged main into the workaround branch.

I failed to fix the multi-node (torch.distributed.is_initialized() == False) problem, and @ekmb decided that she will research it.

PeganovAnton avatar Jun 07 '22 08:06 PeganovAnton

https://github.com/NVIDIA/NeMo/pull/4410

ericharper avatar Jun 22 '22 16:06 ericharper

Based on @PeganovAnton's suggestion, I'm copying from https://github.com/NVIDIA/NeMo/pull/4410#discussion_r904058505, just to have it as a future reference, especially since supposedly "multiprocessing in P&C is prone to hangups".

In my experience there is no guarantee that master_device (i.e. global_rank==0) will be the first to start. In case it starts last, all processes will basically have already built their own cache, and the master_device will store it even though nobody actually needs it.

Just a thought: could a solution be to distribute the preparation of the cache over all processes? E.g. individual processes work on global_rank-related chunks (e.g. each process takes every world_size-th line, offset by its global_rank), and when all files are ready (i.e. world_size of them) they are joined into a single cache and reloaded. This obviously assumes shared storage.

Reviewing the current code, I don't see any real issues; however, global_rank and world_size will need to be retrieved from environment variables, as torch.distributed is not yet initialised.
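
A rough sketch of this idea (hypothetical names, not an actual implementation; assumes a shared filesystem and that the launcher exports RANK and WORLD_SIZE, or SLURM equivalents such as SLURM_PROCID / SLURM_NTASKS):

import os
import pickle
import time
from pathlib import Path

def build_cache_distributed(text_file: Path, cache_file: Path, featurize):
    # torch.distributed is not initialised yet, so take rank info from the
    # environment (variable names depend on the launcher).
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    # Each process featurizes every world_size-th line, offset by its rank,
    # and publishes its shard atomically (write to a temp file, then rename).
    with text_file.open() as f:
        my_features = [featurize(line) for i, line in enumerate(f) if i % world_size == rank]
    shard = cache_file.with_suffix(f".rank{rank}.pkl")
    tmp = shard.with_suffix(".tmp")
    with tmp.open("wb") as f:
        pickle.dump(my_features, f)
    tmp.rename(shard)

    if rank == 0:
        # Rank 0 waits for all shards, merges them (ordering details are
        # glossed over here), and publishes the final cache atomically too.
        shards = [cache_file.with_suffix(f".rank{r}.pkl") for r in range(world_size)]
        while not all(s.is_file() for s in shards):
            time.sleep(1)
        merged = []
        for s in shards:
            with s.open("rb") as f:
                merged.extend(pickle.load(f))
        tmp = cache_file.with_suffix(".tmp")
        with tmp.open("wb") as f:
            pickle.dump(merged, f)
        tmp.rename(cache_file)
    else:
        # Other ranks wait until the merged cache appears, then load it.
        while not cache_file.is_file():
            time.sleep(1)
    with cache_file.open("rb") as f:
        return pickle.load(f)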

itzsimpl avatar Jun 26 '22 16:06 itzsimpl

Sorry, but we can't do busy waiting as your PR requires. It has a measurable impact on all domain training. Please think of another solution.

titu1994 avatar Jul 23 '22 19:07 titu1994

@titu1994, sorry, but can you be more specific? I mean, do you a) refer to this issue and my last comment regarding distributed cache building, b) PR #4544, which uses busy waiting in exp_manager.check_resume, or c) PR #4410 (which, btw, has already been merged into main and does busy waiting instead of torch.distributed.barrier, see https://github.com/NVIDIA/NeMo/blob/6442e339a47d30a106d869d1ef29cc1294753b75/nemo/collections/nlp/data/token_classification/punctuation_capitalization_dataset.py#L993-L995)?

They solve different things, but are related through the issue that torch.distributed is not yet initialized, which is why torch.distributed.barrier cannot be used.
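
To make the contrast concrete, a minimal sketch of the two options (illustrative only; the function and its arguments are hypothetical, not NeMo APIs):

import time
import torch.distributed as dist
from pathlib import Path

def sync_cache(cache_file: Path, is_master: bool, build_cache):
    if dist.is_available() and dist.is_initialized():
        # Preferred: a real barrier, possible only once the process group exists.
        if is_master and not cache_file.is_file():
            build_cache(cache_file)
        dist.barrier()
    else:
        # Fallback when torch.distributed is not yet initialized: the master
        # builds the cache while the other ranks poll the filesystem
        # (the "busy waiting" under discussion).
        if is_master:
            if not cache_file.is_file():
                build_cache(cache_file)
        else:
            while not cache_file.is_file():
                time.sleep(5)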

I do not have access to https://nvbugswb.nvidia.com/NvBugs5/SWBug.aspx?bugid=3570701&cmtNo=, so I cannot comment on that. One of the previous comments mentioned that @ekmb will research why torch.distributed is not initialised.

Note also that, to my knowledge, all of these run only once per training (but before torch.distributed is initialized): the distributed cache building from this issue and #4410 only on data loader initialisation, and #4544 only on exp_manager initialisation in case of resumed training. Hence I believe none should have a measurable impact on training.

itzsimpl avatar Jul 24 '22 13:07 itzsimpl

#4410 is localized to one model, and it's up to that model's owner to choose busy waiting. However, #4544 affects core, effectively all NeMo models, and we will not accept busy waiting there.

titu1994 avatar Jul 24 '22 18:07 titu1994

OK; understood. I may have found a better solution. I'll comment under the appropriate PR.

itzsimpl avatar Jul 24 '22 18:07 itzsimpl

Closing due to inactivity. Please reopen if not resolved.

yzhang123 avatar Aug 25 '22 14:08 yzhang123