NeMo
Can't train/finetune a model on two RTX 4090s
Describe the bug
Can't train/fine-tune the TitaNet model using two RTX 4090 GPUs.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[22], line 1
----> 1 trainer.fit(speaker_model)
File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py:532, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
530 self.strategy._lightning_module = model
531 _verify_strategy_supports_compile(model, self.strategy)
--> 532 call._call_and_handle_interrupt(
533 self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
534 )
File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py:42, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
40 try:
41 if trainer.strategy.launcher is not None:
---> 42 return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
43 return trainer_fn(*args, **kwargs)
45 except _TunerExitException:
File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py:101, in _MultiProcessingLauncher.launch(self, function, trainer, *args, **kwargs)
99 self._check_torchdistx_support()
100 if self._start_method in ("fork", "forkserver"):
--> 101 _check_bad_cuda_fork()
103 # The default cluster environment in Lightning chooses a random free port number
104 # This needs to be done in the main process here before starting processes to ensure each rank will connect
105 # through the same port
106 assert self._strategy.cluster_environment is not None
File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/lightning_fabric/strategies/launchers/multiprocessing.py:192, in _check_bad_cuda_fork()
190 if _IS_INTERACTIVE:
191 message += " You will have to restart the Python kernel."
--> 192 raise RuntimeError(message)
RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call `torch.cuda.*` functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.
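For what it's worth, the traceback shows the check firing inside the fork-based multiprocessing launcher, which refuses to start once CUDA has already been initialized in the process. A quick way to confirm that state in the notebook (a diagnostic sketch, not part of the tutorial) is:

```python
# Diagnostic cell (sketch): if this prints True before trainer.fit() is called,
# Lightning's fork-based launcher will refuse to start, producing the
# RuntimeError shown above.
import torch

print(torch.cuda.is_initialized())
```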
Steps/Code to reproduce bug
- Open https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb
- Set config.trainer.devices = 2 (see the sketch after this list)
- Run it
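For context, the cells that hit the error look roughly like this (a sketch; the config path and variable names are assumptions based on the usual NeMo speaker_tasks layout, not the tutorial's exact code):

```python
# Sketch of the tutorial flow with the devices change applied.
# Config path and variable names are assumptions, not the tutorial's exact code.
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

config = OmegaConf.load("conf/titanet-large.yaml")  # assumed config path
config.trainer.devices = 2                          # the change that triggers the error
config.trainer.accelerator = "gpu"

trainer = pl.Trainer(**config.trainer)
speaker_model = nemo_asr.models.EncDecSpeakerLabelModel(cfg=config.model, trainer=trainer)
trainer.fit(speaker_model)  # raises the RuntimeError above when run in a notebook
```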
Environment overview (please complete the following information)
- Environment location: Bare-metal
- Method of NeMo install:
conda create --name nemo python==3.12.3
conda activate nemo
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda install pip
conda install conda-forge::cython
pip install nemo_toolkit['asr']
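A quick sanity check of the resulting environment (an optional sketch, run inside the activated `nemo` env):

```python
# Optional sanity check of the installed versions.
import torch
import nemo

print(torch.__version__, torch.version.cuda)  # expected: 2.2.2 / 12.1
print(nemo.__version__)
```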
Environment details
- OS version: Ubuntu 24.04 Beta
- PyTorch version: 2.2.2
- Python version: 3.12.3
Additional context
2x RTX4090
Looks to me like an issue with the environment.
Please run with Python 3.10.
FYI: @athitten
This is because I ran it in a Jupyter notebook.
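In case it helps others: because the notebook path falls back to the fork-based launcher, one workaround (a sketch assuming the tutorial code is otherwise unchanged, not an official NeMo recipe) is to move the multi-GPU run into a standalone script, where the regular `ddp` strategy launches worker processes without forking:

```python
# train_titanet_ddp.py -- sketch of running the same training outside the notebook.
# Config path, file name, and variable names are assumptions for illustration.
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

def main():
    config = OmegaConf.load("conf/titanet-large.yaml")  # assumed config path
    config.trainer.devices = 2
    config.trainer.accelerator = "gpu"
    config.trainer.strategy = "ddp"  # fine in a script; notebooks need a fork-based launcher

    trainer = pl.Trainer(**config.trainer)
    model = nemo_asr.models.EncDecSpeakerLabelModel(cfg=config.model, trainer=trainer)
    trainer.fit(model)

if __name__ == "__main__":
    main()
```

Staying in the notebook can also work if the kernel is restarted and no cell touches CUDA (no `torch.cuda.*` calls, no `.cuda()`/`.to("cuda")`) before `trainer.fit(...)`, since that is exactly the condition the error message describes.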