RuntimeError when training on multiple GPUs
Hey! I'm trying to train on multiple GPUs and consistently getting the following RuntimeError. Here's the modified line in tutorial_train.py:
trainer = pl.Trainer(gpus=8, precision=32, callbacks=[logger])
As soon as I change gpus to 1, training works fine. Anyone have ideas?
The error when training on >1 GPU
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/home/ubuntu/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:118: UserWarning: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
rank_zero_warn("You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.")
/home/ubuntu/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:280: LightningDeprecationWarning: Base `LightningModule.on_train_batch_start` hook signature has changed in v1.5. The `dataloader_idx` argument will be removed in v1.7.
rank_zero_deprecation(
/home/ubuntu/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:287: LightningDeprecationWarning: Base `Callback.on_train_batch_end` hook signature has changed in v1.5. The `dataloader_idx` argument will be removed in v1.7.
rank_zero_deprecation(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/runpy.py", line 265, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ubuntu/ControlNet/tutorial_train.py", line 35, in <module>
trainer.fit(model, dataloader)
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in fit
self._call_and_handle_interrupt(
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1193, in _run
self._dispatch()
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1272, in _dispatch
self.training_type_plugin.start_training(self)
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 173, in start_training
self.spawn(self.new_process, trainer, self.mp_queue, return_result=False)
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 201, in spawn
mp.spawn(self._wrapped_function, args=(function, args, kwargs, return_queue), nprocs=self.num_processes)
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 189, in start_processes
process.start()
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/home/ubuntu/anaconda3/envs/control/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
What I've tried
- clean-install conda environment
- with/without triton, xformers
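For context, the "proper idiom" the error message refers to is Python's standard main-module guard for spawn-based multiprocessing: with the spawn start method each worker re-imports the main script, so anything that launches processes (here, trainer.fit with DDP-spawn) has to sit under if __name__ == '__main__':. A minimal sketch of the idiom, using generic placeholders rather than the actual ControlNet objects from tutorial_train.py:
import pytorch_lightning as pl

def main():
    # Build the model, dataloader and trainer here; the placeholders below
    # stand in for the real objects created in tutorial_train.py.
    model = ...       # your LightningModule
    dataloader = ...  # your DataLoader
    trainer = pl.Trainer(gpus=2, precision=32)
    trainer.fit(model, dataloader)

if __name__ == '__main__':
    # The guard keeps this from running again when spawn re-imports the module.
    main()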
Me too. After it prints LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3], CUDA just core dumps.
Wrap lines 20-35 in if __name__ == '__main__': and run training with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python tutorial_train.py
lines 20-35 of what file?
The file you originally wrote about: tutorial_train.py
Still have the same problem
Had the same issue, and this fixed it, although I changed the trainer code to explicitly use the second GPU:
trainer = pl.Trainer(accelerator="gpu", devices=[1], precision=32, callbacks=[logger])
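For what it's worth (this is standard PyTorch Lightning behaviour rather than anything specific to this repo), devices accepts either an integer meaning "use this many GPUs" or a list of zero-based indices meaning "use exactly these GPUs", so devices=[1] picks only the second card. A small sketch of the difference:
import pytorch_lightning as pl

# Logger/callbacks omitted for brevity; only the device selection differs.
pl.Trainer(accelerator="gpu", devices=2, precision=32)       # two GPUs (indices 0 and 1 by default)
pl.Trainer(accelerator="gpu", devices=[1], precision=32)     # only the GPU with index 1
pl.Trainer(accelerator="gpu", devices=[0, 1], precision=32)  # GPUs 0 and 1 explicitly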
Wrap lines 20-35 in if __name__ == '__main__': and run training with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python tutorial_train.py
accelerator_connector.py:287: LightningDeprecationWarning: Passing `Trainer(accelerator='ddp')` has been deprecated in v1.5 and will be removed in v1.7. Use `Trainer(strategy='ddp')` instead.
I think we should use strategy='ddp'.
Hi @JunnYu, this is how I do it:
trainer = pl.Trainer(strategy="ddp", accelerator="gpu", devices=[0,1], precision=32, callbacks=[logger], max_epochs=max_epochs)
I am running it on 2X3090s.
I think the final answer is a combination of 3 answers above, by @tg-bomze, @SuroshAhmadZobair and @JunnYu. You need to apply the following modifications to the original tutorial_train.py script:
- Wrap lines 20-35 of tutorial_train.py with if __name__ == "__main__":.
- Make sure the GPUs that you want to use are visible in the terminal by running export CUDA_VISIBLE_DEVICES=0,1. If you have more GPUs, or wish to use specific GPUs, use your own IDs, for example export CUDA_VISIBLE_DEVICES=0,3,6,7. If you only have 2 GPUs, use the first command.
- Modify the training script to use the specified number of GPUs by changing the initialization of the pl.Trainer object, i.e. trainer = pl.Trainer(gpus=2, precision=32, callbacks=[logger]).
- It also turns out you need to set the strategy when initializing the pl.Trainer object, so modify that line again to get trainer = pl.Trainer(strategy='ddp', gpus=2, precision=32, callbacks=[logger]).
Eventually, your entire training script should look like this:
from share import *
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from tutorial_dataset import MyDataset
from cldm.logger import ImageLogger
from cldm.model import create_model, load_state_dict
# Configs
resume_path = './models/control_sd15_ini.ckpt'
batch_size = 2
logger_freq = 300
learning_rate = 1e-5
sd_locked = True
only_mid_control = False
if __name__ == "__main__":
# First use cpu to load models. Pytorch Lightning will automatically move it to GPUs.
model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict(resume_path, location='cpu'))
model.learning_rate = learning_rate
model.sd_locked = sd_locked
model.only_mid_control = only_mid_control
# Misc
dataset = MyDataset()
dataloader = DataLoader(dataset, num_workers=0, batch_size=batch_size, shuffle=True)
logger = ImageLogger(batch_frequency=logger_freq)
trainer = pl.Trainer(strategy='ddp', gpus=2, precision=32, callbacks=[logger])
# Train!
trainer.fit(model, dataloader)
As you can see, the modified script does not differ much from the original script. Once you have everything set up, simply run the training command:
python tutorial_train.py
It worked for me on 2xQuadro RTX 6000.
P.S. I had to reduce the batch size from 4 to 2 to make sure I don't get an out-of-memory error. If you have enough memory, you can stick to batch size 4.
P.P.S. As I said earlier, I had to combine several previous responses to make things work, and I thought it would be easier for others if these answers were merged into one. It is also possible that different people hit different problems when running things, so I cannot guarantee that this solution is 100% correct. In any case, I hope it helps.
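One thing to keep in mind when comparing memory use and throughput against single-GPU runs (this is general DDP behaviour, not something stated above): the batch_size passed to the DataLoader is per process, so the effective batch size per optimizer step scales with the number of GPUs. A quick worked example under that assumption:
num_gpus = 2
per_gpu_batch_size = 2  # the batch_size passed to the DataLoader in the script above
effective_batch_size = num_gpus * per_gpu_batch_size  # 4 samples per optimizer step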
So I did what @MikaYeghi suggested (a total copy-paste), yet my code is still not running properly on 2x RTX 3090.
If I run it on 2 GPUs, it outputs this and hangs there:
No module 'xformers'. Proceeding without it.
ControlLDM: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loaded model config from [./models/cldm_v15.yaml]
Loaded state_dict from [./models/control_sd15_ini.ckpt]
initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
If I run it with only 1 GPU, like the default tutorial, it runs well and the output looks like this:
No module 'xformers'. Proceeding without it.
ControlLDM: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loaded model config from [./models/cldm_v15.yaml]
Loaded state_dict from [./models/control_sd15_ini.ckpt]
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/home/sd4/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:118: UserWarning: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
rank_zero_warn("You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.")
/home/sd4/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:280: LightningDeprecationWarning: Base `LightningModule.on_train_batch_start` hook signature has changed in v1.5. The `dataloader_idx` argument will be removed in v1.7.
rank_zero_deprecation(
/home/sd4/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:287: LightningDeprecationWarning: Base `Callback.on_train_batch_end` hook signature has changed in v1.5. The `dataloader_idx` argument will be removed in v1.7.
rank_zero_deprecation(
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
| Name | Type | Params
---------------------------------------------------------
0 | model | DiffusionWrapper | 859 M
1 | first_stage_model | AutoencoderKL | 83.7 M
2 | cond_stage_model | FrozenCLIPEmbedder | 123 M
3 | control_model | ControlNet | 361 M
---------------------------------------------------------
1.2 B Trainable params
206 M Non-trainable params
1.4 B Total params
5,710.058 Total estimated model params size (MB)
/home/sd4/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:110: UserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Epoch 0: 0%| | 0/25000 [00:00<?, ?it/s]/home/sd4/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:56: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 2. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
warning_cache.warn(
Data shape for DDIM sampling is (2, 4, 64, 64), eta 0.0
Running DDIM Sampling with 50 timesteps
DDIM Sampler: 100%|█████████████████████████████| 50/50 [00:35<00:00, 1.39it/s]
Epoch 0: 0%| | 7/25000 [00:49<48:54:16, 7.04s/it, loss=0.0126, v_num=26, trai
Any ideas?
What I also noticed: with the 2-GPU setup, my cards show 100% GPU utilization but only draw around 170 W out of 350 or 420 W (one card is rated 350 W, the other 420 W). Also, VRAM usage only reaches 6-7 GB instead of the 22 GB used when running properly on 1 GPU. Any ideas what could be wrong? I tried many things yesterday.
Update:
- I just had the idea that Conda could be the problem. Should I stick with a pip venv? Or is it CUDA? I believe I have CUDA 12.3 with PyTorch compiled for 12.1.
I think I also had a similar issue at some point, but I resolved it easily... Can you check if you have all the GPUs available and visible? Maybe try running the export CUDA_VISIBLE_DEVICES=0,1 command? Your problem rings a bell, but I can't recall how I resolved it.
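A quick way to confirm what the training process can actually see, a small sketch in plain PyTorch run from the same environment and shell you launch training from:
import torch

print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())  # respects CUDA_VISIBLE_DEVICES
for i in range(torch.cuda.device_count()):
    print(f"  [{i}] {torch.cuda.get_device_name(i)}")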
I think the GPUs are available properly. I did not try this export command though, as I managed to successfully run a similar tutorial, but from Diffusers, which uses accelerate; in accelerate I set 2 GPUs and that way it works fine on this machine. Still, I would like to run this vanilla PyTorch tutorial, as I don't like Diffusers much. I will try tomorrow when training is done. Thanks for the help anyway :)
Hi @blacklig, were you able to run the vanilla PyTorch tutorial on a multi-GPU setup? I'm facing similar issues.