When training SeeSR, I run the script with the following command and arguments:
CUDA_VISIBLE_DEVICES="0,1," accelerate launch train_seesr.py --pretrained_model_name_or_path="/home/hdd-sdb/SR_Code/SeeSR/preset/models/stable-diffusion-2-base" --output_dir="./experience/seesr" --root_folders '/home/hdd-sdb/SR_Code/SeeSR/preset/datasets/train_datasets/training_for_dape' --ram_ft_path '/home/hdd-sdb/SR_Code/SeeSR/preset/models/DAPE_ir_ft_32000.pth' --enable_xformers_memory_efficient_attention --mixed_precision="fp16" --resolution=512 --learning_rate=5e-5 --train_batch_size=2 --gradient_accumulation_steps=2 --null_text_ratio=0.5 --dataloader_num_workers=1 --checkpointing_steps=150000
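The process crashes as soon as training starts after the models are loaded; the log is below. Afterwards nvidia-smi shows that GPUs have disappeared (they come back after rebooting the machine). This happens on every training run: sometimes one GPU is lost, sometimes both. Reinstalling the GPU driver did not help. The GPUs are H800s. How can I fix this? Thanks!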
load checkpoint from /home/hdd-sdb/SR_Code/SeeSR/preset/models/ram_swin_large_14m.pth
load checkpoint from /home/hdd-sdb/SR_Code/SeeSR/preset/models/ram_swin_large_14m.pth
vit: swin_l
vit: swin_l
load lora weights from /home/hdd-sdb/SR_Code/SeeSR/preset/models/DAPE_ir_ft_32000.pth
load lora weights from /home/hdd-sdb/SR_Code/SeeSR/preset/models/DAPE_ir_ft_32000.pth
=================Optimize ControlNet and Unet ======================
start to load optimizer...
=================Optimize ControlNet and Unet ======================
start to load optimizer...
05/13/2025 08:05:41 - INFO - __main__ - ***** Running training *****
05/13/2025 08:05:41 - INFO - __main__ - Num examples = 29840
05/13/2025 08:05:41 - INFO - __main__ - Num batches each epoch = 7460
05/13/2025 08:05:41 - INFO - __main__ - Num Epochs = 1000
05/13/2025 08:05:41 - INFO - __main__ - Instantaneous batch size per device = 2
05/13/2025 08:05:41 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 8
05/13/2025 08:05:41 - INFO - __main__ - Gradient Accumulation steps = 2
05/13/2025 08:05:41 - INFO - __main__ - Total optimization steps = 3730000
Steps: 0%| | 0/3730000 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/hdd-sdb/SR_Code/SeeSR/train_seesr.py", line 976, in <module>
[rank0]: latents = vae.encode(pixel_values).latent_dist.sample()
[rank0]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
[rank0]: return method(self, *args, **kwargs)
[rank0]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 258, in encode
[rank0]: h = self.encoder(x)
[rank0]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/vae.py", line 141, in forward
[rank0]: sample = down_block(sample)
[rank0]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 1247, in forward
[rank0]: hidden_states = resnet(hidden_states, temb=None, scale=scale)
[rank0]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/resnet.py", line 637, in forward
[rank0]: hidden_states = self.conv1(hidden_states, scale)
[rank0]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/lora.py", line 163, in forward
[rank0]: return F.conv2d(
[rank0]: RuntimeError: CUDA error: unspecified launch failure
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/hdd-sdb/ddj/SR_Code/SeeSR/train_seesr.py", line 976, in <module>
[rank1]: latents = vae.encode(pixel_values).latent_dist.sample()
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
[rank1]: return method(self, *args, **kwargs)
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 258, in encode
[rank1]: h = self.encoder(x)
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/vae.py", line 144, in forward
[rank1]: sample = self.mid_block(sample)
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 562, in forward
[rank1]: hidden_states = attn(hidden_states, temb=temb)
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 420, in forward
[rank1]: return self.processor(
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 1019, in __call__
[rank1]: query = attn.to_q(hidden_states, scale=scale)
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/lora.py", line 224, in forward
[rank1]: out = super().forward(hidden_states)
[rank1]: File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 117, in forward
[rank1]: return F.linear(input, self.weight, self.bias)
[rank1]: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
Steps: 0%| | 0/3730000 [00:01<?, ?it/s]
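
Since the GPUs disappear from nvidia-smi until a reboot, the kernel log may say more than the Python traceback does. Below is a minimal sketch of how the kernel log could be scanned for NVIDIA Xid events after a crash. The helper names (find_xid_events, read_kernel_log) are hypothetical, and the regular expression assumes the driver writes lines of the form "NVRM: Xid (PCI:...): <code>, ..."; nothing here was taken from the machine in this report.

```python
import re
import subprocess

# Assumed line shape (not from this report): "NVRM: Xid (PCI:0000:17:00): 79, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")


def find_xid_events(log_text):
    """Return (pci_device, xid_code) pairs found in kernel log text."""
    return [
        (match.group("device"), int(match.group("code")))
        for match in XID_PATTERN.finditer(log_text)
    ]


def read_kernel_log():
    """Read the kernel ring buffer with dmesg; this may need root on some systems."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout
```

Calling find_xid_events(read_kernel_log()) right after a crash, before rebooting, would list any Xid codes together with the PCI address of the affected GPU, which could then be added to this report.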