
Training Error

Open 1343464520 opened this issue 6 months ago • 1 comment

When training SeeSR, I run the script with the following command and arguments:

CUDA_VISIBLE_DEVICES="0,1," accelerate launch train_seesr.py \
  --pretrained_model_name_or_path="/home/hdd-sdb/SR_Code/SeeSR/preset/models/stable-diffusion-2-base" \
  --output_dir="./experience/seesr" \
  --root_folders '/home/hdd-sdb/SR_Code/SeeSR/preset/datasets/train_datasets/training_for_dape' \
  --ram_ft_path '/home/hdd-sdb/SR_Code/SeeSR/preset/models/DAPE_ir_ft_32000.pth' \
  --enable_xformers_memory_efficient_attention \
  --mixed_precision="fp16" \
  --resolution=512 \
  --learning_rate=5e-5 \
  --train_batch_size=2 \
  --gradient_accumulation_steps=2 \
  --null_text_ratio=0.5 \
  --dataloader_num_workers=1 \
  --checkpointing_steps=150000

After the models are loaded and training starts, the run crashes; the log is below. Checking nvidia-smi afterwards shows that the GPUs have disappeared (they come back after rebooting the machine). This happens on every training run: sometimes one card is lost, sometimes both. Reinstalling the GPU driver does not fix it. The GPUs are H800s. How can I solve this? Thanks!

load checkpoint from /home/hdd-sdb/SR_Code/SeeSR/preset/models/ram_swin_large_14m.pth
load checkpoint from /home/hdd-sdb/SR_Code/SeeSR/preset/models/ram_swin_large_14m.pth
vit: swin_l
vit: swin_l
load lora weights from /home/hdd-sdb/SR_Code/SeeSR/preset/models/DAPE_ir_ft_32000.pth
load lora weights from /home/hdd-sdb/SR_Code/SeeSR/preset/models/DAPE_ir_ft_32000.pth
=================Optimize ControlNet and Unet ======================
start to load optimizer...
=================Optimize ControlNet and Unet ======================
start to load optimizer...
05/13/2025 08:05:41 - INFO - __main__ - ***** Running training *****
05/13/2025 08:05:41 - INFO - __main__ -   Num examples = 29840
05/13/2025 08:05:41 - INFO - __main__ -   Num batches each epoch = 7460
05/13/2025 08:05:41 - INFO - __main__ -   Num Epochs = 1000
05/13/2025 08:05:41 - INFO - __main__ -   Instantaneous batch size per device = 2
05/13/2025 08:05:41 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 8
05/13/2025 08:05:41 - INFO - __main__ -   Gradient Accumulation steps = 2
05/13/2025 08:05:41 - INFO - __main__ -   Total optimization steps = 3730000
Steps: 0%| | 0/3730000 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/hdd-sdb/SR_Code/SeeSR/train_seesr.py", line 976, in
[rank0]:     latents = vae.encode(pixel_values).latent_dist.sample()
[rank0]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
[rank0]:     return method(self, *args, **kwargs)
[rank0]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 258, in encode
[rank0]:     h = self.encoder(x)
[rank0]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/vae.py", line 141, in forward
[rank0]:     sample = down_block(sample)
[rank0]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 1247, in forward
[rank0]:     hidden_states = resnet(hidden_states, temb=None, scale=scale)
[rank0]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/resnet.py", line 637, in forward
[rank0]:     hidden_states = self.conv1(hidden_states, scale)
[rank0]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/lora.py", line 163, in forward
[rank0]:     return F.conv2d(
[rank0]: RuntimeError: CUDA error: unspecified launch failure
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/hdd-sdb/ddj/SR_Code/SeeSR/train_seesr.py", line 976, in
[rank1]:     latents = vae.encode(pixel_values).latent_dist.sample()
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
[rank1]:     return method(self, *args, **kwargs)
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 258, in encode
[rank1]:     h = self.encoder(x)
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/vae.py", line 144, in forward
[rank1]:     sample = self.mid_block(sample)
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 562, in forward
[rank1]:     hidden_states = attn(hidden_states, temb=temb)
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 420, in forward
[rank1]:     return self.processor(
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 1019, in __call__
[rank1]:     query = attn.to_q(hidden_states, scale=scale)
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/diffusers/models/lora.py", line 224, in forward
[rank1]:     out = super().forward(hidden_states)
[rank1]:   File "/home/yjz/anaconda3/envs/YOLO/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 117, in forward
[rank1]:     return F.linear(input, self.weight, self.bias)
[rank1]: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
Steps: 0%| | 0/3730000 [00:01<?, ?it/s]
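
For reference, the numbers in the training banner appear consistent with the launch arguments, which suggests the crash is not caused by a misconfigured batch setup. The sketch below recomputes them. It assumes two processes (inferred from the rank0/rank1 traces and the two visible devices) and takes the 1000 epochs from the log; the function name `training_schedule` is hypothetical and not part of train_seesr.py.

```python
import math


def training_schedule(num_examples, per_device_batch, num_processes,
                      grad_accum, num_epochs):
    """Recompute the figures printed in the training banner (illustrative only)."""
    # Each process sees its own shard, so batches per epoch are counted per process.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_processes))
    # Effective batch size across processes and accumulation steps.
    total_batch_size = per_device_batch * num_processes * grad_accum
    # One optimizer step happens every `grad_accum` batches.
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return {
        "batches_per_epoch": batches_per_epoch,
        "total_batch_size": total_batch_size,
        "total_steps": steps_per_epoch * num_epochs,
    }


# Values from the command and log above; num_processes=2 is an assumption.
schedule = training_schedule(
    num_examples=29840,
    per_device_batch=2,
    num_processes=2,
    grad_accum=2,
    num_epochs=1000,
)
print(schedule)
```

Under these assumptions the result matches the logged values: 7460 batches per epoch, a total train batch size of 8, and 3730000 optimization steps.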

1343464520 May 14 '25 06:05