
Can the 2B model be fine-tuned at higher resolutions? Getting RuntimeError: integer out of range

Open QingQingS opened this issue 1 year ago • 4 comments

Can the 2B model be fine-tuned at higher resolutions? When I try training at a resolution of (720, 1024), the VAE stage fails at x_rest = torch.nn.functional.avg_pool1d(x_rest, kernel_size=2, stride=2) with RuntimeError: integer out of range.
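The thread does not pin down the cause, but one plausible reading is that at (720, 1024) the activation fed to the pooling op exceeds a 32-bit element-count/index limit inside the kernel. The sketch below only illustrates how the element count scales with resolution; the batch, channel, and frame values are placeholders, not the actual CogVideoX VAE shapes.

```python
# Back-of-the-envelope sketch (assumption: the failure is a 32-bit
# element-count limit hit inside the pooling kernel; the channel and
# frame counts below are illustrative placeholders, not the real VAE config).
INT32_MAX = 2**31 - 1

def elements(batch, channels, frames, height, width):
    # Total elements of a (B, C, T, H, W) activation before temporal pooling.
    return batch * channels * frames * height * width

for h, w in [(480, 720), (720, 1024)]:
    n = elements(batch=1, channels=64, frames=49, height=h, width=w)
    status = "over int32 limit" if n > INT32_MAX else "within int32 limit"
    print(f"{h}x{w}: {n:,} elements ({status})")
```

The point is only that (720, 1024) has roughly 2.1x the pixels of (720, 480), so intermediate activations that fit under the limit at the supported resolution can overflow at the larger one.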

QingQingS avatar Sep 13 '24 07:09 QingQingS

No, it can't. The resolution has to be 720 x 480.

zRzRzRzRzRzRzR avatar Sep 13 '24 11:09 zRzRzRzRzRzRzR

Thanks.

QingQingS avatar Sep 13 '24 12:09 QingQingS

While fine-tuning the model at 720 x 480 resolution, the following problem came up. What could be causing this?

[2024-09-13 12:06:43,971] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------------
[2024-09-13 12:06:43,972] [INFO] [RANK 0] -----------------------------------------------------------------------------------------------
[2024-09-13 12:06:43,972] [INFO] [RANK 0] validation loss at iteration 900 | loss: 9.150388E-01 | PPL: 2.496872E+00 loss 2.880348E-01 |
[2024-09-13 12:06:43,972] [INFO] [RANK 0] -----------------------------------------------------------------------------------------------
[2024-09-13 12:11:53,774] [INFO] [logging.py:96:log_dist] [Rank 0] step=950, skipped=18, lr=[0.0001864, 0.0001864], mom=[[0.9, 0.95], [0.9, 0.95]]
Traceback (most recent call last):
  File "/zjy/qqsun/code/CogVideo/sat/train_video.py", line 225, in <module>
    training_main(
  File "/usr/local/lib/python3.10/dist-packages/sat/training/deepspeed_training.py", line 157, in training_main
    iteration, skipped = train(model, optimizer,
  File "/usr/local/lib/python3.10/dist-packages/sat/training/deepspeed_training.py", line 359, in train
    lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
  File "/usr/local/lib/python3.10/dist-packages/sat/training/deepspeed_training.py", line 498, in train_step
    backward_step(optimizer, model, lm_loss, args, timers)
  File "/usr/local/lib/python3.10/dist-packages/sat/training/deepspeed_training.py", line 534, in backward_step
    model.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1967, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2057, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 244578) is killed by signal: Killed.
[2024-09-13 12:13:16,251] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 242330 closing signal SIGTERM
[2024-09-13 12:13:16,254] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 242331 closing signal SIGTERM

QingQingS avatar Sep 13 '24 12:09 QingQingS

Could the memory be insufficient? Reading video data can have relatively high I/O and memory consumption.
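"killed by signal: Killed" usually means the OS OOM killer terminated the worker process, which is consistent with host RAM running out; `dmesg -T | grep -i 'killed process'` can confirm it. If that is the case, a common mitigation is to reduce the number of DataLoader workers and the prefetch depth. The snippet below is a generic PyTorch sketch with placeholder names (`train_dataset`), not the exact configuration used by the SAT training scripts in this repo.

```python
from torch.utils.data import DataLoader

# Generic sketch: fewer worker processes and a smaller prefetch queue keep
# fewer decoded video clips resident in host RAM at once. `train_dataset`
# is a placeholder; the SAT training code builds its own loaders.
loader = DataLoader(
    train_dataset,
    batch_size=1,
    num_workers=2,        # lower this if workers keep getting OOM-killed
    prefetch_factor=2,    # batches prefetched per worker (needs num_workers > 0)
    pin_memory=False,     # pinned host memory adds to RAM pressure
    persistent_workers=True,
)
```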

zRzRzRzRzRzRzR avatar Sep 14 '24 07:09 zRzRzRzRzRzRzR