Can the 2B model be fine-tuned at higher resolutions? When I tried training at a resolution of (720, 1024), the VAE-stage line `x_rest = torch.nn.functional.avg_pool1d(x_rest, kernel_size=2, stride=2)` raised `RuntimeError: integer out of range`.
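For what it's worth, one common cause of `integer out of range` in the pooling kernels is the flattened input having more elements than a 32-bit index can address, which is easy to hit at (720, 1024) once the latent is reshaped to (B, C, T*H*W). Below is a minimal sketch of a possible workaround under that assumption; the `chunked_avg_pool1d` helper and the toy shapes are my own placeholders, not CogVideo code. Because pooling acts independently per channel, splitting along the channel dimension gives the same result as one big call:

```python
import torch
import torch.nn.functional as F

def chunked_avg_pool1d(x, kernel_size=2, stride=2, chunks=4):
    """Hypothetical workaround: pool a (B, C, L) tensor in channel chunks so no
    single avg_pool1d call sees more elements than a 32-bit index can address.
    Pooling is independent per channel, so concatenating the chunked results
    matches pooling the full tensor in one call."""
    parts = [F.avg_pool1d(p, kernel_size=kernel_size, stride=stride)
             for p in x.chunk(chunks, dim=1)]
    return torch.cat(parts, dim=1)

# Toy check on a small tensor (placeholder shape, not the real latent).
x_rest = torch.randn(1, 8, 64)
assert torch.allclose(chunked_avg_pool1d(x_rest),
                      F.avg_pool1d(x_rest, kernel_size=2, stride=2))
```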
While fine-tuning the 720x480 model, I ran into the problem below. What could be causing it?
[2024-09-13 12:06:43,971] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------------
[2024-09-13 12:06:43,972] [INFO] [RANK 0] -----------------------------------------------------------------------------------------------
[2024-09-13 12:06:43,972] [INFO] [RANK 0] validation loss at iteration 900 | loss: 9.150388E-01 | PPL: 2.496872E+00 loss 2.880348E-01 |
[2024-09-13 12:06:43,972] [INFO] [RANK 0] -----------------------------------------------------------------------------------------------
[2024-09-13 12:11:53,774] [INFO] [logging.py:96:log_dist] [Rank 0] step=950, skipped=18, lr=[0.0001864, 0.0001864], mom=[[0.9, 0.95], [0.9, 0.95]]
Traceback (most recent call last):
  File "/zjy/qqsun/code/CogVideo/sat/train_video.py", line 225, in <module>
    training_main(
  File "/usr/local/lib/python3.10/dist-packages/sat/training/deepspeed_training.py", line 157, in training_main
    iteration, skipped = train(model, optimizer,
  File "/usr/local/lib/python3.10/dist-packages/sat/training/deepspeed_training.py", line 359, in train
    lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
  File "/usr/local/lib/python3.10/dist-packages/sat/training/deepspeed_training.py", line 498, in train_step
    backward_step(optimizer, model, lm_loss, args, timers)
  File "/usr/local/lib/python3.10/dist-packages/sat/training/deepspeed_training.py", line 534, in backward_step
    model.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1967, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2057, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 244578) is killed by signal: Killed.
[2024-09-13 12:13:16,251] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 242330 closing signal SIGTERM
[2024-09-13 12:13:16,254] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 242331 closing signal SIGTERM
Is this due to insufficient memory? The I/O cost of reading the videos may be relatively high.
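A DataLoader worker "killed by signal: Killed" is usually the kernel OOM killer reaping the process, so host RAM exhaustion (aggravated by video decoding inside the workers) is a plausible explanation. As a minimal sketch of the kind of settings that reduce host-RAM pressure, fewer workers and a smaller prefetch queue keep fewer decoded clips in memory at once; the `DummyClips` dataset and the concrete numbers below are placeholders, not the repo's actual config:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class DummyClips(Dataset):
    """Stand-in for the video dataset, only here to make the sketch runnable."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.zeros(3, 16, 64, 64)  # placeholder (C, T, H, W) clip

# Fewer workers and a smaller prefetch queue mean fewer decoded clips held in
# host RAM per rank, which is the usual lever when workers get OOM-killed.
loader = DataLoader(DummyClips(), batch_size=1, num_workers=2,
                    prefetch_factor=2, pin_memory=False)
for clip in loader:
    pass
```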