RuntimeError: output is too large
I have pretrained the model on 64x64 images. Now I am at the super-resolution stage and want to get 256x256 images.
python train.py --outdir=./training-runs/styleganxl_training_reduced_256 --cfg=stylegan3-t --data=./data/styleganxl_training_reduced256.zip \
--gpus=2 --batch=24 --mirror=1 --snap 10 --batch-gpu 12 --kimg 10000 --syn_layers 10 --cond True --mirror True --cbase 16384 --cmax 256 --syn_layers 7 \
--superres --up_factor 4 --head_layers 4 \
--path_stem training-runs/styleganxl_training_reduced_64/00000-stylegan3-t-styleganxl_training_reduced64-gpus2-batch176/best_model.pkl
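(For reference, the super-resolution stage resolution is the stem resolution times --up_factor, so 64 x 4 = 256 here. A tiny sanity check of that arithmetic, using the values from the command above; the variable names are just for illustration:

stem_res = 64            # resolution the stem model was pretrained at
up_factor = 4            # value passed via --up_factor
target_res = stem_res * up_factor
assert target_res == 256  # matches the 256x256 dataset
)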
Training fails with the following error: RuntimeError: output is too large.
Setting up augmentation...
Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...
/RP1/mydocker/Ben/mtr/stylegan_xl/torch_utils/ops/filtered_lrelu.py:225: RuntimeWarning: filtered_lrelu called with parameters that have no optimized CUDA kernel, using generic fallback
warnings.warn("filtered_lrelu called with parameters that have no optimized CUDA kernel, using generic fallback", RuntimeWarning)
Initializing logs...
Training for 10000 kimg...
/RP1/mydocker/Ben/mtr/stylegan_xl/torch_utils/ops/filtered_lrelu.py:225: RuntimeWarning: filtered_lrelu called with parameters that have no optimized CUDA kernel, using generic fallback
warnings.warn("filtered_lrelu called with parameters that have no optimized CUDA kernel, using generic fallback", RuntimeWarning)
Traceback (most recent call last):
File "/RP1/mydocker/Ben/mtr/stylegan_xl/train.py", line 336, in <module>
main() # pylint: disable=no-value-for-parameter
File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/RP1/mydocker/Ben/mtr/stylegan_xl/train.py", line 321, in main
launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
File "/RP1/mydocker/Ben/mtr/stylegan_xl/train.py", line 106, in launch_training
torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/RP1/mydocker/Ben/mtr/stylegan_xl/train.py", line 49, in subprocess_fn
training_loop.training_loop(rank=rank, **c)
File "/RP1/mydocker/Ben/mtr/stylegan_xl/training/training_loop.py", line 339, in training_loop
loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, gain=phase.interval, cur_nimg=cur_nimg)
File "/RP1/mydocker/Ben/mtr/stylegan_xl/training/loss.py", line 121, in accumulate_gradients
loss_Gmain.backward()
File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 147, in backward
Variable._execution_engine.run_backward(
File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/torch/autograd/function.py", line 87, in apply
return self._forward_cls.backward(self, *args) # type: ignore[attr-defined]
File "/RP1/mydocker/Ben/mtr/stylegan_xl/torch_utils/ops/filtered_lrelu.py", line 264, in backward
dx = _filtered_lrelu_cuda(up=down, down=up, padding=pp, gain=gg, slope=slope, clamp=None, flip_filter=ff).apply(dy, fd, fu, None, si, sx, sy)
File "/RP1/mydocker/Ben/mtr/stylegan_xl/torch_utils/ops/filtered_lrelu.py", line 228, in forward
y = upfirdn2d.upfirdn2d(x=y, f=fu, up=up, padding=[px0, px1, py0, py1], gain=up**2, flip_filter=flip_filter) # Upsample.
File "/RP1/mydocker/Ben/mtr/stylegan_xl/torch_utils/ops/upfirdn2d.py", line 161, in upfirdn2d
return _upfirdn2d_cuda(up=up, down=down, padding=padding, flip_filter=flip_filter, gain=gain).apply(x, f)
File "/RP1/mydocker/Ben/mtr/stylegan_xl/torch_utils/ops/upfirdn2d.py", line 245, in forward
y = _plugin.upfirdn2d(y, f.unsqueeze(1), 1, upy, 1, downy, 0, 0, pady0, pady1, flip_filter, gain)
RuntimeError: output is too large
However, I can run this command successfully:
python train.py --outdir=./training-runs/styleganxl_training_reduced_128 --cfg=stylegan3-t --data=./data/styleganxl_training_reduced128.zip \
--gpus=2 --batch=32 --mirror=1 --snap 10 --batch-gpu 16 --kimg 10000 --syn_layers 10 --cond True --mirror True --cbase 16384 --cmax 256 --syn_layers 7 \
--superres --up_factor 2 --head_layers 4 \
--path_stem training-runs/styleganxl_training_reduced_64/00000-stylegan3-t-styleganxl_training_reduced64-gpus2-batch176/best_model.pkl
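To help narrow this down, the failing call chain ends in torch_utils/ops/upfirdn2d.py, so the op can be exercised in isolation. A rough sketch (run from the stylegan_xl repo root; the tensor shape and filter below are guesses, not the actual activation and filter at the failing layer):

import torch
from torch_utils.ops import upfirdn2d  # same module that appears in the traceback

# Hypothetical activation; increase the shape until the error reproduces.
x = torch.randn(12, 512, 256, 256, device='cuda')
f = upfirdn2d.setup_filter([1, 3, 3, 1], device='cuda')
y = upfirdn2d.upfirdn2d(x=x, f=f, up=2, gain=4)  # 2x upsample with gain=up**2, as in the traceback
print(y.shape)

If this standalone call already raises "output is too large", the problem is simply the tensor size hitting the CUDA kernel's limit rather than anything specific to the training loop.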
Hi, I am also hitting the same issue. Please share any clues if you have them. Thanks.
Hi @BenjiKCF, I was running the training on a V100 32 GB node. After I decreased the batch size to 1, it no longer threw the error. I'd suggest trying the same.
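The "output is too large" check appears to be raised inside the upfirdn2d CUDA kernel when the output tensor exceeds its size limit, so shrinking the per-GPU batch (and with it the intermediate activations) is a reasonable workaround. For illustration, this is the original 256x256 command with only the batch options lowered to --batch=2 --batch-gpu 1 (a larger --batch with gradient accumulation may also work, as long as it stays a multiple of --gpus times --batch-gpu):

python train.py --outdir=./training-runs/styleganxl_training_reduced_256 --cfg=stylegan3-t --data=./data/styleganxl_training_reduced256.zip \
--gpus=2 --batch=2 --mirror=1 --snap 10 --batch-gpu 1 --kimg 10000 --syn_layers 10 --cond True --mirror True --cbase 16384 --cmax 256 --syn_layers 7 \
--superres --up_factor 4 --head_layers 4 \
--path_stem training-runs/styleganxl_training_reduced_64/00000-stylegan3-t-styleganxl_training_reduced64-gpus2-batch176/best_model.pkl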