Training slows to a halt after iteration 5000

Status: Open • cr458 opened this issue 3 years ago • 8 comments

I've observed, in both the multi-GPU and single-GPU setting, that after iteration 5000 the training seems to slow to a halt.

CentOS, pytorch 1.7, training on 8 RTX 6000s with the command

stylegan2_pytorch --data data --image_size 512 --name name --multi_gpus --num_workers 0 --batch_size 40 --aug-prob 0.25 --attn-layers [1,2] --gradient-accumulate-every 1

nvidia-smi shows full GPU utilization, and I can see that some CPUs are still active. Has anyone experienced something like this, and if so, do you know the cause of this behaviour?

cr458 · Dec 18 '20 15:12

Just as counterpoint data: I've been running for a couple of days and am at iteration 22000 with no slowdown. (Windows 10, pytorch 1.7, cuda 11.2, NVidia 1080ti)

canadaduane · Dec 18 '20 16:12

@canadaduane thanks for your input! Are you using --attn-layers or --aug-prob?

cr458 · Dec 18 '20 16:12

--aug-prob, yes:

stylegan2_pytorch --data [datadir] --name headshots --results_dir results --models_dir models \
  --aug-prob 0.25 --top-k-training --image-size 256

canadaduane · Dec 18 '20 19:12

Something else that might be relevant: I have 32 GB of RAM.

canadaduane · Dec 18 '20 19:12

I just had a weird occurrence: at around 39000 iterations, progress stopped. By all indications it seemed like it was still "working on something", but it couldn't get past its current iteration (usually it takes about 5 seconds per iteration on my hardware, but it had been stuck for about half an hour). So I hit Ctrl-C in the Anaconda window, intending to restart it. Instead, the Ctrl-C seems to have nudged it back into action. I'm not sure how or why, but in any case it is crunching again.

canadaduane · Dec 19 '20 21:12

Just experienced this issue in the single-GPU setting. PyTorch 1.8a, Fedora, training on an RTX 3090 with the command stylegan2_pytorch --data ./jpg --image-size 128 --attn-layers 1 --network-capacity 64 --batch-size 8. I was getting ~4.65 s/it before iteration 5000, and after that it slowed to ~162 s/it.

Still getting full GPU utilization, with lots of memory left over both on the system and on the GPU. The only weird thing is that stylegan is running at 100% on a single CPU core and nothing else.

Something I'll note is that I did have to slightly modify my torch installation to disable the use of JIT in kornia, as there's some bug there.

I don't have any other insight; just wanted to report the replication at this iteration.

smkravec · Jan 09 '21 16:01

It's getting slow because of path length regularization. I've observed it too in the single-GPU setting. Try adding --no_pl_reg True.
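
For example, appended to a command like the one in the original post (illustrative; only --no_pl_reg is the new part, the other flags are just reused from above):

stylegan2_pytorch --data data --image_size 512 --multi_gpus --batch_size 40 --no_pl_reg True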

pjh95 · Feb 03 '21 01:02

Adding --no_pl_reg True fixed the problem, thank you.

I did a quick search and found that the path penalty does indeed kick in after step 5000: apply_path_penalty = not self.no_pl_reg and self.steps > 5000 and self.steps % 32 == 0. The question is whether turning off the path penalty will have a negative effect on the training result...? Either way, I'm turning it off because the training speed is otherwise too slow.
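
For context, path length regularization needs an extra backward pass through the generator, which is why the regularized steps (every 32nd step after step 5000, per the condition above) are so much more expensive. Below is a minimal sketch of the general StyleGAN2-style path length penalty, not this repository's exact code; the function name, tensor shapes, and decay value are illustrative assumptions.

import torch

def path_length_penalty(fake_images, w_latents, pl_mean, decay=0.99):
    # Random noise scaled by image resolution, as in the StyleGAN2 paper.
    noise = torch.randn_like(fake_images) / (fake_images.shape[2] * fake_images.shape[3]) ** 0.5
    # Extra backward pass through the generator -- this is the slow step
    # that --no_pl_reg skips.
    grad, = torch.autograd.grad(outputs=(fake_images * noise).sum(),
                                inputs=w_latents, create_graph=True)
    path_lengths = grad.pow(2).sum(dim=-1).sqrt()
    # Exponential moving average of observed path lengths.
    pl_mean_new = pl_mean + (1 - decay) * (path_lengths.mean() - pl_mean)
    penalty = (path_lengths - pl_mean_new).pow(2).mean()
    return penalty, pl_mean_new.detach()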

chiwing4 · Feb 09 '21 10:02