Training slows to a halt after iteration 5000

Status: Open • cr458 opened this issue 3 years ago • 8 comments

I've observed, in both the multi-GPU and single-GPU setting, that after iteration 5000 the training seems to slow to a halt.

CentOS, pytorch 1.7, training on 8 RTX 6000s with the command

stylegan2_pytorch --data data --image_size 512 --name name --multi_gpus --num_workers 0 --batch_size 40 --aug-prob 0.25 --attn-layers [1,2] --gradient-accumulate-every 1

nvidia-smi shows full GPU utilization, and I can see that some CPUs are still active. Has anyone experienced something like this, and if so, do you know the cause of this behaviour?

cr458 · Dec 18 '20 15:12

Just as counterpoint data: I've been running for a couple of days and am at iteration 22000 with no slowdown. (Windows 10, pytorch 1.7, cuda 11.2, NVidia 1080ti)

canadaduane · Dec 18 '20 16:12

@canadaduane thanks for your input! Are you using --attn-layers or --aug-prob?

cr458 · Dec 18 '20 16:12

--aug-prob, yes:

stylegan2_pytorch --data [datadir] --name headshots --results_dir results --models_dir models \
  --aug-prob 0.25 --top-k-training --image-size 256

canadaduane · Dec 18 '20 19:12

Something else that might be relevant: I have 32 GB of RAM.

canadaduane · Dec 18 '20 19:12

I just had a weird occurrence: at around 39000 iterations, progress stopped. By all indications it seemed like it was still "working on something", but it couldn't get past its current iteration (usually it takes about 5 seconds per iteration on my hardware, but it had been stuck for about half an hour). So I hit Ctrl-C in the Anaconda window, intending to restart it. Instead, the Ctrl-C seems to have nudged it back into action. I'm not sure how or why, but in any case it is crunching again.

canadaduane · Dec 19 '20 21:12

Just experienced this issue in the single-GPU setting. PyTorch 1.8a, Fedora, training on an RTX 3090 with the command stylegan2_pytorch --data ./jpg --image-size 128 --attn-layers 1 --network-capacity 64 --batch-size 8. I was getting ~4.65 s/it before iteration 5000, and after that it slowed to ~162 s/it.

Still getting full GPU utilization, with lots of memory left over both on the system and on the GPU. The only weird thing is that stylegan is running at 100% on a single CPU core and nothing else.

Something I'll note is that I did have to slightly modify my torch installation to disable the use of JIT in kornia, as there's some bug there.

I don't have any other insight; just wanted to report the replication at this iteration.

smkravec · Jan 09 '21 16:01

It's getting slow because of path length regularization. I've observed it too in the single-GPU setting. Try adding --no_pl_reg True.
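
For example, appended to a command like the one in the original post (illustrative; only --no_pl_reg is the new part, the other flags are just reused from above):

stylegan2_pytorch --data data --image_size 512 --multi_gpus --batch_size 40 --no_pl_reg True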

pjh95 · Feb 03 '21 01:02

Adding --no_pl_reg True fixed the problem, thank you.

I did a quick search and found that the path penalty does indeed kick in after step 5000: apply_path_penalty = not self.no_pl_reg and self.steps > 5000 and self.steps % 32 == 0. The question is whether turning off the path penalty will have a negative effect on the training result...? Either way, I'm turning it off because the training speed is otherwise too slow.
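
For context, path length regularization needs an extra backward pass through the generator, which is why the regularized steps (every 32nd step after step 5000, per the condition above) are so much more expensive. Below is a minimal sketch of the general StyleGAN2-style path length penalty, not this repository's exact code; the function name, tensor shapes, and decay value are illustrative assumptions.

import torch

def path_length_penalty(fake_images, w_latents, pl_mean, decay=0.99):
    # Random noise scaled by image resolution, as in the StyleGAN2 paper.
    noise = torch.randn_like(fake_images) / (fake_images.shape[2] * fake_images.shape[3]) ** 0.5
    # Extra backward pass through the generator -- this is the slow step
    # that --no_pl_reg skips.
    grad, = torch.autograd.grad(outputs=(fake_images * noise).sum(),
                                inputs=w_latents, create_graph=True)
    path_lengths = grad.pow(2).sum(dim=-1).sqrt()
    # Exponential moving average of observed path lengths.
    pl_mean_new = pl_mean + (1 - decay) * (path_lengths.mean() - pl_mean)
    penalty = (path_lengths - pl_mean_new).pow(2).mean()
    return penalty, pl_mean_new.detach()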

chiwing4 · Feb 09 '21 10:02