Any update on completing this?

Open mkumar10 opened this issue 3 years ago • 15 comments

Seems to be missing the training code

mkumar10 avatar Jan 27 '21 06:01 mkumar10

I'm working on equivariant attention for alphafold2 atm, will get back to this!

lucidrains avatar Jan 29 '21 06:01 lucidrains

@mkumar10 hey! I hooked up the training code today :) I haven't really sampled images to see if it is working, but at least everything runs

lucidrains avatar Feb 06 '21 21:02 lucidrains

hmm, it's not working for me :( I'll try to debug it next weekend

lucidrains avatar Feb 07 '21 00:02 lucidrains

Hi Phil,

Thank you for working on this!

I attempted to train the model (v0.11) and found that it causes memory issues on a 16 GB GPU. I used the default setup (image_size = 128, dim = 512) and a very small dataset (about a dozen images) to check if it runs. Did you run into similar issues during testing?

What data do you use for your training?

janhenrikbern avatar Feb 25 '21 15:02 janhenrikbern

@janhenrikbern yeah, I've faced the same issue when trying to run it on Colab with a 16 GB GPU. Even reducing to image_size=64, dim=128 doesn't help; it seems to change nothing. I get the same training time estimate of ~30 hours, and then the training process crashes around the 80th iteration.

Godofnothing avatar Feb 26 '21 16:02 Godofnothing

@janhenrikbern try decreasing the batch size; the default batch_size = 8 won't fit on a 16 GB GPU. I've decreased the batch size to 2 and now it occupies 9.5 GB of memory.
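For anyone tuning this, a quick way to check what a given batch size actually costs is to probe peak GPU memory after one update. A minimal sketch; `train_step` here is a hypothetical stand-in for one generator/discriminator update, not part of the repo:

```python
import torch

# Reset the peak-memory counter, run one update, then read the peak.
torch.cuda.reset_peak_memory_stats()
train_step(batch_size=2)  # hypothetical helper wrapping one G/D update
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"peak GPU memory: {peak_gb:.2f} GB")
```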

Godofnothing avatar Feb 28 '21 09:02 Godofnothing

@Godofnothing Thanks for the update, trying this myself now!

Were you able to learn anything?

janhenrikbern avatar Feb 28 '21 14:02 janhenrikbern

@janhenrikbern unfortunately, after 10,000 iterations training terminated again. Keep in mind that the resolution is increased gradually from 32 to 128, so the batches become heavier as training progresses (a 128 x 128 image has 16x the pixels of a 32 x 32 one). I'll try to rerun it.

Godofnothing avatar Feb 28 '21 15:02 Godofnothing

@janhenrikbern any attempt to run pi-GAN with the image resolution going up to 128 fails, even with the smallest batch size of 1. If one restricts the final resolution to small values, training does not terminate. However, I have not tested on good enough data, so maybe that is why I did not succeed in getting meaningful images. Another thing that confuses me is that the generator loss can be negative, whereas I would expect the loss to be a non-negative number. I am working on my own implementation in PyTorch Lightning with AMP enabled in the Trainer; maybe it will work. In any case, such memory consumption seems strange: in the original paper the authors managed to train with batches of size 120 at the initial stage, decreasing to 12 at the highest 128 x 128 resolution, on 2 RTX 6000 GPUs (48 GB of memory in total). So in principle one should be able to use a batch size of 3 or 4 at the highest resolution.
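For what it's worth, enabling AMP in Lightning is just a Trainer flag. A minimal sketch; `PiGANModule` and `train_loader` are hypothetical stand-ins for one's own LightningModule and DataLoader, and only precision=16 is the point here:

```python
import pytorch_lightning as pl

# precision=16 turns on automatic mixed precision in the Trainer.
model = PiGANModule(image_size=64, dim=128)  # hypothetical LightningModule
trainer = pl.Trainer(gpus=1, precision=16)
trainer.fit(model, train_loader)             # hypothetical DataLoader
```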

Godofnothing avatar Mar 02 '21 06:03 Godofnothing

Hello people, has anyone successfully gotten some results (even with a small batch size)? Thanks

krips89 avatar Mar 09 '21 19:03 krips89

@krips89 we created our own version in PyTorch Lightning and it runs successfully; however, with our computing resources the obtained quality was poor.

Godofnothing avatar Mar 12 '21 10:03 Godofnothing

@janhenrikbern @krips89 actually, I've found a problem in this implementation. When accumulating the generator and discriminator losses, the loss tensors are not detached, so their computation graphs are kept in memory.
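To illustrate the failure mode with a generic PyTorch sketch (not the repo's exact code; `compute_loss` and `batches` are hypothetical):

```python
import torch

# Buggy pattern: summing raw loss tensors keeps every iteration's
# autograd graph alive, so GPU memory grows with each step.
total_loss = 0.0
for batch in batches:            # hypothetical iterable of batches
    loss = compute_loss(batch)   # hypothetical loss function
    total_loss += loss           # retains the whole graph!

# Fixed pattern: accumulate only the scalar value once the graph
# is no longer needed.
total_loss = 0.0
for batch in batches:
    loss = compute_loss(batch)
    loss.backward()              # frees the graph after backprop
    total_loss += loss.item()    # .item() (or .detach()) drops the graph
```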

Godofnothing avatar Mar 15 '21 20:03 Godofnothing

Thank you for keeping us in the loop @Godofnothing! Did detaching fix the issue for you?

janhenrikbern avatar Mar 26 '21 01:03 janhenrikbern

@janhenrikbern the adapted version in PyTorch Lightning that I implemented works, but the results are not satisfactory. Actually, as mentioned in other issues, there are some discrepancies between this implementation and the architecture described in the original paper, such as the FiLM conditioning. I've contacted the authors of the paper, and they say they will release their code fairly soon, after fixing some issues and tidying up the code. I would recommend waiting for the official release.
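For anyone curious about the FiLM point: in the paper, each SIREN layer is modulated as sin(gamma * (Wx + b) + beta), with gamma and beta produced per layer by a mapping network. A minimal sketch of one such layer (class and argument names are my own, not the repo's):

```python
import torch
from torch import nn

class FiLMSirenLayer(nn.Module):
    # Sketch of a FiLM-conditioned SIREN layer: sin(gamma * (Wx + b) + beta)
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.linear = nn.Linear(dim_in, dim_out)

    def forward(self, x, gamma, beta):
        # gamma scales the frequency and beta shifts the phase of the sine
        # activation; both are predicted per layer from the latent code
        return torch.sin(gamma * self.linear(x) + beta)
```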

Godofnothing avatar Mar 26 '21 09:03 Godofnothing

Has anyone successfully reproduced the result? Thanks~ :)

Tianhang-Cheng avatar May 21 '21 08:05 Tianhang-Cheng