
How to train on a custom dataset?

panovr opened this issue 6 years ago • 53 comments

My dataset has 100 classes, and the directory structure is like:

data/images
|-001
  |-001_0001.jpg
  |-001_0002.jpg
  |-...
|-002
  |-002_0001.jpg
  |-002_0002.jpg
...

That is, every folder represents a separate class.

May I ask how to train on this custom dataset?

panovr avatar Dec 31 '18 12:12 panovr

Hi @panovr,

You can use the torchvision.datasets.ImageFolder utility from PyTorch. Full documentation is here. Just provide the argument root="data/images" in order to create your custom dataset.
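
For reference, a minimal sketch of what that looks like with the directory layout described above (the ToTensor transform is only a placeholder for a real pipeline):

from torchvision import datasets, transforms

# ImageFolder treats each sub-directory of `root` as one class and
# returns (image, class_index) pairs.
dataset = datasets.ImageFolder(
    root="data/images",
    transform=transforms.ToTensor(),  # placeholder; substitute your full transform
)

print(dataset.classes)   # ['001', '002', ...] -- the folder names become class labels
img, label = dataset[0]  # `label` is the integer index of the image's sub-folder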

Hope this helps.

Best regards, @akanimax

akanimax avatar Jan 01 '19 06:01 akanimax

@akanimax I notice that DataLoader.py has a FoldersDistributedDataset class and a get_data_loader method. Should I use them?

I also notice that train_network.py can load a configuration file like celeba-hq.conf. For my custom dataset, which configuration file can I use for reference?

panovr avatar Jan 01 '19 23:01 panovr

@panovr,

Sure, you can use the FoldersDistributedDataset dataset from the DataLoader.py module. However, I am not aware of what your use case is: are you training an unconditional ProGAN or a conditional one? The FoldersDistributedDataset is for the unconditional ProGAN; it will ignore the sub-directories in your dataset, which presumably correspond to your classes. For a conditional one, you will need to use torchvision's ImageFolder dataset as I mentioned earlier. You will not need get_data_loader, as I have moved that functionality into pro-gan-pth itself. Just create the dataset and you are done :smile:.

For the configuration file, you can reuse any of the existing ones, although I'd suggest you create a copy of one of the base configurations and rename it for your use case. Also, please change the paths to match your system.

Hope this helps!

Best regards, @akanimax

akanimax avatar Jan 02 '19 04:01 akanimax

@akanimax

  1. Since every folder in my custom dataset represents a separate class, I think this should be trained as a Conditional ProGAN.

  2. By the way, does training require every image in the different sub-folders to have the same dimensions?

panovr avatar Jan 02 '19 11:01 panovr

@panovr,

  1. You will need torchvision.datasets.ImageFolder for the conditional dataset.

  2. No, you can have arbitrarily sized images in the folders. Just make sure to pass the transform returned by get_transform from DataLoader.py to the ImageFolder instance; it resizes the images on the fly during training.

Hope this helps.

Best regards, @akanimax

akanimax avatar Jan 02 '19 14:01 akanimax

@akanimax

I will do image resizing with the code below:

import os
from torchvision import datasets, transforms

# `args.data` comes from the surrounding script's argument parser
datadir = os.path.join(args.data, 'images')
dataset = datasets.ImageFolder(
        datadir,
        transforms.Compose([
            transforms.Resize((128, 128)),
            transforms.ToTensor(),
        ]))
  • May I ask how I can determine the ProGAN output dimension? For example, after training, I want to generate 512x512 images.

Thanks!

panovr avatar Jan 03 '19 07:01 panovr

I use PyTorch 1.0, and below is my training code:

import os

import torch as th
import pro_gan_pytorch.PRO_GAN as pg
from torchvision import datasets, transforms

device = th.device("cuda" if th.cuda.is_available() else "cpu")
data_path = "data/"

def setup_data():
    datadir = os.path.join(data_path, 'images')
    dataset = datasets.ImageFolder(
        datadir,
        transforms.Compose([
            transforms.Resize((128,128)),
            transforms.ToTensor(),
        ]))

    return dataset


if __name__ == '__main__':
    depth = 4
    num_epochs = [10, 20, 20, 20]
    fade_ins = [50, 50, 50, 50]
    batch_sizes = [16, 16, 16, 16]
    latent_size = 128

    dataset = setup_data()

    pro_gan = pg.ConditionalProGAN(num_classes=100, depth=depth, 
                                   latent_size=latent_size, device=device)

    pro_gan.train(
        dataset=dataset,
        epochs=num_epochs,
        fade_in_percentage=fade_ins,
        batch_sizes=batch_sizes
    )

There are some errors during training:

Starting the training process ...


Currently working on Depth:  0
Current resolution: 4 x 4

Epoch: 1
Traceback (most recent call last):
  File "pro.py", line 41, in <module>
    batch_sizes=batch_sizes
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pro_gan_pytorch\PRO_GAN.py", line 1046, in train
    labels, current_depth, alpha)
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pro_gan_pytorch\PRO_GAN.py", line 865, in optimize_discriminator
    labels, depth, alpha)
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pro_gan_pytorch\Losses.py", line 343, in dis_loss
    real_out = self.dis(real_samps, labels, height, alpha)
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\parallel\data_parallel.py", line 141, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pro_gan_pytorch\PRO_GAN.py", line 305, in forward
    out = self.final_block(y, labels)
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pro_gan_pytorch\CustomLayers.py", line 449, in forward
    projection_scores = (y_ * labels).sum(dim=-1)  # [B]
RuntimeError: The size of tensor a (13) must match the size of tensor b (128) at non-singleton dimension 3

panovr avatar Jan 03 '19 22:01 panovr

Hi @panovr,

Apologies for the late reply; I was laden with work for the last couple of days. The code that you are using has the following two problems, the first of which is the reason you are getting the error.

1.) There is a mismatch between the size of your highest resolution images and the depth of your network. If you are resizing the images to 128 x 128, then your depth should be depth = 6. Please note that the depth calculation starts from 4 x 4, so if you want to generate images of size 32 x 32, your depth should be depth = 4. Please adjust it accordingly (see the sketch after point 2).

2.) This is something I noticed which might cause problems in training (although the code will run fine). Please also add Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) as the last transform in your Compose() in order to bring the images into the range [-1, 1].
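
To make both points concrete, here is a small sketch (assuming the 128 x 128 target size from your snippet):

import math
from torchvision import transforms

# Point 1: the resolution at a given depth is 4 * 2 ** (depth - 1),
# so the depth needed for a target resolution is log2(resolution) - 1.
target_resolution = 128
depth = int(math.log2(target_resolution)) - 1   # 128 -> 6, 32 -> 4, 512 -> 8

# Point 2: normalize to [-1, 1] as the last step of the transform pipeline.
transform = transforms.Compose([
    transforms.Resize((target_resolution, target_resolution)),
    transforms.ToTensor(),                                   # pixels in [0, 1]
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # shift and scale to [-1, 1]
])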

Hope this helps.

Please let me know if you are still facing any issues.

Best regards, @akanimax

akanimax avatar Jan 04 '19 06:01 akanimax

Hi @akanimax, the explanation of the depth parameter is really helpful, and I also added Normalize following your suggestion.

I will let you know when the training is completed.

Thanks!

panovr avatar Jan 04 '19 07:01 panovr

@panovr, You're welcome :smile:! Glad to be helpful.

Best regards, @akanimax

akanimax avatar Jan 04 '19 07:01 akanimax

@akanimax

The training completed; here is one of the last generated samples: gen_5_30_366

My dataset has 100 classes, with 60 images per class. The images are all birds, but every folder represents a different kind of bird. I have some questions:

  1. Should I train an Unconditional ProGAN instead of the conditional one in order to generate higher quality images? (I think the conditional one may be more difficult to train than the unconditional one for my dataset.)

  2. For the conditional one, the current parameters are:

depth = 6
num_epochs = [30, 30, 30, 30, 30, 30]
fade_ins = [50, 50, 50, 50, 50, 50]
batch_sizes = [16, 16, 16, 16, 16, 16]
latent_size = 128

Maybe I need to adjust some parameters, like increasing num_epochs?

  3. What is the meaning of the condition in the pro_gan_pytorch package? I mean, in StackGAN the condition is some words like "black bird" that guide the generator.

panovr avatar Jan 04 '19 23:01 panovr

@panovr,

Your preliminary results are very promising, given that you used a very small model and trained for a relatively short amount of time. I believe you can get this model to work. Please consider the following suggestions:

1.) I think you should first try to train the conditional one a bit more. If, after this experimentation, you find that it doesn't work, you can always fall back to the unconditional one.

2.) Please try to make these changes to the hyperparameters (translated into code in the sketch after this list): since you have 6,000 images in all in your dataset, try increasing num_epochs to somewhere between 200 and 300, the same for every resolution. The fade-in percentages seem fine (50 is what is mentioned in the paper). Your batch size is also too low; could you try a batch size of maybe 32 or 64? Finally, using latent_size = 128 spawns a very small model, so please try a model with latent_size = 512. If you cannot fit this latent_size in GPU memory, try reducing the batch_size to 24 (not less than that). If even this doesn't work, try latent_size = 256 with a good batch_size.

3.) In this case the images are conditioned on a one-hot vector of the numeric class that the image belongs to, similar to CIFAR-10, MNIST, ImageNet, etc.
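
To make point 2 concrete, here is one way those suggestions could look in code (just a sketch; adjust to whatever actually fits in your GPU memory):

depth = 6                                     # 128 x 128 final resolution, as above
num_epochs = [250, 250, 250, 250, 250, 250]   # somewhere in the 200-300 range per resolution
fade_ins = [50, 50, 50, 50, 50, 50]           # 50% fade-in, as in the paper
batch_sizes = [32, 32, 32, 32, 32, 32]        # or 64 if it fits; not below 24 with latent_size=512
latent_size = 512                             # fall back to 256 if GPU memory is too tight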

Also, please refer to this video -> https://www.youtube.com/watch?v=lzTm6Lq76Mo&t=4s for the traits of a proper training run. Basically, check whether you are obtaining a sort of convergence at the end of every resolution's stabilizing iterations, and adjust your hyperparameters accordingly if needed.

Hope this helps. Please feel free to ask me if you have any more questions.

Best regards, @akanimax

akanimax avatar Jan 05 '19 05:01 akanimax

@akanimax

I made these changes to the hyper-parameters:

  • num_epochs = 256
  • batch_sizes = 32
  • latent_size = 512

I have two 1080 Ti GPUs, but I can only use one. When I use the code:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

There will be some errors.

Anyway, this will be a long training process.

panovr avatar Jan 05 '19 11:01 panovr

Hey!

Allow me to join the discussion. I've been trying to use a custom dataset with the same data structure as @panovr, but I haven't been able to train successfully.

I get the error:

--------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-2-338b5ae66ade> in <module>
     64         epochs=num_epochs,
     65         fade_in_percentage=fade_ins,
---> 66         batch_sizes=batch_sizes
     67     )
     68     # ======================================================================

~/github/models/pro_gan_pytorch/PRO_GAN.py in train(self, dataset, epochs, batch_sizes, fade_in_percentage, start_depth, num_workers, feedback_factor, log_dir, sample_dir, save_dir, checkpoint_factor)
   1054 
   1055                     # provide a loss feedback
-> 1056                     print(feedback_factor)
   1057                     print(total_batches)
   1058                     if i % int(total_batches / feedback_factor) == 0 or i == 1:

ZeroDivisionError: integer division or modulo by zero

My code is the following:

import torch as th
import torchvision as tv
import pro_gan_pytorch.PRO_GAN as pg

from torchvision import transforms
import torchvision

TRAIN_DATA_PATH = '/home/jovyan/github/models/imagesProcessed/'

# select the device to be used for training
device = th.device("cuda" if th.cuda.is_available() else "cpu")

def setup_data(download=False):
    """
    setup the custom ImageFolder dataset for training
    (docstring adapted from the original CIFAR-10 example)
    :param download: unused here, kept for interface compatibility
    :return: classes, trainset, testset => class names plus training and testing datasets
    """
    # data setup:
    TRANSFORM_IMG = transforms.Compose([
        transforms.Resize(128),
        #transforms.CenterCrop(256),
        transforms.ToTensor(),
        #transforms.ToPILImage(mode='RGB'),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    trainset = torchvision.datasets.ImageFolder(root=TRAIN_DATA_PATH, transform=TRANSFORM_IMG)
    
    testset = torchvision.datasets.ImageFolder(root=TRAIN_DATA_PATH, transform=TRANSFORM_IMG)
    
    classes = trainset.classes

    return classes, trainset, testset


if __name__ == '__main__':

    # some parameters:
    depth = 6
    # hyper-parameters per depth (resolution)
    num_epochs = [10, 20, 20, 20, 20, 20]
    fade_ins = [50, 50, 50, 50, 50, 50]
    batch_sizes = [32, 32, 32, 32, 32, 32]
    latent_size = 128

    # get the data. Ignore the test data and their classes
    _, dataset, _ = setup_data(download=True)

    # ======================================================================
    # This line creates the PRO-GAN
    # ======================================================================
    pro_gan = pg.ConditionalProGAN(num_classes=len(dataset.classes), depth=depth, 
                                   latent_size=latent_size, device=device)
    # ======================================================================

    # ======================================================================
    # This line trains the PRO-GAN
    # ======================================================================
    pro_gan.train(
        dataset=dataset,
        epochs=num_epochs,
        fade_in_percentage=fade_ins,
        batch_sizes=batch_sizes
    )
    # ====================================================================== 

I haven't made many changes, just swapping in the ImageFolder dataset from PyTorch and adjusting the training parameters according to depth=6.

If possible, could you share a snapshot of your code, @panovr? I would very much like to see how you adapted PRO_GAN to your custom dataset.

jiwidi avatar Jan 07 '19 08:01 jiwidi

Hi @jiwidi,

The default value of feedback_factor is too high in your case; please try setting it to a lower value such as 10. Refer to the documentation of the train method to learn more about the parameters it takes and their default values. The feedback_factor needs to be less than the number of batches you train per epoch.
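
For example (just a sketch; the feedback_factor keyword appears in the train signature shown in your traceback):

pro_gan.train(
    dataset=dataset,
    epochs=num_epochs,
    fade_in_percentage=fade_ins,
    batch_sizes=batch_sizes,
    feedback_factor=10,  # must stay below the number of batches per epoch
)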

@panovr, Could you share the status and any sample of your current training?

Best regards, @akanimax

akanimax avatar Jan 07 '19 08:01 akanimax

@akanimax

The training has still not completed (2 days), but I encountered a GPU memory error:

Currently working on Depth:  5
Current resolution: 128 x 128

Epoch: 1

RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 11.00 GiB total capacity; 8.43 GiB already allocated; 168.83 MiB free; 160.46 MiB cached)

gen_4_256_184

  • Can I resume the training from the point where it stopped?

Thanks!

panovr avatar Jan 08 '19 12:01 panovr

@panovr That means you are allocating too much memory on your GPU; you should either reduce the batch size or the image size.
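
For example (just a sketch), since batch_sizes is specified per depth, you can keep larger batches at the low resolutions and shrink them only where memory runs out:

# one batch size per depth: 4x4, 8x8, 16x16, 32x32, 64x64, 128x128
batch_sizes = [64, 64, 32, 32, 16, 8]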

@akanimax Thanks! I think that fixed it. I can't access my home computer right now to run the code again, as it currently gives me memory errors (not enough RAM), but as soon as I can, I'll be back with some feedback.

jiwidi avatar Jan 08 '19 14:01 jiwidi

Could someone share their code to generate images? I tried the snippet in the README, but I have some problems with the state dict structure. This is the error with the snippet from the README:

D:\pro_gan_pytorch-examples\implementation>python generate2.py
Traceback (most recent call last):
  File "generate2.py", line 10, in <module>
    th.load("training_runs/mydata/saved_models/GAN_GEN_5.pth")
  File "C:\Users\castle\Envs\progan_ex\lib\site-packages\torch\nn\modules\module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Generator:
        Missing key(s) in state_dict: "initial_block.conv_1.weight", "initial_block.conv_1.bias", "initial_block.conv_2.weight", "initial_block.conv_2.bias", "layers.0.conv_1.weight", "layers.0.conv_1.bias", "layers.0.conv_2.weight", "layers.0.conv_2.bias", "layers.1.conv_1.weight", "layers.1.conv_1.bias", "layers.1.conv_2.weight", "layers.1.conv_2.bias", "layers.2.conv_1.weight", "layers.2.conv_1.bias", "layers.2.conv_2.weight", "layers.2.conv_2.bias", "layers.3.conv_1.weight", "layers.3.conv_1.bias", "layers.3.conv_2.weight", "layers.3.conv_2.bias", "layers.4.conv_1.weight", "layers.4.conv_1.bias", "layers.4.conv_2.weight", "layers.4.conv_2.bias", "rgb_converters.0.weight", "rgb_converters.0.bias", "rgb_converters.1.weight", "rgb_converters.1.bias", "rgb_converters.2.weight", "rgb_converters.2.bias", "rgb_converters.3.weight", "rgb_converters.3.bias", "rgb_converters.4.weight", "rgb_converters.4.bias", "rgb_converters.5.weight", "rgb_converters.5.bias".
        Unexpected key(s) in state_dict: "module.initial_block.conv_1.weight", "module.initial_block.conv_1.bias", "module.initial_block.conv_2.weight", "module.initial_block.conv_2.bias", "module.layers.0.conv_1.weight", "module.layers.0.conv_1.bias", "module.layers.0.conv_2.weight", "module.layers.0.conv_2.bias", "module.layers.1.conv_1.weight", "module.layers.1.conv_1.bias", "module.layers.1.conv_2.weight", "module.layers.1.conv_2.bias", "module.layers.2.conv_1.weight", "module.layers.2.conv_1.bias", "module.layers.2.conv_2.weight", "module.layers.2.conv_2.bias", "module.layers.3.conv_1.weight", "module.layers.3.conv_1.bias", "module.layers.3.conv_2.weight", "module.layers.3.conv_2.bias", "module.layers.4.conv_1.weight", "module.layers.4.conv_1.bias", "module.layers.4.conv_2.weight", "module.layers.4.conv_2.bias", "module.layers.5.conv_1.weight", "module.layers.5.conv_1.bias", "module.layers.5.conv_2.weight", "module.layers.5.conv_2.bias", "module.layers.6.conv_1.weight", "module.layers.6.conv_1.bias", "module.layers.6.conv_2.weight", "module.layers.6.conv_2.bias", "module.rgb_converters.0.weight", "module.rgb_converters.0.bias", "module.rgb_converters.1.weight", "module.rgb_converters.1.bias", "module.rgb_converters.2.weight", "module.rgb_converters.2.bias", "module.rgb_converters.3.weight", "module.rgb_converters.3.bias", "module.rgb_converters.4.weight", "module.rgb_converters.4.bias", "module.rgb_converters.5.weight", "module.rgb_converters.5.bias", "module.rgb_converters.6.weight", "module.rgb_converters.6.bias", "module.rgb_converters.7.weight", "module.rgb_converters.7.bias".

I then tried to wrap it in DataParallel using this code

import torch as th
import pro_gan_pytorch.PRO_GAN as pg
import matplotlib.pyplot as plt

device = th.device("cpu")
gen = th.nn.DataParallel(pg.Generator(depth=5))
gen.load_state_dict(th.load("training_runs/mydata/saved_models/GAN_GEN_4.pth", map_location=str(device)))

noise = th.randn(1, 256).to(device)

sample_image = gen(noise, depth=5, alpha=1).detach()

plt.imshow(sample_image[0].permute(1, 2, 0) / 2 + 0.5)
plt.show()

But I am not sure how to set the latent_size, which I set to 256 during training. The error I get is this:

D:\pro_gan_pytorch-examples\implementation>python generate.py
Traceback (most recent call last):
  File "generate.py", line 8, in <module>
    gen.load_state_dict(th.load("training_runs/mydata/saved_models/GAN_GEN_4.pth", map_location=str(device)))
  File "C:\Users\castle\Envs\progan_ex\lib\site-packages\torch\nn\modules\module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DataParallel:
        Unexpected key(s) in state_dict: "module.layers.4.conv_1.weight", "module.layers.4.conv_1.bias", "module.layers.4.conv_2.weight", "module.layers.4.conv_2.bias", "module.layers.5.conv_1.weight", "module.layers.5.conv_1.bias", "module.layers.5.conv_2.weight", "module.layers.5.conv_2.bias", "module.layers.6.conv_1.weight", "module.layers.6.conv_1.bias", "module.layers.6.conv_2.weight", "module.layers.6.conv_2.bias", "module.rgb_converters.5.weight", "module.rgb_converters.5.bias", "module.rgb_converters.6.weight", "module.rgb_converters.6.bias", "module.rgb_converters.7.weight", "module.rgb_converters.7.bias".
        size mismatch for module.initial_block.conv_1.weight: copying a param with shape torch.Size([256, 256, 4, 4]) from checkpoint, the shape in current model is torch.Size([512, 512, 4, 4]).
        size mismatch for module.initial_block.conv_1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for module.initial_block.conv_2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
        size mismatch for module.initial_block.conv_2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for module.layers.0.conv_1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
        size mismatch for module.layers.0.conv_1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for module.layers.0.conv_2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
        size mismatch for module.layers.0.conv_2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for module.layers.1.conv_1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
        size mismatch for module.layers.1.conv_1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for module.layers.1.conv_2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
        size mismatch for module.layers.1.conv_2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for module.layers.2.conv_1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
        size mismatch for module.layers.2.conv_1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for module.layers.2.conv_2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
        size mismatch for module.layers.2.conv_2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for module.layers.3.conv_1.weight: copying a param with shape torch.Size([128, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 512, 3, 3]).
        size mismatch for module.layers.3.conv_1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for module.layers.3.conv_2.weight: copying a param with shape torch.Size([128, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
        size mismatch for module.layers.3.conv_2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for module.rgb_converters.0.weight: copying a param with shape torch.Size([3, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 512, 1, 1]).
        size mismatch for module.rgb_converters.1.weight: copying a param with shape torch.Size([3, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 512, 1, 1]).
        size mismatch for module.rgb_converters.2.weight: copying a param with shape torch.Size([3, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 512, 1, 1]).
        size mismatch for module.rgb_converters.3.weight: copying a param with shape torch.Size([3, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 512, 1, 1]).
        size mismatch for module.rgb_converters.4.weight: copying a param with shape torch.Size([3, 128, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 256, 1, 1]).

(progan_ex) D:\pro_gan_pytorch-examples\implementation>

Ianmcmill avatar Jan 24 '19 14:01 Ianmcmill

@Ianmcmill,

As you said, there is a mismatch between your latent_size at training time and at restore time. Refer to the documentation of the Generator module here.

You just need to set latent_size in your code like the following:

gen = th.nn.DataParallel(pg.Generator(depth=5, latent_size=256))

This should solve your problem.

Also, please do share your results; I'd be happy to feature them in the repo.

Best regards, @akanimax

akanimax avatar Jan 25 '19 05:01 akanimax

@akanimax @jiwidi I think a GTX 1080 Ti with 11 GB of memory still can't support training on my custom dataset. :disappointed: I will try decreasing batch_sizes to 16 and try again.

panovr avatar Jan 25 '19 06:01 panovr

@akanimax @jiwidi I think a GTX 1080 Ti with 11 GB of memory still can't support training on my custom dataset. 😞 I will try decreasing batch_sizes to 16 and try again.

You can also try to train your network with a smaller latent_size. This is a trade-off, of course, but then you can keep your batch_size considerably higher. Google Colab also offers free Tensor Processing Units (TPUs); it would be nice to see some GANs ported to the TPU architecture. PyTorch is making progress on TPU compatibility, and Keras should also be usable. AFAIK one TPU has 32 GB, and the higher the batch_size, the faster the processing. So your batch_size of 16 becomes 16*8.

Ianmcmill avatar Jan 25 '19 08:01 Ianmcmill

Tried this

import torch as th
import pro_gan_pytorch.PRO_GAN as pg
import matplotlib.pyplot as plt

device = th.device("cuda" if th.cuda.is_available() else "cpu")
gen = th.nn.DataParallel(pg.Generator(depth=6, latent_size=256))
# gen = pg.Generator(depth=6, latent_size=256, use_eql=False).to(device)
gen.load_state_dict(th.load("training_runs/mydata/saved_models/GAN_GEN_5.pth"))

noise = th.randn(1, 128).to(device)

sample_image = gen(noise, depth=6, alpha=1).detach()

plt.imshow(sample_image[0].permute(1, 2, 0) / 2 + 0.5)
plt.show()

But I still get an unexpected keys error:

Traceback (most recent call last):
  File "generate2.py", line 18, in <module>
    gen.load_state_dict(new_state_dict)
  File "C:\Users\castle\Envs\progan_ex\lib\site-packages\torch\nn\modules\module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DataParallel:
        Missing key(s) in state_dict: "module.initial_block.conv_1.weight", "module.initial_block.conv_1.bias", "module.initial_block.conv_2.weight", "module.initial_block.conv_2.bias", "module.layers.0.conv_1.weight", "module.layers.0.conv_1.bias", "module.layers.0.conv_2.weight", "module.layers.0.conv_2.bias", "module.layers.1.conv_1.weight", "module.layers.1.conv_1.bias", "module.layers.1.conv_2.weight", "module.layers.1.conv_2.bias", "module.layers.2.conv_1.weight", "module.layers.2.conv_1.bias", "module.layers.2.conv_2.weight", "module.layers.2.conv_2.bias", "module.rgb_converters.0.weight", "module.rgb_converters.0.bias", "module.rgb_converters.1.weight", "module.rgb_converters.1.bias", "module.rgb_converters.2.weight", "module.rgb_converters.2.bias", "module.rgb_converters.3.weight", "module.rgb_converters.3.bias".

In the PyTorch forum it is suggested to remove the module. prefix (https://discuss.pytorch.org/t/solved-keyerror-unexpected-key-module-encoder-embedding-weight-in-state-dict/1686/4), but I am afraid I am not proficient enough to do this.
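
For reference, the fix suggested in that thread amounts to something like this sketch: strip the "module." prefix that DataParallel adds from every key, then load into a plain Generator (the depth and latent_size below are placeholders and must match the training run):

import torch as th
import pro_gan_pytorch.PRO_GAN as pg

device = th.device("cuda" if th.cuda.is_available() else "cpu")
gen = pg.Generator(depth=6, latent_size=256).to(device)  # must match the trained model

state_dict = th.load("training_runs/mydata/saved_models/GAN_GEN_5.pth",
                     map_location=str(device))
# the checkpoint was saved from a DataParallel-wrapped model, so every key starts with "module."
state_dict = {key.replace("module.", "", 1): value for key, value in state_dict.items()}
gen.load_state_dict(state_dict)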

Ianmcmill avatar Jan 25 '19 09:01 Ianmcmill

Also, when testing demo.py from https://github.com/akanimax/pro_gan_pytorch and modifying gen = th.nn.DataParallel(pg.Generator(depth=5, latent_size=256)), I get the unexpected keys error. Why does it work on your machine and why is it not reproducible on other setups, i.e. a single GPU?

Edit: I also tried the pretrained models from the Drive link in pro_gan_pytorch, and there I get the same unexpected key error.

Ianmcmill avatar Jan 25 '19 11:01 Ianmcmill

I started a test training run with just one epoch. I tried to remove multi-GPU support and therefore commented out the DataParallel wrapping in PRO_GAN.py:

        # if code is to be run on GPU, we can use DataParallel:
        #if device == th.device("cuda"):
        #    self.gen = DataParallel(self.gen)
        #    self.dis = DataParallel(self.dis)

and generated samples with generate2.py:

import torch as th
import pro_gan_pytorch.PRO_GAN as pg
import matplotlib.pyplot as plt

device = th.device("cuda:0" if th.cuda.is_available() 
                   else "cpu")
gen = pg.Generator(depth=1, latent_size=128, 
                   use_eql=False).to(device)

gen.load_state_dict(
    th.load("training_runs/portrait2/saved_models/GAN_GEN_0.pth")
)

noise = th.randn(1, 128).to(device)

sample_image = gen(noise, depth=1, alpha=1).detach()

plt.imshow(sample_image[0].permute(1, 2, 0) / 2 + 0.5)
plt.show()

But I get this error when running python generate2.py:

Traceback (most recent call last):
  File "generate2.py", line 11, in <module>
    th.load("training_runs/portrait2/saved_models/GAN_GEN_0.pth")
  File "C:\Users\castle\Envs\progan_ex\lib\site-packages\torch\nn\modules\module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Generator:
        Unexpected key(s) in state_dict: "layers.0.conv_1.weight", "layers.0.conv_1.bias", "layers.0.conv_2.weight", "layers.0.conv_2.bias", "layers.1.conv_1.weight", "layers.1.conv_1.bias", "layers.1.conv_2.weight", "layers.1.conv_2.bias", "layers.2.conv_1.weight", "layers.2.conv_1.bias", "layers.2.conv_2.weight", "layers.2.conv_2.bias", "layers.3.conv_1.weight", "layers.3.conv_1.bias", "layers.3.conv_2.weight", "layers.3.conv_2.bias", "layers.4.conv_1.weight", "layers.4.conv_1.bias", "layers.4.conv_2.weight", "layers.4.conv_2.bias", "layers.5.conv_1.weight", "layers.5.conv_1.bias", "layers.5.conv_2.weight", "layers.5.conv_2.bias", "rgb_converters.1.weight", "rgb_converters.1.bias", "rgb_converters.2.weight", "rgb_converters.2.bias", "rgb_converters.3.weight", "rgb_converters.3.bias", "rgb_converters.4.weight", "rgb_converters.4.bias", "rgb_converters.5.weight", "rgb_converters.5.bias", "rgb_converters.6.weight", "rgb_converters.6.bias".

I am trying to generate samples on my local GTX 980 from a network I have partially trained (around 62 hours as of the time of writing) on a single Google Colab K80 GPU. The samples generated during the training process prove that samples can be generated, but I don't understand how to generate them myself. The hyper-parameters are the following:

# Hyperparameters for the Model
img_dims:
  - 512
  - 512

# Pro GAN hyperparameters
use_eql: True
depth: 8
latent_size: 512
learning_rate: 0.001
beta_1: 0
beta_2: 0.99
eps: 0.00000001
drift: 0.001
n_critic: 1
use_ema: True
ema_decay: 0.999

# Training hyperparameters:
epochs:
  - 100
  - 100
  - 100
  - 90
  - 80
  - 64
  - 64
  - 64

Ianmcmill avatar Jan 27 '19 12:01 Ianmcmill

I managed to get it generating.

  1. The latent animation demo: in demo.py from pro_gan_pytorch I changed the following and made some comments:
depth = 3
num_points = 10
transition_points = 1
# ==========================================================================

# create the device for running the demo:
device = th.device("cuda" if th.cuda.is_available() else "cpu")

# load the model for the demo
# depth= must be equal to the final depth set in .conf file, even if it hasn't been trained so far.
# latent_size= must be equal to the latent_size set in .conf file
gen = th.nn.DataParallel(pg.Generator(depth=7, latent_size=512))
gen.load_state_dict(th.load("training_runs/portrait/saved_models/GAN_GEN_3.pth"))


# function to generate an image given a latent_point
def get_image(point):
    img = gen(point, depth=depth, alpha=1).detach().squeeze(0).permute(1, 2, 0)
    img = (img - img.min()) / (img.max() - img.min())
    return img.cpu().numpy()


# generate the set of points:
# th.randn last number must be equal to latent_size set in .conf file
fixed_points = th.randn(num_points, 512).to(device)
  2. Generating single images. This code is ugly, I think:
import torch as th
import pro_gan_pytorch.PRO_GAN as pg
import torchvision.utils as vutils



device = th.device("cuda" if th.cuda.is_available() 
                   else "cpu")
gen = th.nn.DataParallel(pg.Generator(depth=8, latent_size=512))
gen.load_state_dict(th.load("training_runs/portrait/saved_models/GAN_GEN_5.pth", map_location=str(device)))


for x in range(10): 
    noise = th.randn(1, 512).to(device)
    sample_image = gen(noise, depth=5, alpha=1).detach()
    vutils.save_image(sample_image[0, :, :, :], 'portrait_' + str(x) + '.png')

I generated samples during the training of the model and noticed some oddities. If you train a network with a final depth of 8, stop training during depth 4, and use the last code above to generate a sample at depth 4, the generated images have false colors. However, if you generate a sample from GAN_GEN_4.pth with depth=3, you get images with the correct color, or at least the color the model thinks they should have.

I am currently training a ProGAN with a 10k image dataset on a Google Colab K80 (12 GB). These are the parameters:

# Hyperparameters for the Model
img_dims:
  - 512
  - 512

# Pro GAN hyperparameters
use_eql: True
depth: 8
latent_size: 512
learning_rate: 0.001
beta_1: 0
beta_2: 0.99
eps: 0.00000001
drift: 0.001
n_critic: 1
use_ema: True
ema_decay: 0.999

# Training hyperparameters:
epochs:
  - 64
  - 64
  - 64
  - 80
  - 80
  - 80
  - 80
  - 80

Currently it's training at depth 5 (128x128), on the 16th epoch. Each epoch takes 40 minutes, and each Colab instance runs around 10 hours before it gets reset by Google. To complete depth 5 it will take around 5 days, factoring in sleeping, restarting the instance, waiting, sleeping, and restarting the instance again. I wonder how long depth 6 (256x256) will take.

Ianmcmill avatar Jan 30 '19 13:01 Ianmcmill

Hi guys! First of all, apologies for this late reply; I was travelling all this time, so I couldn't get back to you. I'll try to address all the prior messages:

1.) @panovr

I think GTX 1080 Ti with 11 GB memory still can't support to train on my custom dataset. disappointed I will try to decrease batch_sizes to 16 and try again.

Could you please try to pull the latest package version using pip install -U pro-gan-pth? Apparently there was a memory leak in the previous version; refer here for more info. It is highly unlikely that a GTX 1080 can't fit a batch size of 32 at 256 x 256 resolution.

2.) @Ianmcmill I apologize that you had to figure it all out on your own (I would have loved to help). Nevertheless, you are correct about most of it. I'll just clarify a few things:

# load the model for the demo
# depth= must be equal to the final depth set in .conf file, even if it hasn't been trained so far.
# latent_size= must be equal to the latent_size set in .conf file

That's absolutely right!

I generated samples during the training of the model and realized some oddities. If you train a network with a final depth of 8 and stop training during depth 4 and you use the last code in here to generate a sample from depth 4, the images generated have false colors.

Well, this isn't fully correct. I'll point a couple of things out: firstly, note that the depth provided when creating a model is indexed from 1, so pg.Generator(depth=8) creates a model with a final resolution of 512 x 512, while the depth used for generating images is indexed from 0; thus, to generate 512 x 512 images from that same model, you need gen(point, depth=7, alpha=1).
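
A quick sketch of the two conventions for a 512 x 512 model:

import math

resolution = 512
constructor_depth = int(math.log2(resolution)) - 1  # 8 -> pg.Generator(depth=8, ...)
generation_depth = constructor_depth - 1            # 7 -> gen(noise, depth=7, alpha=1)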

Secondly, the color oddity is a result of stopping the training either during the fade-in process or very early in the stabilization iterations. Refer to this video to see the same effect; notice how the color gets thrown out of range during fade-in and then deepens again during stabilization.

I am currently training a progan with a 10k image dataset on Google Colab K80 12gb These are the parameters

This is really cool. Perhaps you could contribute a section to the README.md on how to run this code on Google Colab; I personally have never done that.

Lastly, @panovr and @Ianmcmill, if you could paste some of the samples generated so far in your training in this thread, I could surely try to help with tweaking the training as well.

Best regards, @akanimax

akanimax avatar Jan 31 '19 05:01 akanimax

@akanimax No problem. The code for generating single images from a trained model is not very convenient, but this is what my knowledge of Python allowed me to do. Adding argparse would be nice for easier use; options like load_snapshot and sample_amount would be helpful. I first have to learn how to use argparse and apply some modularity to the code to make it nicer.

Ianmcmill avatar Jan 31 '19 12:01 Ianmcmill

@akanimax The sample below was generated with these settings:

depth = 6
num_epochs = [256, 256, 256, 256, 256, 256]
fade_ins = [50, 50, 50, 50, 50, 50]
batch_sizes = [16, 16, 16, 16, 16, 16]
latent_size = 512

gen_5_256_224

I have started training with the latest ProGAN package and batch_size = 32, and am waiting...

panovr avatar Feb 02 '19 07:02 panovr

@akanimax There were some errors with this setting:

 depth = 7
 num_epochs = [256, 256, 256, 256, 256, 256, 256]
 fade_ins = [50, 50, 50, 50, 50, 50, 50]
 batch_sizes = [32, 32, 32, 32, 32, 32, 32]
 latent_size = 512
Starting the training process ...

Currently working on Depth:  0
Current resolution: 4 x 4

Epoch: 1
Traceback (most recent call last):
  File "pro.py", line 42, in <module>
    batch_sizes=batch_sizes
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pro_gan_pytorch\PRO_GAN.py", line 1046, in train
    labels, current_depth, alpha)
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pro_gan_pytorch\PRO_GAN.py", line 865, in optimize_discriminator
    labels, depth, alpha)
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pro_gan_pytorch\Losses.py", line 346, in dis_loss
    real_out = self.dis(real_samps, labels, height, alpha)
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\parallel\data_parallel.py", line 141, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pro_gan_pytorch\PRO_GAN.py", line 305, in forward
    out = self.final_block(y, labels)
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pro_gan_pytorch\CustomLayers.py", line 442, in forward
    y = self.lrelu(self.conv_2(y))  # [B x C x 1 x 1]
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pro_gan_pytorch\CustomLayers.py", line 52, in forward
    padding=self.pad)
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM

Changing to depth=6 seems to work.

panovr avatar Feb 02 '19 08:02 panovr

@panovr RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM occurs when img_dims: does not match the depth set in your config. Example of an incorrect config:

img_dims:
  - 128
  - 128
depth: 7

Correct:

img_dims:
  - 256
  - 256
depth: 7

Or, put another way: for img_dims: of 64x64 you would need to set the depth to 5.

Ianmcmill avatar Feb 02 '19 09:02 Ianmcmill