edge-connect
Using the pre-trained model to continue training
Hi Kamyar,
I am wondering whether it is possible to use your trained model as a pre-trained model and continue training with my data. If so, how do I update the pre-trained model? Is it the same as the training part in your instruction document? I still can't get past that 1000-iteration problem (training just gets stuck there and will not go beyond the 1001st iteration). I'm thinking that starting from your existing model and continuing to train on my data for 999 iterations may work better.
@Yaqiongchai Using the pre-trained model is always preferred over training from scratch. Training the network using the pre-trained model is as easy as copying the weights in your checkpoints folder!
I'm still not sure why your model does not go beyond 1000 iterations. Did you set MAX_ITERS to a value larger than 1000? How big is your dataset?
Thanks for your help! @knazeri I just copied your pre-trained model into the /checkpoints/ folder in the hope that the training code would pick it up and continue training. However, I ran into the same epoch 1 problem again. No matter how I change the batch size or num_workers, it won't work, and the model is not modified at all. I am wondering what I can do at this point. I am still at stage 1.
Does it mean the model stops training (freezes) after the first epoch? Or it actually ends with a message "End training"?
It actually ends with the message "End training".
Here's a screenshot (same as in the previous issue).
@Yaqiongchai I see your model starts training and then finishes right away. I believe there should be a minor problem with your dataset path. That means the following for loop is never executed: https://github.com/knazeri/edge-connect/blob/97c28c62ac54a59212cc9db4e78f36c5436c0b72/src/edge_connect.py#L95
You can confirm this by printing the number of images in the training set, for example with print(len(self.train_dataset)) at the beginning of the train method.
If it prints zero, double check your file list and/or dataset path!
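The same check can be run outside the training script by counting the entries in the file list itself. This is only a sketch: it assumes the .flist is a plain text file with one image path per line, and the helper name count_flist_entries is made up for illustration and is not part of edge-connect.

```python
import os
import tempfile


def count_flist_entries(flist_path):
    """Count non-empty lines in a file list; return 0 if the file is missing."""
    if not os.path.isfile(flist_path):
        return 0
    with open(flist_path, "r") as handle:
        return sum(1 for line in handle if line.strip())


# Demo with a throwaway file list (the image paths are placeholders).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_flist = os.path.join(tmp_dir, "demo_train.flist")
    with open(demo_flist, "w") as handle:
        handle.write("/data/img_0001.png\n/data/img_0002.png\n\n")
    demo_count = count_flist_entries(demo_flist)
    missing_count = count_flist_entries(os.path.join(tmp_dir, "missing.flist"))

print(demo_count, missing_count)
```

A count of zero for the path given in TRAIN_FLIST would point to a wrong path or an empty list rather than a problem in the training loop.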
Here's what I added:
And here's what it reported:
I don't think the problem is the dataset path or file list. I tried removing all the *.pth files from the checkpoints folder, and then the program runs without a problem: it goes up to around 50 epochs and 999 iterations, stops (end of training), and saves the weights. The problem with my training (999 iterations) is that it does not save the generator and discriminator separately; it only saves one .dat file. I guess that is because the number of training iterations is too small, so given the small learning rate the system is undertrained?
@Yaqiongchai It does not save the model because in your configuration SAVE_INTERVAL is set to 1000! That means training stops (after 999 iterations) before having the chance to save the model. Change SAVE_INTERVAL to a smaller value and you get your model saved!
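The interaction between the two settings can be sketched as follows. This is a simplified model, not the actual training loop: it assumes a checkpoint is written whenever the iteration count is a multiple of SAVE_INTERVAL and that the run stops after 999 iterations, as described in this thread. The function name save_iterations is hypothetical.

```python
def save_iterations(total_iterations, save_interval):
    """Return the iterations at which a checkpoint would be written."""
    if save_interval <= 0:
        # The config comments describe 0 as "never save".
        return []
    return [i for i in range(1, total_iterations + 1) if i % save_interval == 0]


never_saved = save_iterations(999, 1000)  # interval larger than the run
saved_often = save_iterations(999, 10)    # interval well inside the run

print(len(never_saved), len(saved_often))
```

Under these assumptions, an interval of 1000 never fires within 999 iterations, while an interval of 10 fires at every tenth iteration up to 990.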
@knazeri Thanks for your advice! That's a great observation. One problem solved. What do you think about the training ending at epoch 1? The 8 is just the batch size, though.
@Yaqiongchai No matter how I calculate it, it shouldn't be 1 epoch! Your snapshot shows that the size of the dataset is 72 while the batch size is 8, which means 9 iterations per epoch. 999 iterations leave 111 epochs. Am I missing a point?
Can you please post your exact dataset size and the full contents of your config.yml file here?
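The arithmetic above, written out as a short sketch. The numbers are the ones quoted in this thread; the floor division assumes an incomplete final batch would be dropped, which makes no difference here because 72 divides evenly by 8.

```python
dataset_size = 72       # images in the training set
batch_size = 8          # BATCH_SIZE in config.yml
total_iterations = 999  # iterations the run completed

# One iteration processes one batch, so an epoch is dataset_size / batch_size iterations.
iterations_per_epoch = dataset_size // batch_size
expected_epochs = total_iterations // iterations_per_epoch

print(iterations_per_epoch, expected_epochs)
```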
`MODE: 1 # 1: train, 2: test, 3: eval
MODEL: 1 # 1: edge model, 2: inpaint model, 3: edge-inpaint model, 4: joint model
MASK: 3 # 1: random block, 2: half, 3: external, 4: (external, random block), 5: (external, random block, half)
EDGE: 1 # 1: canny, 2: external
NMS: 1 # 0: no non-max-suppression, 1: applies non-max-suppression on the external edges by multiplying by Canny
SEED: 10 # random seed
GPU: [0] # list of gpu ids
DEBUG: 0 # turns on debugging mode
VERBOSE: 0 # turns on verbose mode in the output console

TRAIN_FLIST: ./datasets/m2d_train.flist
VAL_FLIST: ./datasets/places2_val.flist
TEST_FLIST: ./datasets/places2_test.flist

TRAIN_EDGE_FLIST: ./datasets/m2d_train.flist
VAL_EDGE_FLIST: ./datasets/places2_edges_val.flist
TEST_EDGE_FLIST: ./datasets/places2_edges_test.flist

TRAIN_MASK_FLIST: ./datasets/masks2nd_train.flist
VAL_MASK_FLIST: ./datasets/masks_val.flist
TEST_MASK_FLIST: ./datasets/masks_test.flist

LR: 0.0001 # learning rate
D2G_LR: 0.1 # discriminator/generator learning rate ratio
BETA1: 0.0 # adam optimizer beta1
BETA2: 0.9 # adam optimizer beta2
BATCH_SIZE: 8 # input batch size for training
INPUT_SIZE: 256 # input image size for training 0 for original size
SIGMA: 2 # standard deviation of the Gaussian filter used in Canny edge detector (0: random, -1: no edge)
MAX_ITERS: 3999 # maximum number of iterations to train the model

EDGE_THRESHOLD: 0.5 # edge detection threshold
L1_LOSS_WEIGHT: 1 # l1 loss weight
FM_LOSS_WEIGHT: 10 # feature-matching loss weight
STYLE_LOSS_WEIGHT: 250 # style loss weight
CONTENT_LOSS_WEIGHT: 1 # perceptual loss weight
INPAINT_ADV_LOSS_WEIGHT: 0.01 # adversarial loss weight

GAN_LOSS: nsgan # nsgan | lsgan | hinge
GAN_POOL_SIZE: 0 # fake images pool size

SAVE_INTERVAL: 10 # how many iterations to wait before saving model (0: never)
SAMPLE_INTERVAL: 10 # how many iterations to wait before sampling (0: never)
SAMPLE_SIZE: 12 # number of images to sample
EVAL_INTERVAL: 0 # how many iterations to wait before model evaluation (0: never)
LOG_INTERVAL: 10 # how many iterations to wait before logging training status (0: never)`
Here's my config.yml file. I have 72 images, all 256x256, and 72 masks, also 256x256.
@knazeri The size of the image is 256 by 256, same as mask file.
I have the same problem: training ends after the first epoch. I checked all 3 datasets with print(len(self.train_dataset)), and they contain thousands of images.
Does it matter that the datasets are on another drive?
@Yaqiongchai The problem is with your validation set path! You also need to provide a validation set path using VAL_FLIST. These flags are set to default and there was an infinite loop with a sampler that caused the model to stop! I have fixed the code to prevent the infinite loop, but you should also include a validation set path.
Also, two minor issues in your configuration: your values for CONTENT_LOSS_WEIGHT and INPAINT_ADV_LOSS_WEIGHT are not what we trained our models with. They should be 0.1 and 0.1.
@aryan461 I guess you might have had the same problem! Let me know if this also resolves your issue!
@knazeri It was because of self.iteration = data['iteration'] in models.py. It sets the iteration to 2000000, which is equal to MAX_ITERS. I changed MAX_ITERS in the config file. Sorry for my mistake.
@aryan461 Could you be more specific? Do we need to change anything in models.py?
@Yaqiongchai I don't think @aryan461 issue applies to yours. He was using the pre-trained model which was already trained to 2,000,000 iterations. Your problem was not having a valid validation set path in your configuration file. However, even if you decide not to have a validation set, I have fixed the code so that it would not freeze. You just need to pull the source!
@knazeri I am training now with a modified VAL_FLIST, and of course with data in the list. It seems to get stuck at the first epoch and will not move on. I also set CONTENT_LOSS_WEIGHT: 0.1 and INPAINT_ADV_LOSS_WEIGHT: 0.1 as you mentioned above.
Also, as long as I set MASK: 3 and EDGE: 1, TRAIN_EDGE_FLIST, VAL_EDGE_FLIST, and TEST_EDGE_FLIST do not matter, right? I am trying both approaches: training on just my data, and picking up the pre-trained model that I downloaded from your Google Drive. It does not seem to run smoothly.
Lastly, I'd like to add a couple of print lines in models.py to tell me that the code is picking up the previously trained model and will continue training from it:
    if torch.cuda.is_available():
        data = torch.load(self.gen_weights_path)
    else:
        data = torch.load(self.gen_weights_path, map_location=lambda storage, loc: storage)
    print(self.gen_weights_path)
    print(self.dis_weights_path)
    self.generator.load_state_dict(data['generator'])
    self.iteration = data['iteration']
Would it be the correct way to do it? Sorry to throw so many questions at you.
Hey Kamyar,
I fixed the validation dataset, loss weights, and iterations, but I still see training end at "Training epoch 1". When I check ls -trl in my checkpoints folder, the *.pth files have not been updated. I guess the code can successfully pick up the generator and discriminator, but cannot continue training.
The good news is that when I rm the *.pth files in the checkpoints folder, it trains smoothly and ends exactly at the 111th epoch, as you calculated for me before (kudos!). That being said, I can have my own model, but I am still looking for a way to use your pre-trained model.
`---------- 2019-03-13 20:31:12 ---------
Wed Mar 13 20:31:13 PDT 2019
Now start training on stage 1: inpaint model training
Loading EdgeModel generator...
iteration number is: 2000000
Loading EdgeModel discriminator...
Model configurations:

MODE: 1 # 1: train, 2: test, 3: eval
MODEL: 1 # 1: edge model, 2: inpaint model, 3: edge-inpaint model, 4: joint model
MASK: 3 # 1: random block, 2: half, 3: external, 4: (external, random block), 5: (external, random block, half)
EDGE: 1 # 1: canny, 2: external
NMS: 1 # 0: no non-max-suppression, 1: applies non-max-suppression on the external edges by multiplying by Canny
SEED: 10 # random seed
GPU: [0] # list of gpu ids
DEBUG: 0 # turns on debugging mode
VERBOSE: 0 # turns on verbose mode in the output console

TRAIN_FLIST: ./datasets/m2d_train.flist
VAL_FLIST: ./datasets/m2d_validate.flist
TEST_FLIST: ./datasets/m2d_test.flist

TRAIN_EDGE_FLIST: ./datasets/m2d.flist
VAL_EDGE_FLIST: ./datasets/m2d.flist
TEST_EDGE_FLIST: ./datasets/places2_edges_test.flist

TRAIN_MASK_FLIST: ./datasets/masks2nd_train.flist
VAL_MASK_FLIST: ./datasets/m2d_test_mask6.flist
TEST_MASK_FLIST: ./datasets/m2d_test_mask.flist

LR: 0.0001 # learning rate
D2G_LR: 0.1 # discriminator/generator learning rate ratio
BETA1: 0.0 # adam optimizer beta1
BETA2: 0.9 # adam optimizer beta2
BATCH_SIZE: 8 # input batch size for training
INPUT_SIZE: 256 # input image size for training 0 for original size
SIGMA: 2 # standard deviation of the Gaussian filter used in Canny edge detector (0: random, -1: no edge)
MAX_ITERS: 999 # maximum number of iterations to train the model

EDGE_THRESHOLD: 0.5 # edge detection threshold
L1_LOSS_WEIGHT: 1 # l1 loss weight
FM_LOSS_WEIGHT: 10 # feature-matching loss weight
STYLE_LOSS_WEIGHT: 250 # style loss weight
CONTENT_LOSS_WEIGHT: 0.1 # perceptual loss weight
INPAINT_ADV_LOSS_WEIGHT: 0.1 # adversarial loss weight

GAN_LOSS: nsgan # nsgan | lsgan | hinge
GAN_POOL_SIZE: 0 # fake images pool size

SAVE_INTERVAL: 10 # how many iterations to wait before saving model (0: never)
SAMPLE_INTERVAL: 100 # how many iterations to wait before sampling (0: never)
SAMPLE_SIZE: 6 # number of images to sample
EVAL_INTERVAL: 0 # how many iterations to wait before model evaluation (0: never)
LOG_INTERVAL: 10 # how many iterations to wait before logging training status (0: never)
start training...
Training epoch: 1 8 8
End training.... code done`
@Yaqiongchai OK, now you have the same problem other people mentioned. Since our model was trained for 2,000,000 iterations, you need to specify a MAX_ITERS larger than 2,000,000 if you wish to continue training with the pre-trained weights. Based on your configuration, the model stops training when the number of iterations is larger than 999!
"Based on your configuration, the model stops training when the number of iterations is larger than 999". Yes, the model stops training when the number of iterations is larger than 999, but that is for the case where we don't use your pre-trained model. On the other hand, if I'd like to continue training with the pre-trained weights, I'll need to set MAX_ITERS larger than 2,000,000, am I right?
@Yaqiongchai Yes!
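Why a resumed run ends immediately can be shown with a small sketch. It is a simplified model under two assumptions taken from this thread: the iteration counter is restored from the checkpoint (self.iteration = data['iteration']), and training stops once the counter reaches MAX_ITERS. The function name remaining_iterations is hypothetical and not part of edge-connect.

```python
def remaining_iterations(checkpoint_iteration, max_iters):
    """Training steps left before the counter reaches max_iters (never negative)."""
    return max(0, max_iters - checkpoint_iteration)


# Pre-trained weights already at 2,000,000 iterations, MAX_ITERS left at 999.
stalled = remaining_iterations(2_000_000, 999)

# Same weights, MAX_ITERS raised above the checkpoint's counter.
resumed = remaining_iterations(2_000_000, 2_001_000)

# Training from scratch with MAX_ITERS: 999.
from_scratch = remaining_iterations(0, 999)

print(stalled, resumed, from_scratch)
```

In this model, MAX_ITERS acts as an absolute iteration count rather than a number of additional steps, so fine-tuning for N more iterations means setting it to the checkpoint's iteration plus N.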
Dear Yaqiong,
How did you solve the problem of the training process ending instantly? Thank you very much!
Best wishes!
Knazeri suggested a few things to try. Along the way we found that the pre-trained model is available to download, so we picked it up and continued training on our dataset. Because we used the pre-trained model, we did not end up solving the problem you asked about.
The quickest check is whether your train/validation/mask datasets are set correctly. You can print them out in models.py to double check. As a newbie I was able to train from scratch, but it still ends at 2000 iterations. I did not get to solve this problem.