
training process too slow

zzzzhuque opened this issue · 12 comments

Hello, I use a 1080Ti to train on 200,000 images, each 128*128. However, training is very slow: it takes almost 4 hours to finish one epoch, which is different from what you report in your paper. Could you please tell me why?

zzzzhuque avatar Nov 07 '18 07:11 zzzzhuque

I do not remember the time per iteration we had. Check the wait time for a new batch; I strongly believe ours was 0, and if yours is not zero it is definitely slowing you down. If that is the case, the easy first try is to increase the number of threads. If it still is not 0, then what I do is run a profiler across the data loader and identify the bottlenecks. Sorry I do not have more insights.
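That batch-wait check can be sketched as follows. `profile_loader` is a hypothetical helper (not part of GANimation) that works with any PyTorch-style iterable loader:

```python
import time

def profile_loader(loader, n_batches=50):
    """Average time spent waiting for each batch.

    If the mean wait is well above zero, the data pipeline (not the
    GPU) is the bottleneck; increasing the loader's worker/thread
    count or profiling the dataset's __getitem__ is the next step.
    """
    waits = []
    it = iter(loader)
    for _ in range(n_batches):
        t0 = time.time()
        try:
            next(it)
        except StopIteration:
            break
        waits.append(time.time() - t0)
    return sum(waits) / max(len(waits), 1)

# Stand-in for a real DataLoader: ten pre-built "batches".
mean_wait = profile_loader([list(range(8))] * 10)
```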

albertpumarola avatar Nov 09 '18 15:11 albertpumarola

After the training process finished, I tested the model, and the result is below.

[output image: n_0000000203_00647.jpg_out]

I don't know what happened to my dataset. I use this command to generate AUs from EmotioNet:

```
./bin/FaceLandmarkImg -nomask -simsize 128 -fdir -wild -out_dir
```

I use the following function to filter gray images:

```python
from PIL import Image
import cv2 as cv
import numpy as np

def isgray(convert_img):
    im = Image.open(convert_img)
    if im.mode != 'RGB':
        return True  # non-RGB modes treated as gray
    chip = cv.imread(convert_img)
    r, g, b = cv.split(chip)
    r = r.astype(np.float32)
    g = g.astype(np.float32)
    b = b.astype(np.float32)
    s_w, s_h = r.shape[:2]
    x = (r + b + g) / 3
    r_gray = abs(r - x)
    g_gray = abs(g - x)
    b_gray = abs(b - x)
    r_sum = np.sum(r_gray) / (s_w * s_h)
    g_sum = np.sum(g_gray) / (s_w * s_h)
    b_sum = np.sum(b_gray) / (s_w * s_h)
    gray_degree = (r_sum + g_sum + b_sum) / 3
    if gray_degree < 10:
        return True   # gray image
    else:
        return False  # color image
```

I use the following (threshold = 400) to filter low-resolution images:

```python
threshold = 400
res_ratio = cv.Laplacian(img2, cv.CV_64F).var()
is_sharp = res_ratio > threshold
```
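For reference, the same sharpness measure can be reproduced without OpenCV; this is a pure-NumPy stand-in (my own sketch, not from the repo) for `cv.Laplacian(img, cv.CV_64F).var()`:

```python
import numpy as np

def laplacian_var(img):
    """Variance of a 4-neighbour Laplacian over the image interior.

    Low values mean little high-frequency content, i.e. a blurry or
    low-detail image; the thread uses a threshold of 400 with OpenCV.
    """
    img = img.astype(np.float64)
    lap = (-4.0 * img[1:-1, 1:-1]
           + img[:-2, 1:-1] + img[2:, 1:-1]
           + img[1:-1, :-2] + img[1:-1, 2:])
    return lap.var()

flat = np.full((16, 16), 128, dtype=np.uint8)   # constant image: no detail
noisy = np.random.default_rng(0).integers(0, 256, (16, 16), dtype=np.uint8)
```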

OpenFace generates images in BMP format, so I use this to save the BMP images as JPG:

```python
def bmp2jpg(origin_img):
    img = Image.open(origin_img)
    convert_img = origin_img[:-4] + '.jpg'
    img.save(convert_img)
    return convert_img
```

I prepared 200,000 images: 100,000 image names are written in train_ids.csv, 100,000 names in test_ids.csv, and all images are placed in imgs.

Could you please tell me what's wrong with my dataset?

zzzzhuque avatar Nov 12 '18 12:11 zzzzhuque

This seems more like a problem of not being able to load the weights. This is the usual output for random weights.

albertpumarola avatar Nov 12 '18 12:11 albertpumarola

This is the output after I run python test.py. Are there any problems?

```
------------ Options -------------
aus_file: aus_openface.pkl
batch_size: 4
checkpoints_dir: ./checkpoints
cond_nc: 17
data_dir: None
dataset_mode: aus
do_saturate_mask: False
gpu_ids: [0]
image_size: 128
images_folder: imgs
input_path: /home/zhutao/N_0000000203_00647.jpg
is_train: False
load_epoch: 30
model: ganimation
n_threads_test: 1
name: experiment_1
output_dir: ./output
serial_batches: False
test_ids_file: test_ids.csv
train_ids_file: train_ids.csv
-------------- End ----------------
./checkpoints/experiment_1
Network generator_wasserstein_gan was created
Network discriminator_wasserstein_gan was created
loaded net: ./checkpoints/experiment_1/net_epoch_30_id_G.pth
Model GANimation was created
```

zzzzhuque avatar Nov 12 '18 12:11 zzzzhuque

"input_path: /home/zhutao/N_0000000203_00647.jpg" This does not look good at all. Are you training with only one image?

albertpumarola avatar Nov 12 '18 12:11 albertpumarola

No, I mean I use `bash launch/run_train.sh` to train the model. There are 200,000 images in /dataset/imgs.

Then I use `python test.py --input_path /home/zhutao/N_0000000203_00647.jpg` to test the model.

zzzzhuque avatar Nov 12 '18 12:11 zzzzhuque

And is the param data_dir in run_train.sh pointing to your dataset? While training, the code generates an events file that can be visualized with TensorBoard. Check there that the images in every batch are the ones in your dataset.
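Besides TensorBoard, a quick stdlib check that the ids file and the images folder agree can rule out path problems. The layout below (data_dir containing train_ids.csv and imgs/) is an assumption based on the options dump, not the repo's canonical loader:

```python
import csv
import os

def missing_images(data_dir, ids_file='train_ids.csv', images_folder='imgs'):
    """Return the names listed in ids_file with no matching file on disk."""
    missing = []
    with open(os.path.join(data_dir, ids_file)) as f:
        for row in csv.reader(f):
            if not row:
                continue
            name = row[0].strip()
            if not os.path.exists(os.path.join(data_dir, images_folder, name)):
                missing.append(name)
    return missing
```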

albertpumarola avatar Nov 12 '18 13:11 albertpumarola

Yes, the param data_dir in run_train.sh is pointing to my dataset, and these are the files after 30 epochs of training: [screenshot of checkpoint files]. I think I loaded net ./checkpoints/experiment_1/net_epoch_30_id_G.pth successfully when testing the model.

zzzzhuque avatar Nov 12 '18 14:11 zzzzhuque

And this is the loss during training: [screenshot of training loss curves]

zzzzhuque avatar Nov 13 '18 02:11 zzzzhuque

@ZHUTAO142857 You should add the parameter -aus to the FaceLandmarkImg command. The .csv file generated will be different! See data/prepare_au_annotations.py.
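The difference matters because only CSVs produced with `-aus` contain the AU intensity columns. Here is a minimal sketch of what extracting them looks like (my own illustration; the real logic lives in data/prepare_au_annotations.py):

```python
import csv

def au_intensities(openface_csv):
    """Collect the AU*_r (intensity) columns from an OpenFace output csv.

    OpenFace pads header names with spaces, hence the strip(). Without
    -aus these columns are simply absent and nothing is collected.
    """
    per_frame = []
    with open(openface_csv) as f:
        for row in csv.DictReader(f):
            aus = {k.strip(): float(v) for k, v in row.items()
                   if k.strip().startswith('AU') and k.strip().endswith('_r')}
            if aus:
                per_frame.append(aus)
    return per_frame
```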

njulb0125 avatar Jul 04 '19 07:07 njulb0125

> Hello, I use 1080Ti to train 200000 images, every image is 128*128. However, the speed of training is very slow, It takes me almost 4 hours to finish one epoch which is different from what you said in your paper. Could you please tell me why?

Can you share the 200,000 images with me? Thanks!

xinyuxiao avatar Feb 04 '21 08:02 xinyuxiao

I think you got the wrong email dude, sorry.


fabiantheking avatar Feb 05 '21 04:02 fabiantheking