How long does the training take?
On my 1080 Ti one epoch takes approximately 2-2.5 hours, and you need 30-35 epochs to finish, i.e. roughly 2.5-3.5 days in total.
My training loss is larger than the loss you reported in https://github.com/michalfaber/keras_Realtime_Multi-Person_Pose_Estimation/issues/45. It is still 740+ after 10 epochs. What could be causing it?
To be honest, no idea. Which project are you training? The absolute value of the loss differs between michalfaber's project and mine due to different hdf5 content, but 740+ is a big loss for both my version and Michal's.
I use your latest project. The setting that differs may be the 2 GPUs used for training.
Did you scale the batch size for the 2 GPUs? If yes, you may need to scale the learning rate too.
How should the LR be scaled for multi-GPU, in your experience? Just multiply the LR by the number of GPUs?
According to https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf, multiply according to batch size, i.e. if you raise the batch size from 10 to 20, you should multiply the LR by two. But I haven't tested this project with this setup.
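A minimal sketch of that linear scaling rule (all numbers and names below are illustrative, not taken from the repo):

```python
# Linear scaling rule from the linked paper: scale the LR by the same
# factor as the batch size. Values here are illustrative only.
base_lr = 2e-5          # LR tuned for the original batch size
base_batch_size = 10    # batch size that base_lr was tuned for
new_batch_size = 20     # e.g. doubled when moving to 2 GPUs

scaled_lr = base_lr * new_batch_size / base_batch_size
print(scaled_lr)        # 4e-05, i.e. twice the base LR
```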
I did not. It seems to be the reason.
BTW, did 2 GPUs train significantly faster in your case? In my setup the GPU load (watch nvidia-smi) just jumps from one card to the other, but my GPUs are different (1080 Ti and 1080) and have no NVLink, so I've just been training different models on them.
Yes, it is 1.7x faster. But the utilization of the two GPUs is not always high at the same time.
@anatolix That may be because the data augmentation is done in Python.
The data feeding is too slow when using this code for multi-GPU training.
py_rpme_server_tester.py can test the speed of the augmentation; it is approximately 140 images per second on my machine (although the hdf5 file should be on an SSD for that), and this is 5 times faster than the C++ implementation. It is far more than we really need for training (~10 per second per GPU). I think the problem is the Keras multi-GPU implementation; it is really new and unfinished.
To be sure, I've just committed a speed test inside train_pose: https://github.com/anatolix/keras_Realtime_Multi-Person_Pose_Estimation/commit/7e79fd36a5f65131756760e3ef5e80f130fbe0f6#diff-5f9553ab64c88cb242f0b55068ca2e49
On my server it says:

batches per second 5.786476637952872
batches per second 5.7619510163686
batches per second 5.842369421224827
batches per second 5.962092320266882
batches per second 5.999656360337656
batches per second 5.951338023827906
batches per second 5.9165966302952695
batches per second 5.906818176697108
batches per second 5.940744568724261
batches per second 5.967964646505151
batches per second 5.970570172200173
batches per second 5.940416025591697
batches per second 5.929933008772442
batches per second 5.9478481273904835
batches per second 5.9353772224932175
batches per second 5.939926683901685
batches per second 5.862215485602886
batches per second 5.87035626635639
batches per second 5.798390861812536
batches per second 5.78362199792317
batches per second 5.7078112813578095
batches per second 5.7466899871438635
batches per second 5.768000631491158
batches per second 5.733300557500513
I.e. it is enough for approximately 6 cards. This is with parallel model training running, i.e. the augmentation was the second process on this server.
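For anyone who wants to reproduce this kind of measurement on their own generator, a minimal sketch (the helper and the generator name are hypothetical, not repo code):

```python
import time

def batches_per_second(gen, warmup=5, iters=50):
    # Pull a few batches first so one-time setup cost is excluded.
    for _ in range(warmup):
        next(gen)
    start = time.time()
    for _ in range(iters):
        next(gen)
    return iters / (time.time() - start)

# usage: print("batches per second", batches_per_second(my_batch_gen))
```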
I trained a small model according to the prototxt of the original, and it took about 3 days; however, the result is nearly the same as the original's. It is only half the size of the paper's model and 2x faster.
@Ai-is-light what is your result on COCO?
@anatolix Yes, Keras does not support multi-GPU well; the bug with saving a multi-GPU model has not been solved for months. Hoping for a TensorFlow version.
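A commonly suggested workaround for that saving bug (a sketch, assuming keras.utils.multi_gpu_model from Keras >= 2.0.9 and at least 2 visible GPUs; the tiny model is a stand-in for the real one) is to keep a reference to the single-GPU template and save that, not the parallel wrapper:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

template_model = Sequential([Dense(1, input_shape=(10,))])  # stand-in model
parallel_model = multi_gpu_model(template_model, gpus=2)
parallel_model.compile(optimizer='sgd', loss='mean_squared_error')
# ... train parallel_model as usual ...

# The wrapper shares weights with the template, so saving the template
# sidesteps the unresolved bug in saving the multi-GPU wrapper itself.
template_model.save_weights('weights.h5')
```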
Keras actually runs on top of TensorFlow; you can use all TF code with Keras as well.
@tranorrepository It is shown as follows.
How about you?
I find that if the PAF branch doesn't work well, or doesn't work at all, the mAP and AR cannot be computed from the output of the network. Did you run into this, @anatolix?
@tranorrepository @anatolix @Ai-is-light Hi everyone, if I use multi-GPU and double the batch size, do I need to change the learning rate (i.e. ×2) accordingly? The multi-GPU model in Keras duplicates the model and splits the input data evenly, so I wonder if we really need to change the base (original) learning rate.
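A toy NumPy demonstration of why the even split itself changes nothing (illustrative only): averaging the per-GPU gradients of two equal sub-batches gives exactly the full-batch gradient, so what matters for the LR is the total batch size you doubled.

```python
import numpy as np

grads = np.random.randn(4, 10)                 # per-example gradients, B = 4
full_batch_grad = grads.mean(axis=0)           # single-GPU update direction
per_gpu = [grads[:2].mean(axis=0),             # "GPU 0" sub-batch gradient
           grads[2:].mean(axis=0)]             # "GPU 1" sub-batch gradient
print(np.allclose(full_batch_grad, np.mean(per_gpu, axis=0)))  # True
```

So per the linear scaling rule discussed above, if you doubled the total batch size, the suggestion would be to double the base LR too.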
This may be naive, but what parameter controls the learning rate, and where can I change it?
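Look for where the optimizer is constructed in the training script (train_pose.py in these forks). Below is only a hedged sketch of the usual Keras pattern, with assumed names and values rather than code copied from the repo:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

base_lr = 2e-5   # assumed value; the lr argument is the knob to change
model = Sequential([Dense(1, input_shape=(10,))])   # stand-in model
model.compile(optimizer=SGD(lr=base_lr, momentum=0.9),
              loss='mean_squared_error')
```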