How long does the training take?
On my 1080 Ti one epoch takes approximately 2-2.5 hours, and you need 30-35 epochs to finish, i.e. roughly 2.5-3.5 days in total.
My training loss is larger than the loss you reported in https://github.com/michalfaber/keras_Realtime_Multi-Person_Pose_Estimation/issues/45. It is still 740+ after 10 epochs. What could be causing it?
To be honest, no idea. Which project are you training? The absolute value of the loss differs between michalfaber's project and mine due to different hdf5 content, but 740+ is a big loss for both my version and Michal's.
I use your latest project. The setting that differs may be the 2 GPUs used for training.
Did you scale the batch size for the 2 GPUs? If yes, you may need to scale the learning rate too.
How should the LR be scaled for multi-GPU, in your experience? Just multiply the LR by the number of GPUs?
According to https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf, multiply according to batch size, i.e. if you raise the batch size from 10 to 20, you should multiply the LR by two. But I haven't tested this project with this setup.
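A minimal sketch of that linear scaling rule (all numbers and names below are illustrative, not taken from the repo):

```python
# Linear scaling rule from the linked paper: scale the LR by the same
# factor as the batch size. Values here are illustrative only.
base_lr = 2e-5          # LR tuned for the original batch size
base_batch_size = 10    # batch size that base_lr was tuned for
new_batch_size = 20     # e.g. doubled when moving to 2 GPUs

scaled_lr = base_lr * new_batch_size / base_batch_size
print(scaled_lr)        # 4e-05, i.e. twice the base LR
```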
I did not. It seems to be the reason.
BTW, did 2 GPUs train significantly faster in your case? In my setup the GPU load (watch nvidia-smi) just jumps from one card to the other, but my GPUs are different (1080 Ti and 1080) and have no NVLink, so I've just been training different models on them.
Yes, it is 1.7x faster. But the utilization of the two GPUs is not always high at the same time.
@anatolix That may be because the data augmentation is done in Python.
The data feeding is too slow when using this code for multi-GPU training.
py_rpme_server_tester.py can test the speed of the augmentation; it is approximately 140 images per second on my machine (although the hdf5 file should be on an SSD for that), and this is 5 times faster than the C++ implementation. It is far more than we really need for training (~10 per second per GPU). I think the problem is the Keras multi-GPU implementation; it is really new and unfinished.
To be sure, I've just committed a speed test inside train_pose: https://github.com/anatolix/keras_Realtime_Multi-Person_Pose_Estimation/commit/7e79fd36a5f65131756760e3ef5e80f130fbe0f6#diff-5f9553ab64c88cb242f0b55068ca2e49
On my server it says:

batches per second 5.786476637952872
batches per second 5.7619510163686
batches per second 5.842369421224827
batches per second 5.962092320266882
batches per second 5.999656360337656
batches per second 5.951338023827906
batches per second 5.9165966302952695
batches per second 5.906818176697108
batches per second 5.940744568724261
batches per second 5.967964646505151
batches per second 5.970570172200173
batches per second 5.940416025591697
batches per second 5.929933008772442
batches per second 5.9478481273904835
batches per second 5.9353772224932175
batches per second 5.939926683901685
batches per second 5.862215485602886
batches per second 5.87035626635639
batches per second 5.798390861812536
batches per second 5.78362199792317
batches per second 5.7078112813578095
batches per second 5.7466899871438635
batches per second 5.768000631491158
batches per second 5.733300557500513
I.e. it is enough for approximately 6 cards. This is with parallel model training running, i.e. the augmentation was the second process on this server.
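For anyone who wants to reproduce this kind of measurement on their own generator, a minimal sketch (the helper and the generator name are hypothetical, not repo code):

```python
import time

def batches_per_second(gen, warmup=5, iters=50):
    # Pull a few batches first so one-time setup cost is excluded.
    for _ in range(warmup):
        next(gen)
    start = time.time()
    for _ in range(iters):
        next(gen)
    return iters / (time.time() - start)

# usage: print("batches per second", batches_per_second(my_batch_gen))
```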
I trained a small model according to the prototxt of the original, and it took about 3 days; however, the result is nearly the same as the original's. It is only half the size of the paper's model and 2x faster.
@Ai-is-light what is your result on COCO?
@anatolix Yes, Keras does not support multi-GPU well; the bug with saving a multi-GPU model has not been solved for months. Hoping for a TensorFlow version.
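A commonly suggested workaround for that saving bug (a sketch, assuming keras.utils.multi_gpu_model from Keras >= 2.0.9 and at least 2 visible GPUs; the tiny model is a stand-in for the real one) is to keep a reference to the single-GPU template and save that, not the parallel wrapper:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

template_model = Sequential([Dense(1, input_shape=(10,))])  # stand-in model
parallel_model = multi_gpu_model(template_model, gpus=2)
parallel_model.compile(optimizer='sgd', loss='mean_squared_error')
# ... train parallel_model as usual ...

# The wrapper shares weights with the template, so saving the template
# sidesteps the unresolved bug in saving the multi-GPU wrapper itself.
template_model.save_weights('weights.h5')
```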
Keras actually runs on top of TensorFlow; you can use all TF code with Keras as well.
@tranorrepository It is shown as follows.
How about you?
I find that if the PAF branch doesn't work well, or doesn't work at all, the mAP and AR cannot be computed from the output of the network. Did you run into this, @anatolix?
@tranorrepository @anatolix @Ai-is-light Hi everyone, if I use multi-GPU and double the batch size, do I need to change the learning rate (i.e. ×2) accordingly? The multi-GPU model in Keras duplicates the model and splits the input data evenly, so I wonder if we really need to change the base (original) learning rate.
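A toy NumPy demonstration of why the even split itself changes nothing (illustrative only): averaging the per-GPU gradients of two equal sub-batches gives exactly the full-batch gradient, so what matters for the LR is the total batch size you doubled.

```python
import numpy as np

grads = np.random.randn(4, 10)                 # per-example gradients, B = 4
full_batch_grad = grads.mean(axis=0)           # single-GPU update direction
per_gpu = [grads[:2].mean(axis=0),             # "GPU 0" sub-batch gradient
           grads[2:].mean(axis=0)]             # "GPU 1" sub-batch gradient
print(np.allclose(full_batch_grad, np.mean(per_gpu, axis=0)))  # True
```

So per the linear scaling rule discussed above, if you doubled the total batch size, the suggestion would be to double the base LR too.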
This may be naive, but what parameter controls the learning rate, and where can I change it?
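Look for where the optimizer is constructed in the training script (train_pose.py in these forks). Below is only a hedged sketch of the usual Keras pattern, with assumed names and values rather than code copied from the repo:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

base_lr = 2e-5   # assumed value; the lr argument is the knob to change
model = Sequential([Dense(1, input_shape=(10,))])   # stand-in model
model.compile(optimizer=SGD(lr=base_lr, momentum=0.9),
              loss='mean_squared_error')
```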