Deep3DFaceReconstruction
Getting NaN lm_error while training the model
Getting NaN lm_error while training the model.
Hi, can you explain more about this error? When does it occur? Does it happen randomly or only on certain images? We have tested the training code on the example images as well as on other face image datasets but did not find such an error.
It is happening randomly. I am trying to train the model on the collection of images from the various datasets mentioned in the paper.
Hi htktwr95, can you share CoarseData with me via baiduyun? Thanks.
Is it really related to the dataset? I have ~200K images in my dataset. Do I have to check these images individually? I ran the code with only 5 handpicked images (properly illuminated, plus some other constraints), and I get this error again and again!
Hi, this error should not be related to the dataset. I will check whether there is a bug in the code that needs to be fixed.
No, the error is due to the expression coefficients. If you regress only the rest of the coefficients, the problem does not occur.
So basically it's not fixed yet.
On Wed, 31 Mar 2021, 07:29 Dean wrote:
Hello, did you fix this bug?
I have the same problem. I ran the training code on example images.
Iter: 7958; lm_loss: 2.760531 ; photo_loss: 0.079728; id_loss: 0.088676
Iter: 7959; lm_loss: 3.171836 ; photo_loss: 0.085399; id_loss: 0.094497
Iter: 7960; lm_loss: 5.258860 ; photo_loss: 0.089806; id_loss: 0.096231
Iter: 7961; lm_loss: 5.292349 ; photo_loss: 0.085974; id_loss: 0.102064
Iter: 7962; lm_loss: 4.916460 ; photo_loss: 0.092238; id_loss: 0.116507
Iter: 7963; lm_loss: 4.080110 ; photo_loss: 0.083091; id_loss: 0.102046
Iter: 7964; lm_loss: nan ; photo_loss: 0.000000; id_loss: 0.829207
Iter: 7965; lm_loss: nan ; photo_loss: 0.000000; id_loss: 0.831248
Iter: 7966; lm_loss: nan ; photo_loss: 0.000000; id_loss: 0.817078
Iter: 7967; lm_loss: nan ; photo_loss: 0.000000; id_loss: 0.849674
The error message is:
InvalidArgumentError (see above for traceback): Nan in summary histogram for: ex_coeff
Looking forward to a solution. Thanks.
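For anyone trying to pin down when the losses first break, here is a minimal sketch that scans a log in the format shown above and reports the first iteration with a NaN or infinite loss. The function name first_non_finite_iter is made up for illustration and is not part of the repository; it only assumes the "Iter: ...; lm_loss: ... ; photo_loss: ...; id_loss: ..." layout printed above.

```python
import math
import re

# Matches entries such as
# "Iter: 7964; lm_loss: nan ; photo_loss: 0.000000; id_loss: 0.829207".
# The layout is taken from the log above; other versions may print differently.
ENTRY = re.compile(
    r"Iter:\s*(\d+);\s*lm_loss:\s*(\S+)\s*;"
    r"\s*photo_loss:\s*(\S+?);\s*id_loss:\s*(\S+)"
)


def first_non_finite_iter(log_text):
    """Return the first iteration whose losses include NaN or inf, or None."""
    for match in ENTRY.finditer(log_text):
        losses = [float(value) for value in match.groups()[1:]]
        if not all(math.isfinite(loss) for loss in losses):
            return int(match.group(1))
    return None
```

Applied to the log above, this should point at iteration 7964, which makes it easier to compare the batch or checkpoint just before the failure across runs.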
Hi, do you use the same tf_mesh_renderer version as in the readme? I'll try to reproduce this error on my own environment.
Hi, I followed exactly the procedure in the readme and trained on the example images provided in this repo several times, but the NaN error did not occur.
The details of my environment are as follows:
ubuntu 16.04
conda environment
python==3.6
tensorflow version==1.12.0, installed via conda install
tf_mesh_renderer branch ba27ea1798, compiled with gcc version 7.5.0 and with -D_GLIBCXX_USE_CXX11_ABI=1 set as in the readme.
I ran three separate trainings with the default training settings, each up to 15k iterations, and none of them produced a NaN error.
Could you tell me your detailed environment setting?
The details of my environment:
Windows 10
conda environment
python==3.6
tensorflow-gpu==1.13.1
The tf_mesh_renderer I used has been tested on Windows; it renders a 2D image correctly.
I found that the error occurs randomly. When I trained another 10k iterations (from iter 7900), the error didn't appear.
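Since the failure is intermittent and resuming from an earlier iteration worked, one option is to stop at the first non-finite loss so the last good iteration is known. The following is only a sketch under assumptions: run_until_non_finite and step_fn are hypothetical names, and step_fn stands in for whatever runs one training step and returns the losses as a dict of floats.

```python
import math


def run_until_non_finite(step_fn, start_iter, num_iters):
    """Call step_fn(i) for each iteration and stop at the first non-finite loss.

    step_fn is a hypothetical callable returning a dict of float losses,
    e.g. {"lm_loss": 2.76, "photo_loss": 0.08, "id_loss": 0.09}.
    Returns (last_good_iter, failed_iter); failed_iter is None if all
    steps produced finite losses.
    """
    last_good = None
    for i in range(start_iter, start_iter + num_iters):
        losses = step_fn(i)
        if not all(math.isfinite(value) for value in losses.values()):
            return last_good, i
        last_good = i
    return last_good, None
```

Stopping early like this avoids training on for thousands of iterations with NaN weights; the run can then be restarted from a checkpoint saved at or before the last good iteration.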
I recommend using a Linux environment and tensorflow 1.12, because we conducted all the experiments under this setting. I'm not sure whether tf_mesh_renderer works correctly under Windows. Although the forward rendering may seem correct, there might be bugs in the backprop that lead to the NaN error.
Thank you for your reply.
But I don't know whether @htktwr95 uses the same environment as yours.
I have the same problem. The only thing I changed is the landmark detector that provides the landmarks for each image. My environment is also python 3.6 and tensorflow 1.12. So could it be caused by erroneous landmarks in the data rather than by the environment?
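If bad landmarks are a suspect, a quick sanity check over the detector output before training may help rule them in or out. This is a minimal sketch, not code from the repository: invalid_landmarks is a hypothetical helper, and the default of 68 points per face is an assumption that should be changed to match whatever the detector and training code actually use.

```python
import math


def invalid_landmarks(landmarks, width, height, expected_count=68):
    """Return a list of problems found; an empty list means the set looks usable.

    landmarks is a sequence of (x, y) pixel coordinates.
    expected_count=68 is an assumption; adjust it to your landmark format.
    """
    problems = []
    if len(landmarks) != expected_count:
        problems.append(f"expected {expected_count} points, got {len(landmarks)}")
    for index, (x, y) in enumerate(landmarks):
        if not (math.isfinite(x) and math.isfinite(y)):
            problems.append(f"point {index} is not finite: ({x}, {y})")
        elif not (0 <= x < width and 0 <= y < height):
            problems.append(f"point {index} lies outside the image: ({x}, {y})")
    return problems
```

Running this over every image and logging the files with a non-empty result would show whether the new detector produces NaN, missing, or out-of-frame points on some faces.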