Getting NaN lm_error while training the model

Open htktwr95 opened this issue 4 years ago • 13 comments

Getting NaN lm_error while training the model.

htktwr95 avatar Jan 01 '21 19:01 htktwr95

Hi, can you explain more about this error? When does it occur? Does it happen randomly or only on certain images? We have tested the training code on the example images as well as other face image datasets but did not find such an error.

YuDeng avatar Jan 04 '21 11:01 YuDeng

It is happening randomly. I am trying to train the model on the collection of images from the various datasets mentioned in the paper.

htktwr95 avatar Jan 05 '21 06:01 htktwr95

Hi htktwr95, can you share CoarseData with me via baiduyun? Thanks.

cyjouc avatar Jan 07 '21 09:01 cyjouc

Is it really related to the dataset? I have ~200K images in my dataset. Do I have to check these images individually? I ran the code with only 5 handpicked images (properly illuminated + some other constraints). I am getting this error again and again!

htktwr95 avatar Jan 09 '21 09:01 htktwr95

Hi, this error should not be related to the dataset. I will check whether there is a bug in the code that needs to be fixed.

YuDeng avatar Jan 11 '21 05:01 YuDeng

No, the error is due to the expression coefficients. If you regress your model over only the rest of the coefficients, the problem won't occur.

So basically it's not fixed yet.

On Wed, 31 Mar 2021, 07:29 Dean wrote:

Hello, did you fix this bug?

htktwr95 avatar Mar 31 '21 04:03 htktwr95

I have the same problem. I ran the training code on example images.

Iter: 7958; lm_loss: 2.760531 ; photo_loss: 0.079728; id_loss: 0.088676
Iter: 7959; lm_loss: 3.171836 ; photo_loss: 0.085399; id_loss: 0.094497
Iter: 7960; lm_loss: 5.258860 ; photo_loss: 0.089806; id_loss: 0.096231
Iter: 7961; lm_loss: 5.292349 ; photo_loss: 0.085974; id_loss: 0.102064
Iter: 7962; lm_loss: 4.916460 ; photo_loss: 0.092238; id_loss: 0.116507
Iter: 7963; lm_loss: 4.080110 ; photo_loss: 0.083091; id_loss: 0.102046
Iter: 7964; lm_loss: nan ; photo_loss: 0.000000; id_loss: 0.829207
Iter: 7965; lm_loss: nan ; photo_loss: 0.000000; id_loss: 0.831248
Iter: 7966; lm_loss: nan ; photo_loss: 0.000000; id_loss: 0.817078
Iter: 7967; lm_loss: nan ; photo_loss: 0.000000; id_loss: 0.849674

The error message is: InvalidArgumentError (see above for traceback): Nan in summary histogram for: ex_coeff
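
To pin down exactly when the loss turns into NaN in a long log like the one above, a small script can scan the text. This is only a minimal sketch using the Python standard library; the function name `first_nan_iteration` and the regular expression are assumptions based on the log format shown here, not part of the repository.

```python
import math
import re

# Matches entries such as "Iter: 7964; lm_loss: nan ; photo_loss: ..."
# (pattern inferred from the log format in this thread).
ENTRY = re.compile(r"Iter:\s*(\d+);\s*lm_loss:\s*(\S+)\s*;")


def first_nan_iteration(log_text):
    """Return the first iteration whose lm_loss is not finite, or None."""
    for iteration, value in ENTRY.findall(log_text):
        if not math.isfinite(float(value)):
            return int(iteration)
    return None


log = (
    "Iter: 7963; lm_loss: 4.080110 ; photo_loss: 0.083091; id_loss: 0.102046 "
    "Iter: 7964; lm_loss: nan ; photo_loss: 0.000000; id_loss: 0.829207"
)
print(first_nan_iteration(log))  # prints 7964
```

Knowing the first failing iteration makes it possible to restart from the last checkpoint before it and inspect the batch that was fed at that step.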

Looking forward to a solution. Thanks.

hyukhea avatar Apr 28 '21 00:04 hyukhea

Hi, do you use the same tf_mesh_renderer version as in the readme? I'll try to reproduce this error in my own environment.

YuDeng avatar Apr 28 '21 02:04 YuDeng

Hi, I followed the exact procedure in the readme and trained on the example images provided in this repo several times, but the NaN error did not occur.

Here are the details of my environment:
ubuntu 16.04
conda environment
python==3.6
tensorflow version==1.12.0, installed via conda install
tf_mesh_renderer branch ba27ea1798, compiled with gcc version 7.5.0 and with -D_GLIBCXX_USE_CXX11_ABI=1 set as in the readme.

I ran three different trainings with the default training settings, each up to 15k iterations, and none of them hit the NaN error.

Could you tell me your detailed environment setting?
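
When reporting an environment for a problem like this, part of the information can be collected by a short script. The following is a rough sketch only; `describe_environment` is a hypothetical helper, and it does not capture the tf_mesh_renderer revision or the compiler version, which still have to be noted by hand.

```python
import platform
import sys


def describe_environment():
    """Collect the OS, Python version, and TensorFlow version (if importable)."""
    try:
        import tensorflow as tf  # optional; only used to read its version
        tf_version = tf.__version__
    except ImportError:
        tf_version = "not installed"
    return {
        "os": platform.platform(),
        "python": "%d.%d.%d" % sys.version_info[:3],
        "tensorflow": tf_version,
    }


for key, value in describe_environment().items():
    print("%s: %s" % (key, value))
```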

YuDeng avatar Apr 28 '21 05:04 YuDeng

The details of my environment:

Windows 10
conda environment
python==3.6
tensorflow-gpu==1.13.1

The tf_mesh_renderer I used has been tested on Windows; it renders a 2D image correctly.

I found that the error occurs randomly. When I trained another 10k iterations (from iter 7900), the error didn't appear.

hyukhea avatar Apr 28 '21 06:04 hyukhea

I recommend using a Linux environment and TensorFlow 1.12, because we conducted all our experiments under this setting. I'm not sure whether tf_mesh_renderer works correctly under Windows. Although the forward rendering may seem to be correct, there might be some bugs in the backprop that lead to the NaN error.
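
A correct forward pass does not guarantee a correct backward pass. One generic way to test a backward pass is to compare the analytic gradient against central finite differences. The sketch below shows the idea in plain Python on a toy function; it is not tf_mesh_renderer code, and `toy_loss`, `toy_grad`, and `buggy_grad` are hypothetical stand-ins for a real op and its gradient.

```python
import math


def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at list x."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2.0 * eps))
    return grad


def max_gradient_error(f, analytic_grad, x, eps=1e-6):
    """Largest absolute gap between the analytic and numerical gradients.

    Returns infinity if either gradient contains a non-finite entry.
    """
    num = numerical_gradient(f, x, eps)
    ana = analytic_grad(x)
    if not all(math.isfinite(v) for v in num + ana):
        return float("inf")
    return max(abs(a - n) for a, n in zip(ana, num))


# Toy stand-in for a differentiable op: f(x) = sum(x_i^2), gradient 2 * x.
def toy_loss(x):
    return sum(v * v for v in x)


def toy_grad(x):
    return [2.0 * v for v in x]


# Deliberately broken gradient that emits NaN for the last input.
def buggy_grad(x):
    return [2.0 * v for v in x[:-1]] + [float("nan")]


point = [0.5, -1.25, 3.0]
print(max_gradient_error(toy_loss, toy_grad, point) < 1e-6)
print(max_gradient_error(toy_loss, buggy_grad, point))
```

The same comparison, applied to the renderer's outputs with respect to its inputs, would show whether the backward pass on a given platform produces wrong or non-finite values.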

YuDeng avatar Apr 28 '21 06:04 YuDeng

Thank you for your reply.

But I don't know whether @htktwr95 uses the same environment as yours.

hyukhea avatar Apr 28 '21 07:04 hyukhea

I have the same problem. The only thing I changed is the landmark detector used to produce the landmarks for each image. My environment is also Python 3.6 and TensorFlow 1.12. So could it be caused by erroneous landmarks in the data rather than by the environment?
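
One way to test the landmark hypothesis is to screen the detector output before training. Below is a minimal sketch, assuming each landmark is an (x, y) pair in pixel coordinates; the helper `invalid_landmarks` and the 224x224 image size in the example are assumptions for illustration, not values taken from the training code.

```python
import math


def invalid_landmarks(landmarks, width, height):
    """Return indices of (x, y) points that are non-finite or outside the image."""
    bad = []
    for i, (x, y) in enumerate(landmarks):
        finite = math.isfinite(x) and math.isfinite(y)
        if not finite or not (0 <= x < width and 0 <= y < height):
            bad.append(i)
    return bad


# Example: the second point is NaN and the third lies outside the image.
points = [(120.5, 88.0), (float("nan"), 90.0), (400.0, 95.0)]
print(invalid_landmarks(points, width=224, height=224))  # prints [1, 2]
```

Images for which this returns a non-empty list can be set aside to see whether the NaN still appears without them.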

qingmeizhujiu avatar Jun 28 '21 13:06 qingmeizhujiu