About camera_param embedding and "loss is NaN"

Open Bowa529 opened this issue 1 year ago • 11 comments

Hello, I am replacing the nuScenes dataset with my own dataset for experiments. When I feed camera params into the model as conditions, I always hit "loss is NaN" after training for a few thousand steps. I noticed that model_pred contains many Inf values after the UNet. Could the embedding of the camera params be the cause? My camera parameters (focal length, principal point, rotation, and translation) differ from those of the nuScenes dataset.

Bowa529 avatar Dec 14 '24 09:12 Bowa529

I also printed the input, and it contains no Inf. Feeding only the text prompt and the BEV map into the model works fine; the problem only appears when I additionally input my camera params.

Bowa529 avatar Dec 14 '24 09:12 Bowa529

The limited generalization of camera params is a known issue, as presented in our work MagicDrive3D. However, we have tried some parameters different from nuScenes and did not observe NaN or Inf (although the results were not satisfactory).

NaN during training can have many causes. You may refer to some previous issues for solutions.

Besides, if you think the camera pose embedding is the key reason: in our latest work, we implemented a "base_token" + "zero_proj" module to mitigate such issues for any token in the sequence embeddings (we did not include this part in the paper, as it may not be useful when training from scratch). Please check https://github.com/flymin/MagicDriveDiT/blob/d537ecfbf7d83af4518b6509c8b99c4c467c8264/magicdrivedit/models/magicdrive/magicdrive_stdit3.py#L999
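
For intuition, here is a minimal PyTorch sketch of the idea (class and parameter names are illustrative, not the actual MagicDriveDiT code; see the link above for the real implementation):

```python
import torch
import torch.nn as nn


class BaseTokenWithZeroProj(nn.Module):
    """Sketch: a learnable base token plus a zero-initialized projection."""

    def __init__(self, embed_dim: int, cond_dim: int):
        super().__init__()
        # a shared, always-in-distribution token
        self.base_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        # zero init: the conditioning contributes nothing at step 0
        self.zero_proj = nn.Linear(cond_dim, embed_dim)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        # cond: (B, N, cond_dim) -> token: (B, N, embed_dim)
        return self.base_token + self.zero_proj(cond)
```

Because the projection starts at zero, the conditional branch contributes nothing at the beginning of training, which helps keep the early updates finite even if the raw camera params are out of distribution.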

flymin avatar Dec 14 '24 13:12 flymin

When handling two Field of View (FOV) perspectives, I tried training with camera parameters for both views separately. One perspective trains normally, but the other still encounters the "loss is NaN" issue. If there are significant differences in camera intrinsics like focal length, do you think the camera parameter encoding process needs to be modified? Or do you have any other suggestions? Thanks for your help.

Bowa529 avatar Dec 17 '24 02:12 Bowa529

Please consider the "base_token" + "zero_proj" module I mentioned above (flymin/MagicDriveDiT@d537ecf/magicdrivedit/models/magicdrive/magicdrive_stdit3.py#L999); zero initialization should help stabilize the training process.

flymin avatar Dec 17 '24 03:12 flymin

Sure, thank you. I will try the methods you suggested.

Bowa529 avatar Dec 17 '24 06:12 Bowa529

I'm sorry, I still have a question: why do you concatenate the original input with its sine and cosine encodings to form camera_emb in your camera parameter encoding process? Thanks for your help.

Bowa529 avatar Dec 17 '24 08:12 Bowa529

I think you are talking about the Fourier Embedding, which we borrowed from NeRF. You can find the citation (Mildenhall et al., 2020) in our paper.
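
For reference, a minimal sketch of such a NeRF-style Fourier embedding (the function name and frequency schedule here are illustrative, not the exact MagicDrive code). Keeping the raw input alongside the sin/cos features preserves the coarse, low-frequency signal, while the sinusoids add the high-frequency detail:

```python
import torch


def fourier_embed(x: torch.Tensor, num_freqs: int = 4, include_input: bool = True) -> torch.Tensor:
    """x: (..., D) raw parameters -> (..., D * (1 + 2 * num_freqs)) if include_input."""
    # frequency bands 1, 2, 4, ...
    freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype, device=x.device)
    scaled = x[..., None] * freqs                       # (..., D, num_freqs)
    feats = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1).flatten(-2)
    if include_input:
        feats = torch.cat([x, feats], dim=-1)           # keep the raw values too
    return feats
```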

flymin avatar Dec 17 '24 08:12 flymin

This issue is stale because it has been open for 7 days with no activity. If you do not have any follow-ups, the issue will be closed soon.

github-actions[bot] avatar Dec 24 '24 16:12 github-actions[bot]

I'm sorry, I still have a question: why do you concatenate the original input with its sine and cosine encodings to form camera_emb in your camera parameter encoding process? Thanks for your help.

Did you solve that? Thanks.

3buffers avatar Dec 29 '24 10:12 3buffers

Actually, I think it is due to a data type issue. For instance, if you replace nuScenes with your own dataset, check your bboxes.dtype: if you keep float16, the fourier_embedder (specifically its embed_fns) can overflow and turn your box embeddings into Inf or NaN. You should use a larger dtype like float32 or float64, I think.
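
For illustration, a quick check of this overflow (the coordinate value and frequency band below are made up): multiplying a large box coordinate by the highest Fourier frequency can exceed the float16 range (about 65504) before sin/cos are even applied, producing Inf that later propagates into NaN.

```python
import torch

coord_fp16 = torch.tensor([1500.0], dtype=torch.float16)  # hypothetical far-away box coordinate
coord_fp32 = coord_fp16.to(torch.float32)

max_freq = 2.0 ** 6  # hypothetical highest frequency band of the Fourier embedder

print(coord_fp16 * max_freq)  # tensor([inf], dtype=torch.float16) -> overflow
print(coord_fp32 * max_freq)  # tensor([96000.]) -> fine in float32
```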

3buffers avatar Dec 31 '24 02:12 3buffers

This issue is stale because it has been open for 7 days with no activity. If you do not have any follow-ups, the issue will be closed soon.

github-actions[bot] avatar Jan 07 '25 16:01 github-actions[bot]