
Met NaN after 150000 iterations

[Open] JudgeLJX opened this issue 1 year ago · 2 comments

Hello,

Thanks for your work.

May I ask, have you encountered this problem when training with train_gen.py?

```
[lgan_mmd-CD] nan [lgan_cov-CD] 0.24250001 [lgan_mmd_smp-CD] nan
Traceback (most recent call last):
  File "train_gen.py", line 222, in <module>
    test(it)
  File "train_gen.py", line 185, in test
    jsd = jsd_between_point_cloud_sets(gen_pcs.cpu().numpy(), ref_pcs.cpu().numpy())
  File "/home2/diffusion-point-cloud/evaluation/evaluation_metrics.py", line 260, in jsd_between_point_cloud_sets
    sample_pcs, resolution, in_unit_sphere)[1]
  File "/home2/diffusion-point-cloud/evaluation/evaluation_metrics.py", line 291, in entropy_of_occupancy_grid
    _, indices = nn.kneighbors(pc)
  File "/home2/miniconda3/envs/dpm-pc-gen/lib/python3.7/site-packages/sklearn/neighbors/_base.py", line 670, in kneighbors
    X = check_array(X, accept_sparse='csr')
  File "/home2/miniconda3/envs/dpm-pc-gen/lib/python3.7/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home2/miniconda3/envs/dpm-pc-gen/lib/python3.7/site-packages/sklearn/utils/validation.py", line 721, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/home2/miniconda3/envs/dpm-pc-gen/lib/python3.7/site-packages/sklearn/utils/validation.py", line 106, in _assert_all_finite
    msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
```

I think the training will not stop until we stop it manually, because the maximum number of iterations is set to inf. However, generating and evaluating samples with 150000.pt fails as shown above.
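For reference, the ValueError comes from sklearn rejecting non-finite inputs, which suggests the generated point clouds themselves already contain NaN/Inf. A minimal guard like the sketch below (assuming gen_pcs and ref_pcs are the (N, num_points, 3) tensors built in test() of train_gen.py, as in the traceback) would report that before the evaluation crashes:

```python
import torch
from evaluation.evaluation_metrics import jsd_between_point_cloud_sets  # module path taken from the traceback

def safe_jsd(gen_pcs, ref_pcs):
    """Compute JSD only if the generated clouds are finite (sketch, not the repo's code).

    Assumes gen_pcs / ref_pcs are (N, num_points, 3) tensors, as produced by
    the sampling loop in train_gen.py's test().
    """
    finite = torch.isfinite(gen_pcs).all(dim=-1).all(dim=-1)  # one bool per generated cloud
    num_bad = int((~finite).sum())
    if num_bad > 0:
        print(f'[WARN] {num_bad}/{gen_pcs.size(0)} generated clouds contain NaN/Inf; '
              'the checkpoint (or its sampling) is producing bad values.')
        return None
    return jsd_between_point_cloud_sets(gen_pcs.cpu().numpy(), ref_pcs.cpu().numpy())
```

If the warning fires, the problem is in the checkpoint weights or the sampling step, not in the evaluation code.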

Best Wishes

JudgeLJX · Jan 12 '24, 16:01

Hello, I have encountered the same problem. May I ask whether you have found out why the NaN occurs?

Anonymous-AAAI-project · Feb 06 '24, 09:02

Usually NaN errors during model training occur when some value underflows to 0 and then gets used as a divisor (or fed to a log/sqrt), which produces Inf or NaN. My guess is that because your model has been training for so long, some value somewhere has hit 0 and broken the training.
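To catch this before a corrupted checkpoint gets saved, a minimal sketch of a guarded training step might look like the following (model, optimizer, get_loss, batch, and it are placeholders, not the actual train_gen.py code):

```python
import torch

def training_step(model, optimizer, batch, it, get_loss, max_grad_norm=10.0):
    """One training step with NaN guards (illustrative sketch only)."""
    loss = get_loss(model, batch)
    if not torch.isfinite(loss):
        raise RuntimeError(f'Loss became non-finite at iteration {it}; '
                           'stop and restart from the last good checkpoint.')

    optimizer.zero_grad()
    loss.backward()

    # Clipping keeps a single exploding gradient from pushing weights to Inf/NaN.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    if not torch.isfinite(grad_norm):
        raise RuntimeError(f'Gradient norm became non-finite at iteration {it}.')

    optimizer.step()
    return loss.item()
```

With a check like this, the run aborts at the first non-finite loss or gradient instead of silently continuing and writing a checkpoint full of NaNs.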

GunnerStone · Apr 05 '24, 20:04