
loss nan

Open daeing opened this issue 2 years ago • 6 comments

My graphics card does not have 32 GB of memory. After I reduced the batch size, the loss became NaN and I couldn't get the best model. Do you have any suggestions?

daeing avatar Sep 21 '22 03:09 daeing

Please give more details. Which model are you training? Which training data are you using? What is the batch size? Did you change the learning rate?

Anyway, I never experienced a NaN loss (at least for PARSeq) for various batch sizes and learning rates I used. Try changing the learning rate, or disabling mixed precision/fp16 training.
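For reference, here is a minimal sketch of toggling mixed precision in PyTorch Lightning, which parseq's training script is built on (the `gpus` argument follows the Lightning 1.x API that was current at the time; this is an illustration, not the repo's exact code):

```python
from pytorch_lightning import Trainer

# Mixed-precision (AMP) training -- what precision=16 enables:
trainer_fp16 = Trainer(gpus=1, precision=16)

# Full fp32 training -- try this if the loss goes NaN under fp16:
trainer_fp32 = Trainer(gpus=1, precision=32)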

baudm avatar Sep 21 '22 07:09 baudm


I trained ViTSTR, and I used the datasets you provided in dataset.md. I changed the batch size to 120 and the max LR from 2e-3 to 1e-3, but I still get a NaN loss.
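(As an aside: a common heuristic when shrinking the batch size is to scale the learning rate linearly with it. This is a general rule of thumb, not something the training script necessarily applies for you; in the sketch below, the base batch size of 384 is an assumption to be replaced with whatever your 2e-3 LR was tuned for.)

```python
# Linear LR scaling rule of thumb (Goyal et al., 2017): lr_new = lr_base * bs_new / bs_base.
base_lr = 2e-3   # LR tuned for the original batch size
base_bs = 384    # assumed original batch size -- check your config
new_bs = 120
scaled_lr = base_lr * new_bs / base_bs
print(f"Suggested max LR for batch size {new_bs}: {scaled_lr:.2e}")  # 6.25e-04
```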

```python
if gpus:
    # Use mixed-precision training
    # config.trainer.precision = 16
    config.trainer.precision = 32
```

Is this the right way to disable mixed-precision/fp16 training?

daeing avatar Sep 22 '22 01:09 daeing

You could just comment out `config.trainer.precision = 16`.
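In other words, the block above would become something like this (a sketch, assuming nothing else in the `if gpus:` block depends on the precision setting; Lightning then falls back to its fp32 default):

```python
if gpus:
    # Mixed-precision training disabled; Lightning defaults to full fp32.
    # config.trainer.precision = 16
    pass
```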

Does the NaN loss happen immediately? Or only after several epochs?

baudm avatar Sep 24 '22 06:09 baudm

For me, it happens after several epochs; 16, to be exact.

However, I am seeing this while training PARSeq on an Indian language (Manipuri, which is written in the Bengali script).
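Training on a new script generally requires a matching charset config. Below is a hypothetical sketch for a Bengali-script charset; the key name follows the pattern of the repo's bundled charset configs under configs/charset/, so verify it against your checkout:

```yaml
# Hypothetical charset override for a Bengali-script language (not from the repo).
# Note: characters missing from the training charset are typically filtered out
# of labels, which can silently shorten or empty them.
model:
  charset_train: "অআইঈউঊঋএঐওঔকখগঘঙচছজঝঞটঠডঢণতথদধনপফবভমযরলশষসহ০১২৩৪৫৬৭৮৯"
```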

I will try commenting out `config.trainer.precision = 16`.

One problem I have observed is an incorrect `max_length` setting.
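A quick sanity check along these lines (a hypothetical helper, not part of the repo): confirm that the longest label in your training set fits within `model.max_label_length` (25 by default), since over-length labels can be dropped or truncated by the data pipeline.

```python
# Hypothetical sanity check: does model.max_label_length cover the training labels?
def check_max_length(labels, max_label_length=25):
    longest = max(labels, key=len)
    print(f"Longest label: {len(longest)} chars ({longest!r}); limit: {max_label_length}")
    return len(longest) <= max_label_length

# Usage (labels would be read from your LMDB ground truth):
assert check_max_length(["example", "another-label"], 25)
```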

Could anything else be the reason for this?

harshlunia7 avatar Nov 23 '22 12:11 harshlunia7

@harshlunia7 were you able to solve this NaN loss issue?

rajeevbaalwan avatar Mar 30 '23 05:03 rajeevbaalwan

@rajeevbaalwan Yes, removing the precision setting when using CUDA solved the issue for me. I also modified the following options in the main.yaml config file:

- `data.remove_whitespace: false`
- `data.normalize_unicode: false`
- `data.augment: false`
- `model.batch_size: 128`
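Since parseq's train.py is Hydra-based, the same overrides can also be passed on the command line instead of editing main.yaml. A sketch, with key names taken from the list above and assuming the stock config layout:

```
./train.py model.batch_size=128 data.remove_whitespace=false data.normalize_unicode=false data.augment=false
```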

harshlunia7 avatar Jul 21 '23 09:07 harshlunia7