parseq
NaN loss
My graphics card does not have 32 GB of memory. After I changed the batch size to a smaller value, the loss became NaN and I couldn't get the best model. Do you have any suggestions?
Please give more details. Which model are you training? Which training data are you using? What is the batch size? Did you change the learning rate?
Anyway, I never experienced a NaN loss (at least for PARSeq) for various batch sizes and learning rates I used. Try changing the learning rate, or disabling mixed precision/fp16 training.
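For reference, here is a minimal sketch of what forcing full precision looks like when the training loop is driven by PyTorch Lightning. parseq sets this through its Hydra config rather than by constructing the Trainer directly, so treat the argument values below as assumptions to adapt to your setup:

```python
import pytorch_lightning as pl

# Sketch only: precision=16 enables fp16 mixed precision; precision=32
# (the default) disables it. Gradient clipping is another common guard
# against exploding losses.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=32,           # full precision instead of fp16 mixed precision
    gradient_clip_val=20,   # assumption: pick a clip value suited to your setup
)
```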
I trained ViTSTR and used the datasets you provided in the dataset.md file. I changed the batch size to 120 and the max LR from 2e-3 to 1e-3, but I still get a NaN loss.
```python
if gpus:
    # Use mixed-precision training
    # config.trainer.precision = 16
    config.trainer.precision = 32
```
Is this the right way to disable mixed precision/fp16 training?
You could just comment out `config.trainer.precision = 16`.
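If it is unclear which operation first produces the NaN, autograd's anomaly detection can help locate it. This is a generic PyTorch/Lightning debugging sketch, not part of parseq itself:

```python
import torch
import pytorch_lightning as pl

# Sketch only: anomaly detection makes the backward pass raise an error at the
# first op that yields NaN/Inf, which narrows down where the loss blows up.
# It slows training noticeably, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

# Recent PyTorch Lightning versions expose the same switch on the Trainer.
trainer = pl.Trainer(precision=32, detect_anomaly=True)
```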
Does the NaN loss happen immediately? Or after several epochs only?
For me, it happens after several epochs, 16 to be exact.
However, I am seeing this while training PARSeq on an Indian language (Manipuri, which uses a script related to Bengali).
I will try commenting out `config.trainer.precision = 16`.
One problem which I have observed is an incorrect max_length.
Could anything else be the reason for this?
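On the max_length point: it can be worth checking the longest label in your training data before setting the model's maximum label length (assumed here to be the model.max_label_length key from the stock configs). A rough sketch for a plain "image&lt;TAB&gt;label" ground-truth file follows; parseq itself uses LMDB datasets, so adapt the reading logic as needed:

```python
def max_label_length(gt_file: str, delimiter: str = "\t") -> int:
    """Return the length of the longest label in a '<image><delimiter><label>' file."""
    longest = 0
    with open(gt_file, encoding="utf-8") as f:
        for line in f:
            # assumption: one "<image path><delimiter><label>" entry per line
            _, label = line.rstrip("\n").split(delimiter, maxsplit=1)
            longest = max(longest, len(label))
    return longest


if __name__ == "__main__":
    # hypothetical file name; point this at your own ground-truth file
    print(max_label_length("train_gt.txt"))
```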
@harshlunia7 were you able to solve this NaN loss issue?
@rajeevbaalwan Yes, removing the precision setting when using CUDA solved the issue for me.
I modified the following options in the config file main.yaml as well:
data.remove_whitespace: false
data.normalize_unicode: false
data.augment: false
model.batch_size: 128
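For anyone who wants to inspect the merged config with these overrides applied, Hydra's compose API can do that without launching a training run. This is a sketch that assumes the stock repository layout (configs under configs/ with main.yaml as the primary config) and Hydra 1.2 or newer for the version_base argument:

```python
from hydra import compose, initialize

# Sketch only: compose the training config with the overrides listed above
# and print one value to confirm they took effect.
with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="main",
        overrides=[
            "data.remove_whitespace=false",
            "data.normalize_unicode=false",
            "data.augment=false",
            "model.batch_size=128",
        ],
    )
    print(cfg.model.batch_size)  # expect 128
```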