fsdl-text-recognizer-2021-labs
Lab 3 - base.py Accuracy.update() Error:
System specs: XPS 13, Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz (1.99 GHz), NVIDIA GeForce RTX 2080 Super
Problem
Running the following:
python training/run_experiment.py --max_epochs=10 --gpus=1 --num_workers=4 --data_class=EMNISTLines --min_overlap=0 --max_overlap=0 --model_class=LineCNNSimple --window_width=28 --window_stride=28
Results in the following error:
ValueError: Probabilities in `preds` must sum up to 1 across the `C` dimension.
Solution
I managed to track the error down to the update() function of the Accuracy class in base.py.
The offending line is:
preds = torch.nn.functional.softmax(preds, dim=-1)
where the dim=-1 parameter is causing this ValueError. Setting it to dim=1 solves the issue and allows training to take place.
I don't fully understand why this is the case or why this error presented in the first place. Any guidance would be appreciated!
Stumbled across this myself, just created a PR to fix it.
The reason for the problem is that with the new models in Lab 3, like LineCNNSimple or LineCNN, the predictions get a third dimension, because the output is now a sequence of letters. In Labs 1/2 we were predicting single letters only.
The Accuracy fix/hack uses dim=-1, which works as long as there are only 2 dimensions (batch, class), but from Lab 3 on it does the softmax over the wrong dimension (the dims are [128, 83, 32], i.e. (batch_size, num_classes, seq_len)). Setting the softmax to use dim=1 instead of dim=-1 makes it normalize over the correct dimension, the classes, to "softmax over".
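To make the dimension issue concrete, here is a minimal NumPy sketch (not the repo's actual torch code) showing why axis 1 is the right one for a (batch, num_classes, seq_len) tensor. The shape mirrors the [128, 83, 32] example above, scaled down; the hand-rolled softmax stands in for torch.nn.functional.softmax.

```python
import numpy as np

def softmax(x, axis):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
# preds shaped (batch, num_classes, seq_len), as in Lab 3
preds = np.random.randn(4, 83, 32)

# axis=1 normalizes over the class dimension:
# probabilities at each sequence position sum to 1
ok = softmax(preds, axis=1)
assert np.allclose(ok.sum(axis=1), 1.0)

# axis=-1 normalizes over the sequence dimension instead,
# so class probabilities no longer sum to 1 -> the ValueError
bad = softmax(preds, axis=-1)
assert not np.allclose(bad.sum(axis=1), 1.0)
```

With 2-dimensional (batch, class) predictions, as in Labs 1/2, dim=-1 and dim=1 refer to the same axis, which is why the bug only surfaced in Lab 3.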
Thanks for that, Marc, makes sense. I'll +1 your PR!
Closing this now as there is a PR by @mprostock in progress.
I understand your approach, but it is common practice to leave tickets/issues open until resolved (or replied to by a maintainer). My PR might not get accepted; they might choose to fix it in several other ways. Until then it would be good to keep the issue open, so that other people can easily verify they are not alone with their problem and that this issue exists, until it is actually fixed in the code. So - could you reopen this issue?
Point taken, reopened.
Thanks. Having the same problem.