DocTR finetuning error in CTC loss when target is long.
Bug description
If the target text is long, say about 14-20 characters, I am unable to fine-tune. Is there any flag to change the max-length?
Code snippet to reproduce the bug
import sys
# arguments for the doctr recognition training script (references/recognition/train_pytorch.py)
sys.argv = ['crnn_vgg16_bn',
'--train_path', '../data/content/datasets/train/',
'--val_path', '../data/content/datasets/val/',
'-b', '4',
'--input_size', '32',
'--max-chars', '42']
# '--show-samples']
Error traceback
Expected tensor to have size at least 42 at dimension 1, but got size 32 for argument #2 'targets' (while checking arguments for ctc_loss_gpu)
Environment
For security purposes, please check the contents of collect_env.py before running it.
wget https://raw.githubusercontent.com/mindee/doctr/main/scripts/collect_env.py
python collect_env.py
Collecting environment information...
DocTR version: N/A
TensorFlow version: N/A
PyTorch version: N/A (torchvision N/A)
OpenCV version: N/A
OS: Ubuntu 22.04.1 LTS
Python version: 3.8.0
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): N/A
CUDA runtime version: 11.5.119
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Ti
Nvidia driver version: 546.01
cuDNN version: Could not collect
Deep Learning backend
is_tf_available: False
is_torch_available: True
Hi @ajkdrag :wave:,
The max length for crnn_vgg16_bn is pinned to 32 (MASTER, for example, has a max length of 50).
https://github.com/mindee/doctr/blob/dd1fbbe69903321f30a86aec264b491038d57a30/doctr/models/recognition/crnn/pytorch.py#L129 (CRNN is the only arch where it is fixed at the moment - we should change this)
MASTER: https://github.com/mindee/doctr/blob/dd1fbbe69903321f30a86aec264b491038d57a30/doctr/models/recognition/master/pytorch.py#L62
But keep in mind we haven't experimented with different lengths so changing this can lead to unexpected behavior :)
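For the other recognition architectures, max_length is a regular constructor argument, so it can be overridden without touching the library code. A minimal sketch, assuming the master factory forwards keyword arguments to the model like the other doctr factories do (the value 64 is purely illustrative):

from doctr.models import master

# MASTER's max length is not pinned: it can be overridden at construction time
model = master(pretrained=False, max_length=64)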
If I am to fine-tune this, will it work if I change this max_length arg? Also, is there a way to force the vocab? In PaddleOCR I can define a dictionary of valid characters for my domain. Is it possible to do this here as well?
Mh, I took a short look: for all other models, yes; for CRNN, not at the moment, because this requires some changes in the CTC post-processor. What do you mean by "force the vocab"? :)
You can define your own vocab, of course.
If it's not already available in https://github.com/mindee/doctr/blob/main/doctr/datasets/vocabs.py,
add it, and afterwards you can pass --vocab=your-vocab-name to the train command.
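A minimal sketch of what such an entry could look like (the name cheque_en and the appended currency symbol are purely illustrative; each VOCABS value is just a string of allowed characters):

from doctr.datasets import VOCABS

# hypothetical custom entry, to be added to doctr/datasets/vocabs.py;
# it can reuse an existing vocab and append domain-specific characters
VOCABS["cheque_en"] = VOCABS["english"] + "₹"

The train command would then be launched with --vocab cheque_en.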
Got it, thanks. I am currently fine-tuning the recognition model as per the docs, but when I load the checkpoint, I get this error:
Error(s) in loading state_dict for CRNN:
size mismatch for linear.weight: copying a param with shape torch.Size([127, 256]) from checkpoint, the shape in current model is torch.Size([124, 256]).
size mismatch for linear.bias: copying a param with shape torch.Size([127]) from checkpoint, the shape in current model is torch.Size([124]).
import torch
from doctr.models import crnn_vgg16_bn

f_recs_model = "cheque-parser/notebooks/crnn_vgg16_bn_20240223-123911.pt"
reco_model = crnn_vgg16_bn(pretrained=False, pretrained_backbone=False)
reco_params = torch.load(f_recs_model, map_location="cpu")
reco_model.load_state_dict(reco_params)
My finetuning args:
sys.argv = ['crnn_vgg16_bn',
'--train_path', '../../datasets/rec/v1/train/',
'--val_path', '../../datasets/rec/v1/val/',
'-b', '32',
'--input_size', '32',
'--pretrained']
from doctr.datasets import VOCABS
reco_model = crnn_vgg16_bn(pretrained=False, pretrained_backbone=False, vocab=VOCABS["french"])
crnn_vgg16_bn was the first published model, so it is still trained on the old vocab (legacy_french), which contains fewer characters than the current default (for all other models -> french).
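A quick way to see where the 124 vs. 127 mismatch comes from, assuming (as is usual for CTC heads) that the linear layer has len(vocab) + 1 outputs to account for the blank token:

from doctr.datasets import VOCABS

# the fine-tuned checkpoint used the train script's default vocab ("french"),
# while the freshly built model defaults to the old "legacy_french" vocab
print(len(VOCABS["legacy_french"]) + 1)  # head size of the default crnn_vgg16_bn
print(len(VOCABS["french"]) + 1)         # head size of the fine-tuned checkpoint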
Thanks. The loading works, but the model training isn't doing too well. Do you have any training recipes? Do I need to resize the images before running the fine-tune? Here's what I did: I took a few cheque images (~400), ran them through Google's Doc AI to get the bboxes and texts, and created the dataset in the correct format. I can see that some images are really small. Will that cause problems?
Also, I am running into another issue: vocab mismatch. The ground truth has some characters that aren't in the vocab. How should I deal with those? Is there a helper script/library that can convert my annotations to VOCABS["english"] or VOCABS["french"]?
UPDATE:
I fixed the vocab issue by using the unidecode lib on my labels to convert them all to ASCII.
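A minimal sketch of that normalization, assuming the recognition dataset's labels.json format (a plain mapping of image file name to label string):

import json
from unidecode import unidecode

# load the labels, transliterate every label to plain ASCII, and write them back
with open("labels.json", encoding="utf-8") as f:
    labels = json.load(f)

labels = {img: unidecode(text) for img, text in labels.items()}

with open("labels.json", "w", encoding="utf-8") as f:
    json.dump(labels, f, ensure_ascii=True)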
While fine-tuning parseq, I get this error: RuntimeError: The size of tensor a (33) must match the size of tensor b (31) at non-singleton dimension 3
sys.argv = ['parseq',
'--train_path', '../../datasets/rec/v1/train/',
'--val_path', '../../datasets/rec/v1/val',
'-b', '64',
'--epochs', '3',
'--save-dir', '../../temp/saves/',
'--pretrained']
# '--show-samples']
In this case you need to increase the model's max_length to the max label length + 1 (in your case 34 if the longest label has 32 characters):
https://github.com/mindee/doctr/blob/b2f9b17d66d3c39b35c83e6ac9d4caf35d4127c5/references/recognition/train_pytorch.py#L233
model = recognition.__dict__[args.arch](pretrained=args.pretrained, vocab=vocab, max_length=...)
The same would then be required when loading the custom trained model :)
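A small helper for picking that value, assuming the recognition dataset layout with a labels.json mapping image names to label strings (the path is taken from the training args above):

import json

with open("../../datasets/rec/v1/train/labels.json", encoding="utf-8") as f:
    labels = json.load(f)

# longest ground-truth label in the training set; max_length must be at least this + 1
longest = max(len(text) for text in labels.values())
print(f"longest label: {longest} -> max_length >= {longest + 1}")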
And loading example:
import torch
from doctr.models import ocr_predictor, parseq
from doctr.datasets import VOCABS
# change to your own vocab and the max_length used for training / the vocab can also be defined directly as a string (but keep in mind to keep the same character order)
reco_model = parseq(pretrained=False, pretrained_backbone=False, vocab=VOCABS["german"], max_length=...)
reco_params = torch.load('<path_to_pt>', map_location="cpu")
reco_model.load_state_dict(reco_params)
predictor = ocr_predictor(det_arch='db_resnet50', reco_arch=reco_model, pretrained=True)
About the pre-processing for recognition: by default, all samples are resized to 32x128 while keeping the aspect ratio. During training we also randomly apply some augmentations like noise / blur / shadow / perspective transform / etc.
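For reference, a minimal sketch of that resizing step using doctr's own transform, assuming the PyTorch backend and that Resize exposes preserve_aspect_ratio/symmetric_pad as in the doctr transforms module:

import torch
from doctr.transforms import Resize

# resize an arbitrary word crop to the 32x128 recognition input size, keeping the aspect ratio
resize = Resize((32, 128), preserve_aspect_ratio=True, symmetric_pad=True)
crop = torch.rand(3, 57, 300)  # dummy C x H x W crop
print(resize(crop).shape)  # expected: torch.Size([3, 32, 128])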
I am trying to fine-tune parseq. When I add max_length=34 to the train_pytorch.py script, it errors out because the model shape doesn't match the checkpoint.
Mh, in this case you need to modify the ignored keys of the state dict (those parameters are then re-initialized randomly) at:
https://github.com/mindee/doctr/blob/b2f9b17d66d3c39b35c83e6ac9d4caf35d4127c5/doctr/models/recognition/parseq/pytorch.py#L479
to:
ignore_keys=["pos_queries", "embed.embedding.weight", "head.weight", "head.bias"],
NOTE: This will only be triggered if you use a different vocab than the default one (french)
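Conceptually, the modified ignore_keys just drops the shape-dependent tensors from the pretrained checkpoint so the freshly initialized ones are kept. A rough, self-contained sketch of that idea (not the exact doctr loading code; the checkpoint here is a stand-in built from the model itself):

import torch
from doctr.models import parseq
from doctr.datasets import VOCABS

model = parseq(pretrained=False, pretrained_backbone=False, vocab=VOCABS["french"], max_length=34)
checkpoint = model.state_dict()  # stand-in for a pretrained checkpoint

# drop the vocab/max_length-dependent parameters and load the rest non-strictly
ignore_keys = ["pos_queries", "embed.embedding.weight", "head.weight", "head.bias"]
filtered = {k: v for k, v in checkpoint.items() if k not in ignore_keys}
model.load_state_dict(filtered, strict=False)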
Ok, got it. So if I use the french vocab with parseq, there shouldn't be any issue with fine-tuning?