DocTR finetuning error in CTC loss when target is long.
Bug description
If the target text is long, say about 14-20 characters, I am unable to fine-tune. Is there any flag to change the max-length?
Code snippet to reproduce the bug
import sys
# arguments for the doctr recognition training script (references/recognition/train_pytorch.py)
sys.argv = ['crnn_vgg16_bn',
'--train_path', '../data/content/datasets/train/',
'--val_path', '../data/content/datasets/val/',
'-b', '4',
'--input_size', '32',
'--max-chars', '42']
# '--show-samples']
Error traceback
Expected tensor to have size at least 42 at dimension 1, but got size 32 for argument #2 'targets' (while checking arguments for ctc_loss_gpu)
Environment
For security purposes, please check the contents of collect_env.py before running it.
wget https://raw.githubusercontent.com/mindee/doctr/main/scripts/collect_env.py
python collect_env.py
Collecting environment information...
DocTR version: N/A
TensorFlow version: N/A
PyTorch version: N/A (torchvision N/A)
OpenCV version: N/A
OS: Ubuntu 22.04.1 LTS
Python version: 3.8.0
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): N/A
CUDA runtime version: 11.5.119
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Ti
Nvidia driver version: 546.01
cuDNN version: Could not collect
Deep Learning backend
is_tf_available: False
is_torch_available: True
Hi @ajkdrag :wave:,
The max length for crnn_vgg16_bn is pinned to 32 (MASTER, for example, has a max length of 50).
https://github.com/mindee/doctr/blob/dd1fbbe69903321f30a86aec264b491038d57a30/doctr/models/recognition/crnn/pytorch.py#L129 (CRNN is the only arch where it is fixed at the moment - we should change this)
MASTER: https://github.com/mindee/doctr/blob/dd1fbbe69903321f30a86aec264b491038d57a30/doctr/models/recognition/master/pytorch.py#L62
But keep in mind we haven't experimented with different lengths so changing this can lead to unexpected behavior :)
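For the other recognition architectures, max_length is a regular constructor argument, so it can be overridden without touching the library code. A minimal sketch, assuming the master factory forwards keyword arguments to the model like the other doctr factories do (the value 64 is purely illustrative):

from doctr.models import master

# MASTER's max length is not pinned: it can be overridden at construction time
model = master(pretrained=False, max_length=64)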
If I am to fine-tune this, will it work if I change this max_length arg? Also, is there a way to force the vocab? In PaddleOCR I can define a dictionary of valid characters for my domain. Is it possible to do this here as well?
Mh, I took a short look: for all other models, yes; for CRNN, not at the moment, because this requires some changes in the CTC post-processor. What do you mean by "force the vocab"? :)
You can define your own vocab, of course.
If it's not already available in https://github.com/mindee/doctr/blob/main/doctr/datasets/vocabs.py,
add it, and afterwards you can pass --vocab=your-vocab-name to the train command.
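A minimal sketch of what such an entry could look like (the name cheque_en and the appended currency symbol are purely illustrative; each VOCABS value is just a string of allowed characters):

from doctr.datasets import VOCABS

# hypothetical custom entry, to be added to doctr/datasets/vocabs.py;
# it can reuse an existing vocab and append domain-specific characters
VOCABS["cheque_en"] = VOCABS["english"] + "₹"

The train command would then be launched with --vocab cheque_en.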
Got it, thanks. I am currently fine-tuning the recognition model as per the docs, but when I load the checkpoint, I get this error:
Error(s) in loading state_dict for CRNN:
size mismatch for linear.weight: copying a param with shape torch.Size([127, 256]) from checkpoint, the shape in current model is torch.Size([124, 256]).
size mismatch for linear.bias: copying a param with shape torch.Size([127]) from checkpoint, the shape in current model is torch.Size([124]).
import torch
from doctr.models import crnn_vgg16_bn

f_recs_model = "cheque-parser/notebooks/crnn_vgg16_bn_20240223-123911.pt"
reco_model = crnn_vgg16_bn(pretrained=False, pretrained_backbone=False)
reco_params = torch.load(f_recs_model, map_location="cpu")
reco_model.load_state_dict(reco_params)
My finetuning args:
sys.argv = ['crnn_vgg16_bn',
'--train_path', '../../datasets/rec/v1/train/',
'--val_path', '../../datasets/rec/v1/val/',
'-b', '32',
'--input_size', '32',
'--pretrained']
from doctr.datasets import VOCABS
reco_model = crnn_vgg16_bn(pretrained=False, pretrained_backbone=False, vocab=VOCABS["french"])
crnn_vgg16_bn was the first published model, so it is still trained on the old vocab (legacy_french), which contains fewer characters than the current default (for all other models -> french).
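A quick way to see where the 124 vs. 127 mismatch comes from, assuming (as is usual for CTC heads) that the linear layer has len(vocab) + 1 outputs to account for the blank token:

from doctr.datasets import VOCABS

# the fine-tuned checkpoint used the train script's default vocab ("french"),
# while the freshly built model defaults to the old "legacy_french" vocab
print(len(VOCABS["legacy_french"]) + 1)  # head size of the default crnn_vgg16_bn
print(len(VOCABS["french"]) + 1)         # head size of the fine-tuned checkpoint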
Thanks. The loading works, but the model training isn't doing too well. Do you have any training recipes? Do I need to resize the images before running the fine-tune? Here's what I did: I took a few cheque images (~400), ran them through Google's Doc AI to get the bboxes and texts, and created the dataset in the correct format. I can see that some images are really small. Will that cause problems?
Also, I am running into another issue: vocab mismatch. The ground truth has some characters that aren't in the vocab. How should I deal with those? Is there a helper script/library that can convert my annotations to VOCABS["english"] or VOCABS["french"]?
UPDATE:
I fixed the vocab issue by using the unidecode lib on my labels to convert them all to ASCII.
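A minimal sketch of that normalization, assuming the recognition dataset's labels.json format (a plain mapping of image file name to label string):

import json
from unidecode import unidecode

# load the labels, transliterate every label to plain ASCII, and write them back
with open("labels.json", encoding="utf-8") as f:
    labels = json.load(f)

labels = {img: unidecode(text) for img, text in labels.items()}

with open("labels.json", "w", encoding="utf-8") as f:
    json.dump(labels, f, ensure_ascii=True)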
While fine-tuning parseq, I get this error: RuntimeError: The size of tensor a (33) must match the size of tensor b (31) at non-singleton dimension 3
sys.argv = ['parseq',
'--train_path', '../../datasets/rec/v1/train/',
'--val_path', '../../datasets/rec/v1/val',
'-b', '64',
'--epochs', '3',
'--save-dir', '../../temp/saves/',
'--pretrained']
# '--show-samples']
In this case you need to increase the model's max_length to the max label length + 1 (in your case 34 if the longest label has 32 characters):
https://github.com/mindee/doctr/blob/b2f9b17d66d3c39b35c83e6ac9d4caf35d4127c5/references/recognition/train_pytorch.py#L233
model = recognition.__dict__[args.arch](pretrained=args.pretrained, vocab=vocab, max_length=...)
The same would then be required when loading the custom trained model :)
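A small helper for picking that value, assuming the recognition dataset layout with a labels.json mapping image names to label strings (the path is taken from the training args above):

import json

with open("../../datasets/rec/v1/train/labels.json", encoding="utf-8") as f:
    labels = json.load(f)

# longest ground-truth label in the training set; max_length must be at least this + 1
longest = max(len(text) for text in labels.values())
print(f"longest label: {longest} -> max_length >= {longest + 1}")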
And loading example:
import torch
from doctr.models import ocr_predictor, parseq
from doctr.datasets import VOCABS
# change to your own vocab and the max_length used for training / the vocab can also be defined directly as a string (but keep in mind to keep the same character order)
reco_model = parseq(pretrained=False, pretrained_backbone=False, vocab=VOCABS["german"], max_length=...)
reco_params = torch.load('<path_to_pt>', map_location="cpu")
reco_model.load_state_dict(reco_params)
predictor = ocr_predictor(det_arch='db_resnet50', reco_arch=reco_model, pretrained=True)
About the pre-processing for recognition: by default, all samples are resized to 32x128 while keeping the aspect ratio. During training we also randomly apply some augmentations like noise / blur / shadow / perspective transform / etc.
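For reference, a minimal sketch of that resizing step using doctr's own transform, assuming the PyTorch backend and that Resize exposes preserve_aspect_ratio/symmetric_pad as in the doctr transforms module:

import torch
from doctr.transforms import Resize

# resize an arbitrary word crop to the 32x128 recognition input size, keeping the aspect ratio
resize = Resize((32, 128), preserve_aspect_ratio=True, symmetric_pad=True)
crop = torch.rand(3, 57, 300)  # dummy C x H x W crop
print(resize(crop).shape)  # expected: torch.Size([3, 32, 128])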
I am trying to fine-tune parseq. When I add max_length=34 to the train_pytorch.py script, it errors out because the model shape doesn't match the checkpoint.
Mh, in this case you need to modify the ignored keys of the state dict (those parameters are then re-initialized randomly) at:
https://github.com/mindee/doctr/blob/b2f9b17d66d3c39b35c83e6ac9d4caf35d4127c5/doctr/models/recognition/parseq/pytorch.py#L479
to:
ignore_keys=["pos_queries", "embed.embedding.weight", "head.weight", "head.bias"],
NOTE: This will only be triggered if you use a different vocab than the default one (french)
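Conceptually, the modified ignore_keys just drops the shape-dependent tensors from the pretrained checkpoint so the freshly initialized ones are kept. A rough, self-contained sketch of that idea (not the exact doctr loading code; the checkpoint here is a stand-in built from the model itself):

import torch
from doctr.models import parseq
from doctr.datasets import VOCABS

model = parseq(pretrained=False, pretrained_backbone=False, vocab=VOCABS["french"], max_length=34)
checkpoint = model.state_dict()  # stand-in for a pretrained checkpoint

# drop the vocab/max_length-dependent parameters and load the rest non-strictly
ignore_keys = ["pos_queries", "embed.embedding.weight", "head.weight", "head.bias"]
filtered = {k: v for k, v in checkpoint.items() if k not in ignore_keys}
model.load_state_dict(filtered, strict=False)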
Ok, got it. So if I use the french vocab with parseq, there shouldn't be any issue with fine-tuning?