
DocTR finetuning error in CTC loss when target is long.

Open ajkdrag opened this issue 1 year ago • 13 comments

Bug description

If the target text is long, say about 14-20 characters, I am unable to fine-tune. Is there any flag to change the max-length?

Code snippet to reproduce the bug

import sys
sys.argv = ['crnn_vgg16_bn', 
            '--train_path', '../data/content/datasets/train/',
            '--val_path', '../data/content/datasets/val/',
            '-b', '4',
            '--input_size', '32',
            '--max-chars', '42']
            # '--show-samples']

Error traceback

Expected tensor to have size at least 42 at dimension 1, but got size 32 for argument #2 'targets' (while checking arguments for ctc_loss_gpu)

Environment

For security purposes, please check the contents of collect_env.py before running it.


Collecting environment information...

DocTR version: N/A
TensorFlow version: N/A
PyTorch version: N/A (torchvision N/A)
OpenCV version: N/A
OS: Ubuntu 22.04.1 LTS
Python version: 3.8.0
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): N/A
CUDA runtime version: 11.5.119
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Ti
Nvidia driver version: 546.01
cuDNN version: Could not collect

Deep Learning backend

is_tf_available: False
is_torch_available: True

ajkdrag avatar Feb 23 '24 11:02 ajkdrag

Hi @ajkdrag :wave:,

The max length for crnn_vgg16_bn is pinned to 32 (for comparison, master has a max length of 50). https://github.com/mindee/doctr/blob/dd1fbbe69903321f30a86aec264b491038d57a30/doctr/models/recognition/crnn/pytorch.py#L129 (crnn is the only arch where it is fixed at the moment - we should change this)

master: https://github.com/mindee/doctr/blob/dd1fbbe69903321f30a86aec264b491038d57a30/doctr/models/recognition/master/pytorch.py#L62

But keep in mind we haven't experimented with different lengths so changing this can lead to unexpected behavior :)
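
For background (not doctr-specific): PyTorch's CTC loss requires the padded target tensor to be at least as wide as the largest value in target_lengths, which is roughly what the traceback above shows (targets encoded to a fixed width of 32 vs. a 42-character label). A minimal standalone sketch, with shapes made up apart from the 32/42 from this thread:

import torch
import torch.nn.functional as F

T, N, C, model_max_len = 32, 1, 124, 32                  # output length, batch, classes, fixed max_length
log_probs = torch.randn(T, N, C).log_softmax(-1)
targets = torch.randint(1, C, (N, model_max_len))        # labels encoded/padded to the model's max_length
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 42, dtype=torch.long)  # ...but the true label has 42 characters
# fails with a size error like the one in the traceback above (targets narrower than the claimed length)
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)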

felixdittrich92 avatar Feb 23 '24 11:02 felixdittrich92

If I want to fine-tune this, will it work if I change this max_length arg? Also, is there a way to force the vocab? In PaddleOCR I can define a dictionary of valid characters for my domain. Is it possible to do this here as well?

ajkdrag avatar Feb 23 '24 11:02 ajkdrag

Mh, I took a short look - for all other models yes, but for crnn not at the moment, because this requires some changes in the CTC post-processor. What do you mean by "force the vocab"? :)

You can define your own vocab of course

If not already available in: https://github.com/mindee/doctr/blob/main/doctr/datasets/vocabs.py

Add it and afterwards you can add --vocab=your-vocab-name to the train command
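
For illustration, registering a custom vocab could look roughly like this (the entry name "cheque_en" is made up for this example; the building blocks should already exist in vocabs.py):

# in doctr/datasets/vocabs.py - hypothetical new entry assembled from existing pieces
VOCABS["cheque_en"] = VOCABS["digits"] + VOCABS["ascii_letters"] + VOCABS["punctuation"]

and then pass --vocab cheque_en to the train command.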

felixdittrich92 avatar Feb 23 '24 11:02 felixdittrich92

Got it thanks. I am currently finetuning the recog model as per the docs, but when I load the checkpoint, i get this error:

Error(s) in loading state_dict for CRNN:
	size mismatch for linear.weight: copying a param with shape torch.Size([127, 256]) from checkpoint, the shape in current model is torch.Size([124, 256]).
	size mismatch for linear.bias: copying a param with shape torch.Size([127]) from checkpoint, the shape in current model is torch.Size([124]).

The loading code:

import torch
from doctr.models import crnn_vgg16_bn

f_recs_model = "cheque-parser/notebooks/crnn_vgg16_bn_20240223-123911.pt"
reco_model = crnn_vgg16_bn(pretrained=False, pretrained_backbone=False)
reco_params = torch.load(f_recs_model, map_location="cpu")
reco_model.load_state_dict(reco_params)

My finetuning args:

sys.argv = ['crnn_vgg16_bn', 
            '--train_path', '../../datasets/rec/v1/train/',
            '--val_path', '../../datasets/rec/v1/val/',
            '-b', '32',
            '--input_size', '32',
            '--pretrained']

ajkdrag avatar Feb 23 '24 12:02 ajkdrag

from doctr.datasets import VOCABS
from doctr.models import crnn_vgg16_bn

reco_model = crnn_vgg16_bn(pretrained=False, pretrained_backbone=False, vocab=VOCABS["french"])

crnn_vgg16_bn was the first published model and is still trained on the old vocab (legacy_french), which contains fewer characters than the current default (french, used by all other models)

felixdittrich92 avatar Feb 23 '24 13:02 felixdittrich92

Thanks. The loading works, but the model training isn't doing too well. Do you have any training recipes? Do I need to resize the images before running the fine-tuning? Here's what I did: I took a few cheque images (~400), ran them through Google's Doc AI to get the bboxes and texts, and created the dataset in the correct format. I can see that some images are really small. Will that cause problems?

Also, I am running into another issue: vocab mismatch. The ground truth has some characters that aren't in the vocab. How do I deal with those? Is there a helper script/library that can convert my annotations to VOCABS["english"] or VOCABS["french"]?

UPDATE: I fixed the vocab issue by using the unidecode lib on my labels to convert them all to ASCII.
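
A minimal sketch of that kind of cleanup (assuming the labels.json format used by doctr's recognition datasets; not the exact script used here):

import json
from unidecode import unidecode

# map every ground-truth label to its closest ASCII form so all characters fall inside the chosen vocab
with open("labels.json", "r", encoding="utf-8") as f:
    labels = json.load(f)

labels = {img_name: unidecode(text) for img_name, text in labels.items()}

with open("labels.json", "w", encoding="utf-8") as f:
    json.dump(labels, f, ensure_ascii=True)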

ajkdrag avatar Feb 23 '24 13:02 ajkdrag

While fine-tuning parseq, I get this error: RuntimeError: The size of tensor a (33) must match the size of tensor b (31) at non-singleton dimension 3

sys.argv = ['parseq', 
            '--train_path', '../../datasets/rec/v1/train/',
            '--val_path', '../../datasets/rec/v1/val',
            '-b', '64',
            '--epochs', '3',
            '--save-dir', '../../temp/saves/',
            '--pretrained']
            # '--show-samples']

ajkdrag avatar Feb 24 '24 08:02 ajkdrag

While fine-tuning parseq, I get this error: RuntimeError: The size of tensor a (33) must match the size of tensor b (31) at non-singleton dimension 3

sys.argv = ['parseq', 
            '--train_path', '../../datasets/rec/v1/train/',
            '--val_path', '../../datasets/rec/v1/val',
            '-b', '64',
            '--epochs', '3',
            '--save-dir', '../../temp/saves/',
            '--pretrained']
            # '--show-samples']

In this case you need to increase the model's max_length to the max label length + 1 (in your case 34 if the longest label has 32 characters).

https://github.com/mindee/doctr/blob/b2f9b17d66d3c39b35c83e6ac9d4caf35d4127c5/references/recognition/train_pytorch.py#L233

model = recognition.__dict__[args.arch](pretrained=args.pretrained, vocab=vocab, max_length=...)
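
With the numbers from this thread that would presumably look like:

# assuming the longest label in the dataset has 32 characters, as stated above
model = recognition.__dict__[args.arch](pretrained=args.pretrained, vocab=vocab, max_length=34)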

The same would then be required when loading the custom-trained model :)

felixdittrich92 avatar Feb 24 '24 11:02 felixdittrich92

And loading example:

import torch
from doctr.models import ocr_predictor, parseq
from doctr.datasets import VOCABS

# change to the vocab and max_length you used for training; the vocab can also be passed directly as a string (but keep the same character order)
reco_model = parseq(pretrained=False, pretrained_backbone=False, vocab=VOCABS["german"], max_length=...)
reco_params = torch.load('<path_to_pt>', map_location="cpu")
reco_model.load_state_dict(reco_params)

predictor = ocr_predictor(det_arch='db_resnet50', reco_arch=reco_model, pretrained=True)

felixdittrich92 avatar Feb 24 '24 11:02 felixdittrich92

About the preprocessing for recognition: by default all samples are resized to 32x128 while keeping the aspect ratio, and during training we randomly apply some augmentations like noise / blur / shadow / perspective transform / etc.
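
A small sketch of that resize step using doctr's Resize transform (signature assumed from the current release; the random augmentations are omitted here):

import torch
from doctr.transforms import Resize

# resize a recognition crop to 32x128 while keeping its aspect ratio (the remainder is padded)
resize = Resize((32, 128), preserve_aspect_ratio=True, symmetric_pad=True)
out = resize(torch.rand(3, 64, 256))  # e.g. a 64x256 crop -> tensor of shape (3, 32, 128)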

felixdittrich92 avatar Feb 24 '24 12:02 felixdittrich92

While fine-tuning parseq, I get this error: RuntimeError: The size of tensor a (33) must match the size of tensor b (31) at non-singleton dimension 3

sys.argv = ['parseq', 
            '--train_path', '../../datasets/rec/v1/train/',
            '--val_path', '../../datasets/rec/v1/val',
            '-b', '64',
            '--epochs', '3',
            '--save-dir', '../../temp/saves/',
            '--pretrained']
            # '--show-samples']

In this case you need to increase the model's max_length to the max label length + 1 (in your case 34 if the longest label has 32 characters).

https://github.com/mindee/doctr/blob/b2f9b17d66d3c39b35c83e6ac9d4caf35d4127c5/references/recognition/train_pytorch.py#L233

model = recognition.__dict__[args.arch](pretrained=args.pretrained, vocab=vocab, max_length=...)

The same would then be required when loading the custom-trained model :)

I am trying to fine-tune parseq. When I add max_length=34 to the train_pytorch.py script, it errors about the model shape not matching the checkpoint.

ajkdrag avatar Feb 24 '24 15:02 ajkdrag

Mh, in this case you need to modify the ignored keys from the state dict (those layers are then re-initialized randomly) at:

https://github.com/mindee/doctr/blob/b2f9b17d66d3c39b35c83e6ac9d4caf35d4127c5/doctr/models/recognition/parseq/pytorch.py#L479

to:

ignore_keys=["pos_queries", "embed.embedding.weight", "head.weight", "head.bias"],

NOTE: This will only be triggered if you use a different vocab than the default one (french)

felixdittrich92 avatar Feb 25 '24 11:02 felixdittrich92

Ok, got it. So if I use the french vocab with parseq, there shouldn't be any issue with fine-tuning?

ajkdrag avatar Mar 06 '24 06:03 ajkdrag