
Recommendations for training on a new language?

PSanni opened this issue 3 years ago • 44 comments

Any recommendations for training or fine-tuning the model on a new language?

  1. Will training for a new language (e.g. Arabic) work on existing pre-trained models, or does it have to be done from scratch?
  2. What is the recommended amount of data for a new language?

PSanni avatar Jul 29 '22 09:07 PSanni

  1. Fine-tuning should work for any language based on the Latin alphabet. If the language uses a different set of characters, you should define a new training charset configuration, e.g. configs/charset/arabic.yaml and use it during training. You should also update charset_test with the same set of characters used for training. The training command should look something like ./train.py charset=arabic, where configs/charset/arabic.yaml contains:
# @package _global_
model:
  charset_train: "..."
  charset_test: "..."
  2. I don't have a definite answer for this since it would depend on the quality of your training data, and how similar its distribution is to the test data. In our experiments with real training data, PARSeq starts to perform well after 40k iterations (batch size = 384).

baudm avatar Jul 29 '22 10:07 baudm

[image: validation word accuracy curves for the two TextOCR-trained models]

This is an old result for PARSeq. I was comparing the validation word accuracy of models trained exclusively on TextOCR (arbitrary) and on its pose-corrected version (horizontal). DDP was used with 2 GPUs, so the effective iteration count is the number shown on the x-axis multiplied by 2.

baudm avatar Jul 29 '22 10:07 baudm

  1. Fine-tuning should work for any language based on the Latin alphabet. If the language uses a different set of characters, you should define a new training charset configuration, e.g. configs/charset/arabic.yaml and use it during training. You should also update charset_test with the same set of characters used for training. The training command should look something like ./train.py charset=arabic model.charset_test=<Arabic characters>
  2. I don't have a definite answer for this since it would depend on the quality of your training data, and how similar its distribution is to the test data. In our experiments with real training data, PARSeq starts to perform well after 40k iterations (batch size = 384).

@baudm Thanks for your great repo. I want to fine-tune for the Vietnamese language. Can it be trained? And if it can, how should I prepare the dataset for training? Many thanks for your response.

phamkhactu avatar Jul 29 '22 11:07 phamkhactu

@baudm Thanks for your great repo. I want to fine-tune for the Vietnamese language. Can it be trained? And if it can, how should I prepare the dataset for training? Many thanks for your response.

@phamkhactu Re: finetuning, please refer to my first comment.

For dataset preparation, please refer to clovaai/deep-text-recognition-benchmark on how to convert your image-text pairs into LMDB databases.
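
The gist of that tool, if I remember correctly (the flag names are from its README, so verify them there), is a tab-separated ground-truth file plus its conversion script:

# gt.txt lists one "{image path}\t{label}" pair per line, paths relative to --inputPath, e.g.:
#   images/word_1.png	xin chào
#   images/word_2.png	Việt Nam
python3 create_lmdb_dataset.py --inputPath data/ --gtFile data/gt.txt --outputPath data_lmdb/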

baudm avatar Jul 30 '22 05:07 baudm

  1. Fine-tuning should work for any language based on the Latin alphabet. If the language uses a different set of characters, you should define a new training charset configuration, e.g. configs/charset/arabic.yaml and use it during training. You should also update charset_test with the same set of characters used for training. The training command should look something like ./train.py charset=arabic model.charset_test=<Arabic characters>
  2. I don't have a definite answer for this since it would depend on the quality of your training data, and how similar its distribution is to the test data. In our experiments with real training data, PARSeq starts to perform well after 40k iterations (batch size = 384).

When you say quality, do you mean the quality of the images, or the term coverage?

PSanni avatar Aug 01 '22 07:08 PSanni

I am also having an issue with this. When training and validating, I set all character sets in train and test to Chinese characters + Latin alphanumerics, and even created a separate YAML file.

When I print out model.config while training, it seems to show the charset properly, but after training, when I use the checkpoints to recognise images using read.py, it does not output any Chinese characters.

Not sure if this is an issue in read.py, train.py, test.py, or the LMDB dataset itself, as the val accuracy is 99.93%. Please guide/help if possible.

siddagra avatar Aug 01 '22 14:08 siddagra

For now I have used a dirty hack: using the 94_full charset's symbols to represent Chinese characters, mapping Chinese characters to symbols in the LMDB dataset and then mapping the symbols back to Chinese characters during/after inference.
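
Roughly, the mapping looks like this (a minimal sketch with a hypothetical character set, not code from the repo):

import string

# Hypothetical: the limited set of Chinese characters in my data, paired
# one-to-one with printable ASCII symbols from the 94_full charset.
chinese_chars = '中国字'
symbols = string.punctuation  # 32 symbols; extend as needed
to_symbol = dict(zip(chinese_chars, symbols))
from_symbol = {s: c for c, s in to_symbol.items()}

def encode(text):  # apply to labels before writing the LMDB dataset
    return ''.join(to_symbol.get(c, c) for c in text)

def decode(text):  # apply to model predictions after inference
    return ''.join(from_symbol.get(c, c) for c in text)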

siddagra avatar Aug 01 '22 14:08 siddagra

@PSanni

When you say quality, do you mean the quality of the images, or the term coverage?

By dataset quality, I mean dataset size, diversity of samples, accuracy of labels, etc., not quality of images per se.

baudm avatar Aug 01 '22 17:08 baudm

I am also having an issue with this. When training and validating, I set all character sets in train and test to Chinese characters + Latin alphanumerics, and even created a separate YAML file.

When I print out model.config while training, it seems to show the charset properly, but after training, when I use the checkpoints to recognise images using read.py, it does not output any Chinese characters.

Not sure if this is an issue in read.py, train.py, test.py, or the LMDB dataset itself, as the val accuracy is 99.93%. Please guide/help if possible.

@siddagra unless you have a very small and easy val set, val accuracy of 99.93% likely indicates a problem with your training setup.

  1. First, you have to make sure that your training dataset is correctly prepared. Open the lmdb archives, query an image and its corresponding label, and check if the image and label are correct and intact.
  2. Disable Unicode normalization: data.normalize_unicode=false
  3. Probe the SceneTextDataModule instance. You can do it in any script (train.py, read.py, and test.py). You can check the labels returned by LmdbDataset using the train_dataset or val_dataset property of the data module instance. Make sure it returns the expected labels.
  4. Check if CharsetAdapter works. Create an instance using your charset, e.g. adapter = CharsetAdapter(charset), then test if it returns the correct output given Chinese text: adapter(some_text). (A probe sketch follows this list.)
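
A minimal probe for step 4 would be (a sketch; the import path is from this repo's strhub package):

from strhub.data.utils import CharsetAdapter

charset = '0123456789abcdefghijklmnopqrstuvwxyz'  # replace with your training charset
adapter = CharsetAdapter(charset)
# Characters outside the charset should be filtered out, not garbled:
print(adapter('abc123'))   # expected: 'abc123'
print(adapter('abc中文'))  # expected: 'abc', since the Chinese characters are not in the charset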

baudm avatar Aug 01 '22 18:08 baudm

Thanks a lot for your help!

I printed out labels in several places: base.py, dataset.py, during LMDB encoding and decoding, etc.

base.py

pred = self.charset_adapter(pred)
print(pred)

dataset.py

                label = charset_adapter(label)
                print(label)

It seems to be working fine everywhere. The only issue seems to be read.py itself, perhaps. The argmax output of the variable p (the output logits) never contains an index higher than 23, even though the sequence should include character indices up to 96.

tensor([[ 6,  9, 20,  9,  9,  1,  9,  3,  4,  2,  5, 12, 23,  8,  8, 12, 23, 12,
         23,  8, 21, 23, 13, 19, 13, 23,  8, 12, 12,  8, 12, 23, 23, 23,  8, 21,
          8,  0,  8,  8,  8, 20, 13,  9, 22, 21, 21, 20,  8, 18, 25, 21,  8, 23,
         22, 14, 16, 22, 19, 13, 14, 21, 17, 22, 23, 24, 13, 22, 25, 13, 23, 21,
         12,  9, 23, 23, 13, 23, 22, 23, 23, 22, 13,  9,  8, 23, 23, 23, 24, 23,
         13, 23, 23, 23, 21]], device='cuda:0')

I wanted to get results to report, but read.py was ignoring Chinese characters. It somehow started working now; I think it's because I disabled Unicode normalisation. Thanks!

test.py is giving this error:

dataset.py", line 72, in __del__
    self.env.close()
AttributeError: 'LmdbDataset' object has no attribute 'env'

Also, is it possible to train only the LM? The dataset I am training on contains limited language in a specific format, but I do not want the model to overfit to this format and perform poorly otherwise. I was wondering if it is possible to train on text data/character sequences alone, instead of image+label pairs. It could be useful to train the LM on larger (non-image) text corpora for languages with limited image data.

siddagra avatar Aug 02 '22 04:08 siddagra

@siddagra

It somehow started working now; I think it's because I disabled Unicode normalisation. Thanks!

test.py is giving this error:

dataset.py", line 72, in __del__
    self.env.close()
AttributeError: 'LmdbDataset' object has no attribute 'env'

You're using old code. Pull the latest and update your dependencies.

Also, is it possible to train only the LM? The dataset I am training on contains limited language in a specific format, but I do not want the model to overfit to this format and perform poorly otherwise. I was wondering if it is possible to train on text data/character sequences alone, instead of image+label pairs. It could be useful to train the LM on larger (non-image) text corpora for languages with limited image data.

Sorry, this is not possible with PARSeq since its LM is internal. You can do this with ABINet, but honestly in my opinion, training on raw text has limited utility for STR since it is still primarily a visual recognition problem.

To alleviate the issue with your limited training data, I would suggest using a more extreme augmentation on the images: rotate them by 90, 180, or 270 degrees. You can do this by modifying augment.py or by directly adding the rotations inside the image transforms in module.py.
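
For instance, the rotations could be added as an extra transform (a sketch using torchvision, not the repo's augment.py):

import random
from torchvision.transforms import functional as F

class RandomRightAngleRotation:
    # Rotate the crop by a random multiple of 90 degrees.
    def __call__(self, img):
        return F.rotate(img, angle=random.choice([0, 90, 180, 270]), expand=True)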

You may also lower the batch size in order to increase the variance and lessen the bias for each mini-batch. You could also play around with the value of K.

One STR-specific augmentation would be to form new training data by concatenating existing samples. I have implemented a simple version of this and it works (but more experimental validation is needed). The algorithm is something like this (a sketch follows the list):

  1. Choose a pair of samples.
  2. Allocate the image width proportional to the label length. That is, W = 128 * len(A) / (len(A) + len(B)) would be the allocation in pixels for image A.
  3. Resize the images to W x 32 and (128 - W) x 32 pixels, then concatenate them side by side.
  4. Concatenate the labels.
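
In code, the augmentation would look something like this (a rough sketch assuming PIL images and the 128x32 input size, not my actual implementation):

from PIL import Image

def concat_pair(img_a, label_a, img_b, label_b, width=128, height=32):
    # Step 2: allocate width proportional to label length.
    w_a = round(width * len(label_a) / (len(label_a) + len(label_b)))
    w_b = width - w_a
    # Step 3: resize and concatenate side by side.
    out = Image.new('RGB', (width, height))
    out.paste(img_a.resize((w_a, height), Image.BICUBIC), (0, 0))
    out.paste(img_b.resize((w_b, height), Image.BICUBIC), (w_a, 0))
    # Step 4: concatenate the labels.
    return out, label_a + label_b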

Lastly, you may try adding the augmentations implemented in straug.

baudm avatar Aug 02 '22 08:08 baudm

Is it possible to continue training from checkpoints? If so, are there any pre-trained weights available for fine-tuning? It would be great if you could write a short note on it.

PSanni avatar Aug 03 '22 13:08 PSanni

Is it possible to continue training from checkpoints? If so, are there any pre-trained weights available for fine-tuning? It would be great if you could write a short note on it.

https://github.com/baudm/parseq/issues/7#issuecomment-1198845845

baudm avatar Aug 03 '22 13:08 baudm

@PSanni @siddagra @bmusq As of commit b290950dad5a3dceb574cbc2d902765e1496ace2, finetuning is now officially supported. The checkpoint parameter of test.py and read.py has been changed accordingly.

Now you can do:

# Finetuning
./train.py pretrained=parseq  # parseq-tiny, etc. See released weights
# Resume from PL checkpoint
./train.py ckpt_path=outputs/parseq/.../last.ckpt

# Use pretrained weights for testing
./test.py pretrained=parseq  # same with read.py
# Or your own trained weights
./test.py outputs/parseq/.../last.ckpt  # same with read.py

baudm avatar Aug 04 '22 17:08 baudm

I am also having an issue with this. When training and validating, I set all character sets in train and test to Chinese characters + Latin alphanumerics, and even created a separate YAML file. When I print out model.config while training, it seems to show the charset properly, but after training, when I use the checkpoints to recognise images using read.py, it does not output any Chinese characters. Not sure if this is an issue in read.py, train.py, test.py, or the LMDB dataset itself, as the val accuracy is 99.93%. Please guide/help if possible.

@siddagra unless you have a very small and easy val set, val accuracy of 99.93% likely indicates a problem with your training setup.

  1. First, you have to make sure that your training dataset is correctly prepared. Open the lmdb archives, query an image and its corresponding label, and check if the image and label are correct and intact.
  2. Disable Unicode normalization: data.normalize_unicode=false
  3. Probe the SceneTextDataModule instance. You can do it in any script (train.py, read.py, and test.py). You can check the labels returned by LmdbDataset using the train_dataset or val_dataset property of the data module instance. Make sure it returns the expected labels.
  4. Check if CharsetAdapter works. Create an instance using your charset, e.g. adapter = CharsetAdapter(charset), then test if it returns the correct output given Chinese text: adapter(some_text).

I am running a synthetic data generator + imgaug to generate augmentations/distortions so that I can incorporate my own formats/language requirements. Is there any way to have it load images dynamically during data loading, instead of having to specify an LMDB dataset? Or do you think that would make training too slow?

siddagra avatar Aug 07 '22 12:08 siddagra

Now you can do:

# Finetuning
./train.py pretrained=parseq  # parseq-tiny, etc. See released weights

How should the finetuning data be passed to train.py at this point?

airogachev avatar Aug 08 '22 09:08 airogachev

Now you can do:

# Finetuning
./train.py pretrained=parseq  # parseq-tiny, etc. See released weights

How should the finetuning data be passed to train.py at this point?

As far as I know, you should have your data formatted as an LMDB database. Make use of the create_lmdb_dataset.py script provided in the tools folder. First make sure that your data can actually be fed into this script; this probably requires you to write your own converter. In the same folder, check the other Python files, which are converters themselves.

Once that's done, you have to put data.mdb and lock.mdb in the data/train/real folder. Thoroughly follow the folder structure described in the README of the data section.

Now if, like me, you have downloaded all the datasets used in the paper, your data folder should already be well populated. Something you can do is create a new folder, let's say custom_data, follow the same architecture, and put your .mdb files there.

Finally, in configs, open main.yaml and change root_dir to custom_data.
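
To illustrate, my custom_data folder ends up looking roughly like this (sketched from memory; mirror the exact structure in the data README):

custom_data
├── train
│   └── real
│       ├── data.mdb
│       └── lock.mdb
├── val
│   ├── data.mdb
│   └── lock.mdb
└── test
    ├── data.mdb
    └── lock.mdb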

I have done some finetuning myself and it works like a charm.

bmusq avatar Aug 09 '22 07:08 bmusq

@bmusq have you changed any parameters like the number of epochs or the learning rate, or did you just run train.py as-is? And one more: did you use only your data for tuning, or did you add it to the initial data from the paper?

airogachev avatar Aug 09 '22 07:08 airogachev

@bmusq have you changed any parameters like the number of epochs or the learning rate, or did you just run train.py as-is? And one more: did you use only your data for tuning, or did you add it to the initial data from the paper?

I used the pretrained weights and only my own data for tuning.

  • One run with 10k synthetic samples: jumped from 48% accuracy to 97% on synthetic data
  • One run with 338 real samples (I can hardly get more; it trained for just a few hours): since they are not annotated I can't make a % comparison, but using read.py I was able to visually confirm the improvement

As for other parameters:

  • Change the charset according to your data, especially if you are using extra special characters. Otherwise, just use the 94_full charset, since the pretrained weights were trained on it
  • I didn't touch the learning rate in the slightest
  • I left the number of epochs at 20 and changed the batch size to 64, mainly because I lack memory

bmusq avatar Aug 09 '22 08:08 bmusq

@bmusq so it is possible to finetune the model even when changing the charset, right?

airogachev avatar Aug 09 '22 08:08 airogachev

@bmusq so it is possible to finetune the model even when changing the charset, right?

I believe it is, yes. I think that is what @siddagra has done. Please see the top of this thread

bmusq avatar Aug 09 '22 08:08 bmusq

@bmusq have you changed any parameters like the number of epochs or the learning rate, or did you just run train.py as-is? And one more: did you use only your data for tuning, or did you add it to the initial data from the paper?

One more thing: if the amount of data you have is small, as in my case, you might also want to change val_check_interval in configs/main.yaml. By default it is set to 1000, which means validation runs every 1000 batches. If you do not have enough data per epoch, you will never trigger validation, because you do not actually have that many batches, especially with batches of size 384.

Something you can do is set val_check_interval to a value between 0 and 1; validation will then trigger after that fraction of the training epoch. If you want to explore more of the Trainer's parameters, follow this link: https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html
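
For instance (hypothetical values; the exact key lives in configs/main.yaml under the trainer section, so double-check it there):

# validate twice per training epoch instead of every 1000 batches
./train.py trainer.val_check_interval=0.5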

bmusq avatar Aug 09 '22 08:08 bmusq

@bmusq so it is possible to finetune the model even when changing the charset, right?

I tried to fine-tune the pre-trained "parseq" with Arabic data. I don't know if it's just me, but amending the training charset with Arabic characters triggered a tensor dimension mismatch. I am still looking for a workaround, but it would be great if anyone who has succeeded could advise on it.

PSanni avatar Aug 09 '22 08:08 PSanni

@bmusq so it is possible to finetune the model even when changing the charset, right?

I tried to fine-tune the pre-trained "parseq" with Arabic data. I don't know if it's just me, but amending the training charset with Arabic characters triggered a tensor dimension mismatch. I am still looking for a workaround, but it would be great if anyone who has succeeded could advise on it.

Have you tried disabling Unicode normalization?

bmusq avatar Aug 09 '22 09:08 bmusq

@bmusq so it is possible to finetune the model even when changing the charset, right?

I tried to fine-tune the pre-trained "parseq" with Arabic data. I don't know if it's just me, but amending the training charset with Arabic characters triggered a tensor dimension mismatch. I am still looking for a workaround, but it would be great if anyone who has succeeded could advise on it.

Have you tried disabling Unicode normalization?

Yes, I did. But the problem is the size of the embedding layer, which is 95 (the characters in 94_full.yaml). Therefore, including any additional characters causes a mismatch error. So I think the only options left are retraining the model with multiple languages, or extending the input layer with additional entries and freezing some weights.

PSanni avatar Aug 09 '22 11:08 PSanni

@PSanni

Yes, I did. But the problem is the size of the embedding layer, which is 95 (the characters in 94_full.yaml). Therefore, including any additional characters causes a mismatch error. So I think the only options left are retraining the model with multiple languages, or extending the input layer with additional entries and freezing some weights.

If you don't change img_size and patch_size, you could still use the pretrained weights to initialize the encoder and the decoder. You need to do it manually, though; refer to the PyTorch docs on partial loading of a state dict (a rough sketch below). The character and position embeddings have to be trained from scratch if you change the charset or model.max_label_length.
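
Something like the following (a rough sketch; the skipped parameter-name prefixes are assumptions, so check them against your checkpoint's actual keys):

import torch

# model: your freshly instantiated PARSeq with the new charset
ckpt = torch.load('pretrained.ckpt', map_location='cpu')
state = ckpt.get('state_dict', ckpt)

# Skip weights whose shapes depend on the charset or max_label_length
# (character embedding, output head, position queries); prefixes are hypothetical.
skip = ('model.text_embed', 'model.head', 'model.pos_queries')
filtered = {k: v for k, v in state.items() if not k.startswith(skip)}

missing, unexpected = model.load_state_dict(filtered, strict=False)
print('Trained from scratch:', missing)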

baudm avatar Aug 09 '22 18:08 baudm

Now you can do:

# Finetuning
./train.py pretrained=parseq  # parseq-tiny, etc. See released weights

How should the finetuning data be passed to train.py at this point?

I set up the data using the process I mentioned in: https://github.com/baudm/parseq/issues/7

siddagra avatar Aug 09 '22 18:08 siddagra

@bmusq so it is possible to finetune the model even when changing the charset, right?

I tried to fine-tune the pre-trained "parseq" with Arabic data. I don't know if it's just me, but amending the training charset with Arabic characters triggered a tensor dimension mismatch. I am still looking for a workaround, but it would be great if anyone who has succeeded could advise on it.

Have you tried disabling Unicode normalization?

Yes, I did. But the problem is the size of the embedding layer, which is 95 (the characters in 94_full.yaml). Therefore, including any additional characters causes a mismatch error. So I think the only options left are retraining the model with multiple languages, or extending the input layer with additional entries and freezing some weights.

Can you not just add dummy characters to train_charset and remove them from test_charset? Unless you need more than 94 characters, this should work. This is essentially what I did.

siddagra avatar Aug 10 '22 16:08 siddagra

@bmusq so it is possible to finetune the model even when changing the charset, right?

I tried to fine-tune the pre-trained "parseq" with Arabic data. I don't know if it's just me, but amending the training charset with Arabic characters triggered a tensor dimension mismatch. I am still looking for a workaround, but it would be great if anyone who has succeeded could advise on it.

Have you tried disabling Unicode normalization?

Yes, I did. But the problem is the size of the embedding layer, which is 95 (the characters in 94_full.yaml). Therefore, including any additional characters causes a mismatch error. So I think the only options left are retraining the model with multiple languages, or extending the input layer with additional entries and freezing some weights.

Can you not just add dummy characters to train_charset and remove them from test_charset? Unless you need more than 94 characters, this should work. This is essentially what I did.

Agreed, but I am trying to train it for multilingual use, so I have to use all the characters.

PSanni avatar Aug 11 '22 15:08 PSanni

@baudm I have tried to recognize an image containing a full sentence as input. I know that your model currently works at the word level. My question is: can the model be trained on sentence-level input images?

[image: example of a sentence-level input image]

phamkhactu avatar Aug 23 '22 15:08 phamkhactu

Should the training examples have some particular size, or should I vary the resolution? Is it better to add images that only contain cropped text?

airogachev avatar Aug 25 '22 09:08 airogachev

Is there a better way to process cases of multiple languages with the same literals? Let's say I want to fit a model with English and Greek words in the training data. Should I use the same symbol for "o" in English and Greek words, or should I add one more "o"? That is, should the charset contain only characters that differ in visual representation? Does this affect the fitting procedure somehow?

airogachev avatar Aug 25 '22 12:08 airogachev

@baudm I have tried to recognize an image containing a full sentence as input. I know that your model currently works at the word level. My question is: can the model be trained on sentence-level input images?

@phamkhactu yes, the model can be modified to train on long sentences. Off the top of my head, possible approaches are:

  1. Single input: very wide image. Need to adjust img_size and patch_size to accommodate the expected wide images. Possible issue: quadratic increase in compute requirements since MHA is O(n^2).
  2. Multiple inputs: use a sliding window approach. Apply the model to non-overlapping crops of the input. Possible issues: characters at a crop boundary will be cut off, and repeated character detections. (A sketch follows this list.)
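
A rough sketch of the sliding-window variant (my illustration, assuming PIL images and the 128x32 input size):

from PIL import Image

def window_crops(img, win_w=128, height=32):
    # Scale to the model's input height, keeping the aspect ratio.
    scaled_w = max(win_w, round(img.width * height / img.height))
    img = img.resize((scaled_w, height), Image.BICUBIC)
    # Non-overlapping windows; characters straddling a boundary may be cut off.
    return [img.crop((x, 0, min(x + win_w, scaled_w), height))
            for x in range(0, scaled_w, win_w)]

Each crop is then fed to the model and the predictions are concatenated.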

Should the training examples have some particular size, or should I vary the resolution? Is it better to add images that only contain cropped text?

@airogachev STR operates on cropped image inputs. Models in this repo were trained on 128x32 px images.

Is there a better way to process cases of multiple languages with the same literals? Let's say I want to fit a model with English and Greek words in the training data. Should I use the same symbol for "o" in English and Greek words, or should I add one more "o"? That is, should the charset contain only characters that differ in visual representation? Does this affect the fitting procedure somehow?

PARSeq and the other models here are all character-based methods. If the shapes of the characters are roughly the same, e.g. o and ó, it's better to use the same literal for both. In fact, this is the default behavior: Unicode characters are normalized (data.normalize_unicode=True) so that accented characters are converted to their base (ASCII) form (accents and formatting are discarded).
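
Under the hood this behaves roughly like the standard NFKD-then-ASCII fold (a sketch; the repo's exact normalization may differ slightly):

import unicodedata

def to_base_form(label):
    # Decompose accented characters, then drop the combining marks.
    return unicodedata.normalize('NFKD', label).encode('ascii', 'ignore').decode()

print(to_base_form('ó'))     # 'o'
print(to_base_form('café'))  # 'cafe'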

baudm avatar Aug 25 '22 15:08 baudm

Multiple inputs: use a sliding window approach. Apply the model to non-overlapping crops of the input. Possible issues: characters at a crop boundary will be cut off, and repeated character detections.

Perhaps one can use a text detector model to first get word-by-word crops, then run PARSeq on them in a batch. This is typically how such a case is handled in STR, AFAIK.

siddagra avatar Aug 25 '22 15:08 siddagra

Multiple inputs: use a sliding window approach. Apply the model to non-overlapping crops of the input. Possible issues: characters at a crop boundary will be cut off, and repeated character detections.

Perhaps one can use a text detector model to first get word-by-word crops, then run PARSeq on them in a batch. This is typically how such a case is handled in STR, AFAIK.

Yes, in my experiments I found that word-level text detection followed by PARSeq performed best for English and 4 other languages. However, it was not good with non-Latin languages when words are > 2.

PSanni avatar Aug 29 '22 06:08 PSanni

Models in this repo were trained on 128x32 px images.

So you mean ViT, don't you? All the images are cropped, all the crops are processed, and you aggregate embeddings just the way ViT does, right? At this point it looks like the initial image shape doesn't matter. I just noticed that the images in the datasets you used for training have different shapes in different sets, so I wanted to figure out whether some particular pool of shapes exists.

airogachev avatar Aug 29 '22 08:08 airogachev