
Finetune TrOCR on my own dataset

GivanTsai opened this issue 2 years ago • 26 comments

May I ask how to fine-tune TrOCR on my own dataset? What format of dataset do I need to prepare?

GivanTsai avatar Feb 18 '22 07:02 GivanTsai

Hi,

Check out my repo containing tutorials for this: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/TrOCR

NielsRogge avatar Mar 07 '22 10:03 NielsRogge

@NielsRogge Thank you for the excellent work! Is there a tutorial for training on Japanese? https://github.com/microsoft/unilm/issues/612

GitHub30 avatar Mar 07 '22 15:03 GitHub30

Hi,

Yes that's possible. What you can do is initialize the weights of the encoder with those of ViT, and the weights of the decoder with those of a Japanese language model from the hub (I filtered on "ja"). You can then define the model as follows:

from transformers import VisionEncoderDecoderModel

# initialize the encoder from a pretrained ViT and the decoder from a pretrained BERT model. 
# Note that the cross-attention layers will be randomly initialized, and need to be fine-tuned on a downstream dataset
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "cl-tohoku/bert-base-japanese-char"
)

You can then follow my notebook for fine-tuning on custom data, assuming you have a collection of (image, text) pairs. Make sure to use the appropriate tokenizer for the decoder.
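
For concreteness, here is a minimal sketch of pairing the model above with the matching tokenizer. The exact wiring is an assumption that mirrors the tokenizer-swapping pattern used later in this thread; the TrOCR checkpoint is only reused for its image preprocessing settings.

from transformers import TrOCRProcessor, AutoTokenizer

# Reuse the image preprocessing from a TrOCR checkpoint, but swap in the
# tokenizer that matches the Japanese decoder chosen above.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
processor.tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")

# Align the model's generation-related config with the new tokenizer
# (`model` is the VisionEncoderDecoderModel defined above).
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.eos_token_id = processor.tokenizer.sep_token_id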

NielsRogge avatar Mar 07 '22 16:03 NielsRogge

Does it matter if TrOCR only detects words rather than single-line text? Will it affect the results? I want to do custom training for handwritten text in another language with a cropped-word dataset.

dhea1323 avatar Apr 25 '22 05:04 dhea1323

@dhea1323 Word-level annotations work pretty well. Just do it.

wolfshow avatar Apr 25 '22 06:04 wolfshow

Thank you for your response @wolfshow. If I want to do fine-tuning for handwritten Indonesian, what is the best pretrained model to use? And what should I do after setting up my own dataset?

dhea1323 avatar Apr 26 '22 01:04 dhea1323


@NielsRogge Where can I set which tokenizer to use in the tutorial? My goal is to train the whole thing in German.

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "dbmdz/bert-base-german-cased"
)

Do I need to add only a line like:

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

or are further settings necessary?

And what are the minimum requirements for training data? (The more the better, of course, but do you have any experience with what the minimum is?)

Thanks!

jonas-da avatar May 31 '22 07:05 jonas-da

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

Yes, you need to use that in order to prepare the labels for the model (as the labels are the input_ids of the text).

I would start with 100 (image, text) pairs, but it could be you need several hundreds/thousands of examples.
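
For reference, a minimal sketch of that label preparation, assuming the German tokenizer above (the example sentence and the max length are hypothetical):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

text = "ein Beispielsatz"  # hypothetical ground-truth transcription for one image
labels = tokenizer(text, padding="max_length", max_length=64).input_ids
# Padding tokens should not contribute to the loss, so replace them with -100:
labels = [label if label != tokenizer.pad_token_id else -100 for label in labels]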

NielsRogge avatar May 31 '22 07:05 NielsRogge


Okay, I tried to train it with 200 pairs, but the results were unfortunately not very good. Just to check again, I have only

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
processor.tokenizer = tokenizer

for the processor and:

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "dbmdz/bert-base-german-cased"
)

for the model. So with enough (and better) data, the model should work fine, right?

Thanks a lot!

jonas-da avatar Jul 13 '22 12:07 jonas-da

@jonas-da how did your training go? Did the model performance improve after increasing the data? I am also planning to build a model with German BERT.

gvlokesh avatar Jul 19 '22 06:07 gvlokesh

@gvlokesh, yes, the training improved quite a bit. Which training data are you planning to use? I struggled a lot with finding/generating a German handwritten dataset. If you have any hints for me, please let me know! I can share my results with you when I finish my work.

jonas-da avatar Jul 28 '22 08:07 jonas-da

Could you please clarify how to fine-tune using DistributedDataParallel?

@NielsRogge

HebaGamalElDin avatar Sep 11 '22 22:09 HebaGamalElDin

The easiest way is probably to leverage HuggingFace Accelerate (which uses Distributed Data Parallel behind the scenes, if you're fine-tuning on multiple GPUs).
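
As a rough sketch (assuming `model`, `optimizer`, and `train_dataloader` are built as in the fine-tuning notebook), the plain PyTorch loop only changes in a few places:

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    # batch holds the pixel_values and labels prepared by the processor/tokenizer
    outputs = model(pixel_values=batch["pixel_values"], labels=batch["labels"])
    accelerator.backward(outputs.loss)
    optimizer.step()
    optimizer.zero_grad()

Launching the script with `accelerate launch your_script.py` (the script name is just a placeholder) on a multi-GPU machine then takes care of the DistributedDataParallel wrapping for you.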

NielsRogge avatar Sep 13 '22 07:09 NielsRogge

Hi, I am trying to fine-tune TrOCR with fairseq. Can someone share the dataset format and how to prepare it?

yashprisma avatar Sep 13 '22 10:09 yashprisma

@yashprisma Hello, did you find the dataset format and fine-tune TrOCR successfully? If yes, could you share your experience?

linglongxian avatar Jul 21 '23 05:07 linglongxian

I managed to fine-tune on my small Russian-language dataset. But for some reason the results are better when not using a Russian decoder (BERT, RoBERTa) as explained in the comments above: plain "microsoft/trocr-base-handwritten", with no changes to the tutorial, gives better results :shrug:

aparij avatar Jul 27 '23 13:07 aparij


I am currently trying to fine-tune on Bengali, but the results I am getting are very bad. Can you tell me how your results are? And how is "microsoft/trocr-base-handwritten" giving better results on Russian when it was pre-trained on English only? Also, what tokenizer are you using? I believe it is a pre-trained Russian-language tokenizer, is it?

AnustupOCR avatar Aug 01 '23 06:08 AnustupOCR

Hello guys!
I'm trying to do the same thing, but when I ran the fine-tuning on the Google Colab provided by unilm, an error occurred in the code below:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    fp16=True,
    output_dir="./",
    logging_steps=2,
    save_steps=1000,
    eval_steps=200,
)

The specific error message is:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
in <cell line: 3>()
      1 from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
      2
----> 3 training_args = Seq2SeqTrainingArguments(
      4     predict_with_generate=True,
      5     evaluation_strategy="steps",

4 frames
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py in _setup_devices(self)
   1785         if not is_sagemaker_mp_enabled():
   1786             if not is_accelerate_available(min_version="0.20.1"):
-> 1787                 raise ImportError(
   1788                     "Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U"
   1789                 )

ImportError: Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U

NOTE: If your import is failing due to a missing package, you can manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the "Open Examples" button below.
---------------------------------------------------------------------------

What do you think is causing this error? I tried every possible fix, from updating with !pip to creating a .env file as some sources suggested, but the same error still occurred. Any idea? Thank you for the help!

danielhermawan02 avatar Nov 08 '23 06:11 danielhermawan02

@danielhermawan02 In Google Colab I had to do this:

!pip install transformers accelerate

aparij avatar Nov 08 '23 13:11 aparij

@aparij Thank you for the response, but it's still not working. Instead, I'm using the plain PyTorch approach, and surprisingly it worked well!

danielhermawan02 avatar Nov 13 '23 16:11 danielhermawan02

Hi,

Is there any tutorial on how to fine-tune TrOCR with LoRA?

bustamiyusoef avatar Dec 29 '23 08:12 bustamiyusoef

No, but it's basically the same as this notebook, except that you need to swap your model for a PEFT model. In code:

from transformers import VisionEncoderDecoderModel
from peft import LoraConfig

model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-stage1")

lora_config = LoraConfig(
    target_modules=["q_proj", "k_proj"],
    init_lora_weights=False
)

model.add_adapter(lora_config, adapter_name="adapter_1")

However, one would need to check which target_modules to set (by checking the linear layers of the Transformer decoder for instance), and set the learning rate accordingly.
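
As a quick way to do that check (a sketch, assuming the `model` defined above):

import torch.nn as nn

# List the linear layers of the decoder; the attention projections
# (q_proj, k_proj, v_proj, out_proj) used in the LoRA config above are typical choices.
for name, module in model.decoder.named_modules():
    if isinstance(module, nn.Linear):
        print(name)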

Notice that if you print the parameters, you will see that they are frozen, except for the adapter weights:

for name, param in model.named_parameters():
    print(name, param.requires_grad)

prints (among other things):

decoder.model.decoder.layers.11.encoder_attn.k_proj.base_layer.weight False
decoder.model.decoder.layers.11.encoder_attn.k_proj.base_layer.bias False
decoder.model.decoder.layers.11.encoder_attn.k_proj.lora_A.adapter_1.weight True
decoder.model.decoder.layers.11.encoder_attn.k_proj.lora_B.adapter_1.weight True
decoder.model.decoder.layers.11.encoder_attn.v_proj.weight False
decoder.model.decoder.layers.11.encoder_attn.v_proj.bias False
decoder.model.decoder.layers.11.encoder_attn.q_proj.base_layer.weight False
decoder.model.decoder.layers.11.encoder_attn.q_proj.base_layer.bias False
decoder.model.decoder.layers.11.encoder_attn.q_proj.lora_A.adapter_1.weight True
decoder.model.decoder.layers.11.encoder_attn.q_proj.lora_B.adapter_1.weight True
decoder.model.decoder.layers.11.encoder_attn.out_proj.weight False
decoder.model.decoder.layers.11.encoder_attn.out_proj.bias False
decoder.model.decoder.layers.11.encoder_attn_layer_norm.weight False

See https://huggingface.co/docs/transformers/peft.

NielsRogge avatar Dec 29 '23 09:12 NielsRogge

I followed your notebook to fine-tune TrOCR on my own data, and it worked very well. Then I tried to apply LoRA to it:

model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-small-stage1")
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["query", "value"],
)
model = get_peft_model(model, config)
print_trainable_parameters(model)

I got : trainable params: 294912 || all params: 61891584 || trainable %: 0.4764977415992456

But when I train it (the same way I fine-tuned it without LoRA), I get an error like this:

ValueError: The batch received was empty, your model won't be able to train on it. Double-check that your training dataset contains keys expected by the model: args,kwargs,label_ids,label.

Are there other parts that need to be set up?

bustamiyusoef avatar Dec 29 '23 10:12 bustamiyusoef

Can it be fine-tuned for German receipt images? Btw, I only have 100-200 receipt images.

huilunc avatar Jan 28 '24 12:01 huilunc

@NielsRogge I've looked through your helpful notebooks for fine-tuning TrOCR base, but I have two questions.

  1. If I'm fine-tuning the model on handwritten math expressions (a custom dataset), which processor am I supposed to use?

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-stage1")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-stage1")

or

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-stage1")

  2. Which parameters would you consider important when doing hyperparameter optimization?

thanks in advance!

Mohammedelkhalil avatar Apr 12 '24 03:04 Mohammedelkhalil

Hi, regarding which processor to use, it doesn't matter; they both use the same vocabulary and image preprocessing settings.

For hyperparameter optimization, I would definitely experiment with the number of epochs, the learning rate, the warmup schedule for the learning rate, etc.
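
As a rough starting point (the values below are illustrative assumptions, not recommendations), those knobs map onto the Seq2SeqTrainingArguments used in the notebook like this:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=8,
    num_train_epochs=10,          # number of epochs: try a few values
    learning_rate=5e-5,           # learning rate: sweep e.g. 1e-5 to 1e-4
    warmup_ratio=0.1,             # warmup for the learning rate schedule
    lr_scheduler_type="linear",
    fp16=True,
    logging_steps=100,
    eval_steps=500,
    save_steps=1000,
)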

NielsRogge avatar Apr 12 '24 13:04 NielsRogge