Fine-tune TrOCR on my own dataset
May I ask how to fine-tune TrOCR on my own dataset? What format of dataset do I need to prepare?
Hi,
Check out my repo containing tutorials for this: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/TrOCR
@NielsRogge Thank you for the excellent work! Is there a tutorial for training on Japanese? https://github.com/microsoft/unilm/issues/612
Hi,
Yes that's possible. What you can do is initialize the weights of the encoder with those of ViT, and the weights of the decoder with those of a Japanese language model from the hub (I filtered on "ja"). You can then define the model as follows:
from transformers import VisionEncoderDecoderModel
# initialize the encoder from a pretrained ViT and the decoder from a pretrained BERT model.
# Note that the cross-attention layers will be randomly initialized, and need to be fine-tuned on a downstream dataset
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
"google/vit-base-patch16-224-in21k", "cl-tohoku/bert-base-japanese-char"
)
You can then follow my notebook for fine-tuning on custom data, assuming you have a collection of (image, text) pairs. Make sure to use the appropriate tokenizer for the decoder.
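For completeness, here is a rough sketch (not from the original notebook) of how the processor and special tokens could be set up for such a custom decoder; depending on your transformers version, the processor argument may be called feature_extractor instead of image_processor:
from transformers import TrOCRProcessor, ViTImageProcessor, AutoTokenizer

# pair the ViT image processor with the tokenizer of the Japanese decoder
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")
processor = TrOCRProcessor(image_processor=image_processor, tokenizer=tokenizer)

# tell the model which special tokens to use for generation and loss computation
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size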
Does it matter if TrOCR only detects words rather than single-line text? Will it affect the results? I want to do custom training for handwritten text in another language with a cropped-word dataset.
@dhea1323 Word-level annotations work pretty well. Just do it.
Thank you for your response @wolfshow. If I want to do fine-tuning for handwriting in Indonesian, what is the best pretrained model to use? And what should I do after setting up my own dataset?
@NielsRogge Where can I set which tokenizer to use in the tutorial? My goal is to train the whole thing in German.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "dbmdz/bert-base-german-cased"
)
Do I only need to add a line like:
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
or are further settings necessary?
And what are the minimum requirements for training data? (The more the better, but do you have experience with what the minimum is?)
Thanks!
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
Yes, you need to use that in order to prepare the labels for the model (as the labels are the input_ids of the text).
I would start with 100 (image, text) pairs, but it could be that you need several hundred or even a few thousand examples.
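As a rough sketch (following the structure of my notebook, with hypothetical text and max_target_length variables), preparing the labels inside the dataset would look like:
# tokenize the target text; the resulting input_ids become the labels
labels = tokenizer(text, padding="max_length", max_length=max_target_length).input_ids
# replace padding token ids by -100 so they are ignored by the cross-entropy loss
labels = [label if label != tokenizer.pad_token_id else -100 for label in labels]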
Okay, I tried to train it with 200 pairs, but the results were unfortunately not very good. Just to double-check: I have only
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
processor.tokenizer = tokenizer
for the processor and:
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "dbmdz/bert-base-german-cased"
)
for the model. So with more (and better) data, the model should work fine, right?
Thanks a lot!
@jonas-da How did your training go? Did the model's performance improve after increasing the data? I am also planning to build a model with German BERT.
@gvlokesh, yes, the training improved quite a bit. Which training data are you planning to use? I struggled a lot with finding/generating a German handwritten dataset. If you have any hints for me, please let me know! I can share my results with you once I have finished my work.
Could you please clarify how to fine-tune using DistributedDataParallel?
@NielsRogge
The easiest way is probably to leverage HuggingFace Accelerate (which uses Distributed Data Parallel behind the scenes, if you're fine-tuning on multiple GPUs).
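As a rough illustration (assuming the model and train_dataloader from the notebook), the native PyTorch training loop could be adapted along these lines and launched with accelerate launch train.py:
import torch
from accelerate import Accelerator

accelerator = Accelerator()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Accelerate places everything on the right devices and wraps the model in
# DistributedDataParallel when launched on multiple GPUs
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    outputs = model(pixel_values=batch["pixel_values"], labels=batch["labels"])
    accelerator.backward(outputs.loss)
    optimizer.step()
    optimizer.zero_grad()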
Hi, I am trying to fine-tune TrOCR with fairseq. Can someone share the dataset format and how to prepare it?
@yashprisma Hello, did you find the dataset format and fine-tune TrOCR successfully? If so, could you share your experience?
I managed to fine-tune on my small Russian-language dataset. But for some reason the results are better when not using Russian decoders (BERT, RoBERTa) as explained in the comments above; the plain "microsoft/trocr-base-handwritten" checkpoint, without any change from the tutorial, gives better results :shrug:
I am currently trying to fine-tune on Bengali, but the results I am getting are very bad. Can you tell me how your results are? And how is "microsoft/trocr-base-handwritten" giving better results in Russian when it was pre-trained on English only? Also, what tokenizer are you using? I believe it is a pre-trained Russian-language tokenizer, is it?
Hello guys!
I'm trying to do the same thing, but when I tried to run the fine-tuning on the Google Colab provided by unilm, an error occurred in the code below:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    fp16=True,
    output_dir="./",
    logging_steps=2,
    save_steps=1000,
    eval_steps=200,
)
The specific error message is:
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
4 frames
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py in _setup_devices(self)
   1785         if not is_sagemaker_mp_enabled():
   1786             if not is_accelerate_available(min_version="0.20.1"):
-> 1787                 raise ImportError(
   1788                     "Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`"
   1789                 )

ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`
What do you think is causing this error? I tried every possible fix, from updating with !pip to creating a .env file as some sources suggested, but the same error still occurred. Any ideas? Thank you for the help!
@danielhermawan02 In Google Colab I had to do this:
!pip install transformers accelerate
@aparij Thank you for the response but it's still not working. Instead, I'm using the PyTorch approach and surprisingly it worked well!
Hi,
Is there any tutorial on how to fine-tune TrOCR with LoRA?
No, but it's basically the same as this notebook, except that you need to swap your model for a PEFT model. In code:
from transformers import VisionEncoderDecoderModel
from peft import LoraConfig

# start from the pre-trained TrOCR checkpoint
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-stage1")

# add LoRA adapters to the query and key projections of the attention layers
lora_config = LoraConfig(
    target_modules=["q_proj", "k_proj"],
    init_lora_weights=False
)
model.add_adapter(lora_config, adapter_name="adapter_1")
However, one would need to check which target_modules to set (by checking the linear layers of the Transformer decoder for instance), and set the learning rate accordingly.
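One quick way to list the candidate modules (a sketch, not part of the notebook) is to print the linear layers of the decoder:
import torch

# print the names of all linear layers in the TrOCR decoder; suffixes such as
# q_proj, k_proj, v_proj, out_proj, fc1 and fc2 are typical LoRA targets
for name, module in model.decoder.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name)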
Notice that if you print the parameters, you will see that they are frozen, except for the adapter weights:
for name, param in model.named_parameters():
print(name, param.requires_grad)
prints (among other things):
decoder.model.decoder.layers.11.encoder_attn.k_proj.base_layer.weight False
decoder.model.decoder.layers.11.encoder_attn.k_proj.base_layer.bias False
decoder.model.decoder.layers.11.encoder_attn.k_proj.lora_A.adapter_1.weight True
decoder.model.decoder.layers.11.encoder_attn.k_proj.lora_B.adapter_1.weight True
decoder.model.decoder.layers.11.encoder_attn.v_proj.weight False
decoder.model.decoder.layers.11.encoder_attn.v_proj.bias False
decoder.model.decoder.layers.11.encoder_attn.q_proj.base_layer.weight False
decoder.model.decoder.layers.11.encoder_attn.q_proj.base_layer.bias False
decoder.model.decoder.layers.11.encoder_attn.q_proj.lora_A.adapter_1.weight True
decoder.model.decoder.layers.11.encoder_attn.q_proj.lora_B.adapter_1.weight True
decoder.model.decoder.layers.11.encoder_attn.out_proj.weight False
decoder.model.decoder.layers.11.encoder_attn.out_proj.bias False
decoder.model.decoder.layers.11.encoder_attn_layer_norm.weight False
See https://huggingface.co/docs/transformers/peft.
I followed your notebook to fine-tune TrOCR on my own data, and it works very well. Then I tried to apply LoRA to it:
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-small-stage1")
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["query", "value"],
)
model = get_peft_model(model, config)
print_trainable_parameters(model)
I got: trainable params: 294912 || all params: 61891584 || trainable %: 0.4764977415992456
But when I train it (in the same way I fine-tuned it without LoRA), I get an error like this:
ValueError: The batch received was empty, your model won't be able to train on it. Double-check that your training dataset contains keys expected by the model: args,kwargs,label_ids,label.
Are there other parts that need to be set up?
Can it be fine-tuned for German receipt images? Btw, with only 100-200 receipt images.
@NielsRogge I've looked through your helpful notebooks for fine-tuning TrOCR base, but I have two questions.
- If I'm fine-tuning the model on handwritten math expressions (custom dataset), which processor am I supposed to use?
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-stage1") model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-stage1")
or
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten") model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-stage1")
- Which parameters would you consider important and worth tuning when doing hyperparameter optimization?
Thanks in advance!
Hi, regarding which processor to use: it doesn't matter, as they both use the same vocabulary and image preprocessing settings.
For hyperparameter optimization, I would definitely experiment with number of epochs, learning rate, warmup schedule for the learning rate, etc.
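Those knobs map directly onto the Seq2SeqTrainingArguments used above; here is a sketch with placeholder values to tune (assumptions, not recommendations):
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    fp16=True,
    num_train_epochs=10,         # how long to train
    learning_rate=5e-5,          # typically the most impactful hyperparameter
    warmup_ratio=0.1,            # fraction of steps used to warm up the learning rate
    lr_scheduler_type="linear",  # decay schedule after warmup
    logging_steps=100,
    save_steps=1000,
    eval_steps=200,
)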