ASCII only output during training
Big fan of your work here!
I noticed that in the initial steps of training, donut was outputting some Japanese and Chinese characters, which is ideal for multilingual information extraction. But I was wondering: can we limit the output characters of donut to ASCII only, so that training will be much smoother and inference time will also be lower, since there are fewer characters to classify over (when dealing with English-only scenarios)?
Hi, thank you for your interest in our work :) Let me give a quick/short answer first -> Yes, it would be possible by removing unnecessary tokens from the vocabulary of the tokenizer (this will not require any additional training).
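As a minimal sketch of what "unnecessary tokens" could mean in practice, assuming the HuggingFace XLMRobertaTokenizer that donut's model.py loads for its decoder, the non-ASCII vocabulary entries can be enumerated like this:

```python
# Minimal sketch: enumerate vocabulary entries that contain non-ASCII
# characters, as candidates for removal. Assumes the HuggingFace
# XLMRobertaTokenizer used by donut's decoder.
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("hyunwoongko/asian-bart-ecjk")

# "▁" is the SentencePiece word-boundary marker, not a real character,
# so strip it before testing whether a token is ASCII-only.
non_ascii_ids = sorted(
    token_id
    for token, token_id in tokenizer.get_vocab().items()
    if not token.replace("▁", "").isascii()
)
print(f"{len(non_ascii_ids)} of {len(tokenizer)} tokens contain non-ASCII characters")
```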
https://github.com/clovaai/donut/blob/e6623ad56c0e9f12a426dab2d8b2d65a39d64689/donut/model.py#L159-L161
Can I change the pretrained tokenizer from "hyunwoongko/asian-bart-ecjk" to "hyunwoongko/asian-bart-en"? The latter one is an English-only decoder from the same repo; would that do the trick? Because I am not sure how to change the vocabulary.
Yes, you can do that by writing "hyunwoongko/asian-bart-en". You can check this repo to see how to specify other models: https://github.com/hyunwoongko/asian-bart
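Assuming the tokenizer is created via from_pretrained on the linked lines of model.py, the change amounts to a one-string swap (a sketch, not verified against every donut version):

```python
# In donut/model.py (the lines linked above), swap the pretrained name.
# Before: multilingual English/Chinese/Japanese/Korean vocabulary.
tokenizer = XLMRobertaTokenizer.from_pretrained("hyunwoongko/asian-bart-ecjk")
# After: English-only vocabulary from the same asian-bart repository.
tokenizer = XLMRobertaTokenizer.from_pretrained("hyunwoongko/asian-bart-en")
```

Note that a decoder checkpoint pretrained with the ecjk vocabulary still carries embedding rows aligned to the old token ids, which may be why swapping the tokenizer alone does not eliminate non-ASCII outputs.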
Sure, will give that a shot and let you know
Tried the above-mentioned change but still observed other-language characters in predictions during intermediate epochs:
Prediction: Examination Examination Generation: Generation: General Examination: GENERAL EXamination: GENERAL APPEANANACE normal, pleasant, pleasant, well nourished, well developed, in no acute distress. اساسي, no suspicious lesions, normal, no rashes。讓他們, in no suspicious lesions, in no rashes。讓他們, in no suspirum, in no europeiskal評估,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們,讓他們
Answer: <s_name>GAMBOA DELFINO</s_name><s_dob>12/24/1946</s_dob><s_account number>12727</s_account number><s_reason>1 MED REFILL</s_reason><s_assesments>Essential (primary) hypertension - I10 (Primary)<sep/>Unspecified atrial fibrillation - I48.91<sep/>Tachycardia, unspecified - R00.0<sep/>Impaired fasting glucose - R73.01</s_assesments>
Normed ED: 0.862521891418564
I think there are many options to implement this feature. The first one is to remove unnecessary tokens from the vocabulary. For this, you should update the vocabulary of the tokenizer and the corresponding embedding weights in the model (a.k.a. token/word embeddings). Or, more simply, using the bad_words_ids feature in generate might be easier. Listing all undesirable token ids in bad_words_ids will prevent the issue. I hope this comment is helpful to you :)
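As a minimal sketch of the bad_words_ids route (here tokenizer, model, prompt_tensors, and encoder_outputs are placeholder names, not donut's exact internals), reusing the non-ASCII scan from earlier:

```python
# Sketch: ban every token containing non-ASCII characters at decode time
# via bad_words_ids -- no retraining needed. bad_words_ids expects a list
# of token-id sequences; a single-token ban is a one-element list.
bad_words_ids = [
    [token_id]
    for token, token_id in tokenizer.get_vocab().items()
    if not token.replace("▁", "").isascii()  # "▁" = SentencePiece word marker
]

decoder_output = model.decoder.model.generate(
    decoder_input_ids=prompt_tensors,   # task prompt tokens, as in donut's inference code
    encoder_outputs=encoder_outputs,    # visual encoder features
    max_length=model.config.max_length,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
    bad_words_ids=bad_words_ids,
)
```

Banning tens of thousands of single-token entries works, but it adds per-step overhead inside generate; pruning the vocabulary and the matching embedding rows avoids that cost at the price of a one-time model-surgery step.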
Will try it out, thanks for the tip!
@gwkrsrch Thank you for the great work. I have a follow-up enquiry regarding the above issue.
Currently, the decoder of the model seems to generate arbitrary repetitions of the cls_token (i.e. <s>) while training on the DocVQA task.
As such, I have tried adding the cls_token id to bad_words_ids, but the model continued to generate <s> in between its <s_answer> and </s_answer> tokens:
decoder_output = self.decoder.model.generate(
    decoder_input_ids=prompt_tensors,
    encoder_outputs=encoder_outputs,
    max_length=self.config.max_length,
    early_stopping=True,
    pad_token_id=self.decoder.tokenizer.pad_token_id,
    eos_token_id=self.decoder.tokenizer.eos_token_id,
    use_cache=True,
    num_beams=1,
    bad_words_ids=[[self.decoder.tokenizer.unk_token_id],
                   [self.decoder.tokenizer.cls_token_id]],
    return_dict_in_generate=True,
    output_attentions=return_attentions,
)
Any advice / clarification on using bad_words_ids for the donut use case would be much appreciated.

Update: this has been resolved, thank you.
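For anyone who lands here with the same symptom, a hypothetical debugging aid: bad_words_ids only blocks the exact ids it is given, so it is worth decoding the generated ids to confirm that the repeated <s> really is cls_token_id (it could instead be bos_token_id or a distinct added token):

```python
# Hypothetical check, reusing the names from the generate call above:
# print each generated id next to its token string to see which id
# actually produces the stray "<s>".
tokenizer = self.decoder.tokenizer
for token_id in decoder_output.sequences[0].tolist():
    print(token_id, repr(tokenizer.convert_ids_to_tokens(token_id)))
print("cls_token_id:", tokenizer.cls_token_id, "bos_token_id:", tokenizer.bos_token_id)
```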