
ASCII-only output during training

Open TheSeriousProgrammer opened this issue 2 years ago • 8 comments

Big fan of your work here

I noticed that in the initial steps of training, Donut was outputting some Japanese and Chinese characters, which is ideal for multilingual information extraction. But I was wondering whether we can limit Donut's output characters to ASCII only, so that training is smoother and inference is faster, since there are fewer characters to classify over (when dealing with English-only scenarios).

TheSeriousProgrammer avatar Nov 01 '22 07:11 TheSeriousProgrammer

Hi, thank you for your interest in our work :) Let me give a quick/short answer first -> Yes, it would be possible by removing the unnecessary tokens from the tokenizer's vocabulary (this will not require any additional training).

gwkrsrch avatar Nov 01 '22 07:11 gwkrsrch
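A minimal sketch of the vocabulary inspection suggested above, assuming the decoder tokenizer is the "hyunwoongko/asian-bart-ecjk" checkpoint loaded with XLMRobertaTokenizer, as in the donut/model.py lines linked in the next comment. It only lists the non-ASCII token ids, which is the first step whether the vocabulary is pruned or the tokens are blocked at generation time:

from transformers import XLMRobertaTokenizer

# Load the default Donut decoder tokenizer checkpoint (an assumption based on
# the donut/model.py lines linked below).
tokenizer = XLMRobertaTokenizer.from_pretrained("hyunwoongko/asian-bart-ecjk")

# Collect every token id whose surface form contains non-ASCII characters.
# "▁" is the sentencepiece word-start marker, not real text, so strip it first;
# special tokens (<s>, </s>, <pad>, ...) are kept.
non_ascii_ids = [
    tok_id
    for tok, tok_id in tokenizer.get_vocab().items()
    if tok_id not in tokenizer.all_special_ids
    and not tok.replace("▁", "").isascii()
]
print(f"{len(non_ascii_ids)} of {len(tokenizer)} tokens contain non-ASCII characters")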

https://github.com/clovaai/donut/blob/e6623ad56c0e9f12a426dab2d8b2d65a39d64689/donut/model.py#L159-L161
Can I change the pretrained tokenizer from "hyunwoongko/asian-bart-ecjk" to "hyunwoongko/asian-bart-en"? The latter is an English-only decoder from the same repo; would that do the trick? I am asking because I am not sure how to change the vocabulary.

TheSeriousProgrammer avatar Nov 01 '22 08:11 TheSeriousProgrammer

Yes, you can do that by writing "hyunwoongko/asian-bart-en". You can check this repo to see how to specify other models: https://github.com/hyunwoongko/asian-bart

josianem avatar Nov 02 '22 08:11 josianem
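For reference, a minimal sketch of that swap, assuming the English-only checkpoint can be loaded with the same tokenizer class that the linked donut/model.py lines use for the ECJK one:

from transformers import XLMRobertaTokenizer

# English-only checkpoint from https://github.com/hyunwoongko/asian-bart;
# in donut/model.py this name would replace "hyunwoongko/asian-bart-ecjk".
tokenizer = XLMRobertaTokenizer.from_pretrained("hyunwoongko/asian-bart-en")
print(len(tokenizer))  # a much smaller vocabulary than the ECJK checkpoint

Note that swapping only the tokenizer name leaves the pretrained decoder's embedding matrix sized and indexed for the original ECJK vocabulary; as gwkrsrch notes further below, the embedding weights would need to be updated to match, or the unwanted tokens blocked at generation time instead.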

Sure, will give that a shot and let you know

TheSeriousProgrammer avatar Nov 02 '22 08:11 TheSeriousProgrammer

Tried the above-mentioned change but still observed other-language characters in predictions during intermediate epochs:

Prediction: Examination Examination Generation: Generation: General Examination: GENERAL EXamination: GENERAL APPEANANACE normal, pleasant, pleasant, well nourished, well developed, in no acute distress. اساسي, no suspicious lesions, normal, no rashes。讓他們, in no suspicious lesions, in no rashes。讓他們, in no suspirum, in no europeiskal評估,讓他們,讓他們,讓他們,讓他們 [...「讓他們」 repeated for the rest of the sequence]
Answer: <s_name>GAMBOA DELFINO</s_name><s_dob>12/24/1946</s_dob><s_account number>12727</s_account number><s_reason>1 MED REFILL</s_reason><s_assesments>Essential (primary) hypertension - I10 (Primary)<sep/>Unspecified atrial fibrillation - I48.91<sep/>Tachycardia, unspecified - R00.0<sep/>Impaired fasting glucose - R73.01</s_assesments>
Normed ED: 0.862521891418564

TheSeriousProgrammer avatar Nov 08 '22 06:11 TheSeriousProgrammer

I think there are many options to implement this feature. The first is to remove the unnecessary tokens from the vocabulary. For this, you should update the tokenizer's vocabulary and the corresponding embedding weights in the model (a.k.a. token/word embeddings). Or, more simply, using the bad_words_ids feature of generate might be easier: listing all undesirable token ids in bad_words_ids will prevent the issue. I hope this comment is helpful to you :)

gwkrsrch avatar Nov 11 '22 07:11 gwkrsrch
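A minimal sketch of the bad_words_ids route described above, again assuming the default "hyunwoongko/asian-bart-ecjk" tokenizer. generate expects a list of banned token-id sequences, so each unwanted id is wrapped in its own single-element list; the resulting list would then be passed as bad_words_ids to a generate call like the one shown in the comment below:

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("hyunwoongko/asian-bart-ecjk")

# Ban every token whose text contains non-ASCII characters (strip the
# sentencepiece "▁" word-start marker first, keep the special tokens).
bad_words_ids = [
    [tok_id]
    for tok, tok_id in tokenizer.get_vocab().items()
    if tok_id not in tokenizer.all_special_ids
    and not tok.replace("▁", "").isascii()
]
# ... then: model.decoder.model.generate(..., bad_words_ids=bad_words_ids, ...)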

Will try it out, thanks for the tip!

TheSeriousProgrammer avatar Nov 17 '22 02:11 TheSeriousProgrammer

@gwkrsrch Thank you for the great work. I have a follow-up enquiry regarding the above issue. Currently, the decoder of the model seems to generate arbitrary repetitions of the cls_token (i.e. <s>) while training on the DocVQA task. As such, I tried adding the cls_token to bad_words_ids, but the model continued to generate <s> in between its <s_answer> and </s_answer> tokens.

decoder_output = self.decoder.model.generate(
    decoder_input_ids=prompt_tensors,
    encoder_outputs=encoder_outputs,
    max_length=self.config.max_length,
    early_stopping=True,
    pad_token_id=self.decoder.tokenizer.pad_token_id,
    eos_token_id=self.decoder.tokenizer.eos_token_id,
    use_cache=True,
    num_beams=1,
    bad_words_ids=[
        [self.decoder.tokenizer.unk_token_id],
        [self.decoder.tokenizer.cls_token_id],
    ],
    return_dict_in_generate=True,
    output_attentions=return_attentions,
)

Any advice / clarification on using bad_words_ids for the Donut use case would be much appreciated.

This has been resolved, thank you.

mckhang avatar Jan 24 '23 09:01 mckhang