
T5 fine-tuning special tokens

Open tombosc opened this issue 11 months ago • 3 comments

Hello,

First of all, thank you all for your work.

I am struggling to understand how to fine-tune T5.

In #113, it is mentioned that there are two eos tokens (one for the encoder, one for the decoder). However, I can only see one eos token:

(Pdb) tokenizer
T5Tokenizer(name_or_path='Rostlab/prot_t5_xl_uniref50', vocab_size=28, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': [...]
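
For reference, this is how I load the tokenizer (a minimal sketch, assuming from_pretrained with the checkpoint above is the intended way):

from transformers import T5Tokenizer

# Load the ProtT5 tokenizer and inspect its special tokens --
# I only see a single </s> eos token here.
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")
print(tokenizer.special_tokens_map)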

#113 also references another answer from #137, which is strange:

  • no pad token (a problem, because then the first token is not modelled)
  • no eos token at all (a problem in the decoder, because the end-of-sequence token is not modelled)
  • the masked token embeddings have the same ID

There are many other T5 fine-tuning questions in the GitHub issues, I think because the instructions are not clear.

Combining these two contradictory sources, I think the correct way to do it would be (using the example "E V Q L V E S G A E"):

  • Input: E V <extra_id_0> L <extra_id_1> E S G <extra_id_2> E </s>.
  • label: <pad> E V Q L V E S G A E </s>.

Is that how the model was trained? If yes, it would be very helpful to put this on the Hugging Face Hub page.
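
For concreteness, here is how I would build one training example under that reading (just a sketch; the masked positions are arbitrary, and I am not sure whether the <pad> prefix belongs in the labels at all, since transformers shifts the labels right and prepends the decoder start token itself):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")

# Encoder input: spans replaced by <extra_id_*> sentinels.
# Decoder target: the full original sequence.
# This is my guess at the pre-training format, not something I found documented.
src = "E V <extra_id_0> L <extra_id_1> E S G <extra_id_2> E"
tgt = "E V Q L V E S G A E"

# The tokenizer appends </s> by default (add_special_tokens=True).
model_inputs = tokenizer(src, return_tensors="pt")
labels = tokenizer(tgt, return_tensors="pt").input_ids

# When passed as labels to T5ForConditionalGeneration, the decoder input
# (<pad> E V Q ...) is created internally by shifting the labels right,
# so I would not prepend <pad> manually here.
model_inputs["labels"] = labels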

edit: Another question: does the tokenizer include a post-processor? It seems not:

(Pdb) tokenizer.post_processor
*** AttributeError: 'T5Tokenizer' object has no attribute 'post_processor'

Does that mean all those extra tokens need to be added manually, before calling tokenizer()?
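
If there really is no post-processor, I assume I have to splice the sentinels into the string myself before calling the tokenizer, along these lines (hypothetical helper, single-residue spans only, just to illustrate what I mean by "manually"):

import random

def mask_single_residues(residues, mask_prob=0.15, max_sentinels=100):
    # Toy masking: replace random single residues with <extra_id_*> sentinels.
    out, sentinel = [], 0
    for aa in residues:
        if sentinel < max_sentinels and random.random() < mask_prob:
            out.append(f"<extra_id_{sentinel}>")
            sentinel += 1
        else:
            out.append(aa)
    return " ".join(out)

masked_src = mask_single_residues("E V Q L V E S G A E".split())
# e.g. "E V <extra_id_0> L V E S G <extra_id_1> E", which would then go through tokenizer(masked_src)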

tombosc, Nov 01 '24 12:11