
T5 fine-tuning special tokens

Open tombosc opened this issue 11 months ago • 3 comments

Hello,

First of all, thank you all for your work.

I am struggling to understand how to fine-tune T5.

In #113, it is mentioned that there are two eos tokens (one for the encoder, one for the decoder). However, I can only see one eos token:

(Pdb) tokenizer
T5Tokenizer(name_or_path='Rostlab/prot_t5_xl_uniref50', vocab_size=28, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': [...]
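
For reference, this is how I load the tokenizer (a minimal sketch, assuming from_pretrained with the checkpoint above is the intended way):

from transformers import T5Tokenizer

# Load the ProtT5 tokenizer and inspect its special tokens --
# I only see a single </s> eos token here.
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")
print(tokenizer.special_tokens_map)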

#113 also references another answer from #137, which is strange:

  • no pad token (a problem, because then the first token is not modelled)
  • no eos token at all (a problem in the decoder, because the end-of-sequence token is not modelled)
  • the masked token embeddings have the same ID

There are many other T5 fine-tuning questions in the GitHub issues, I think because the instructions are not clear.

Combining these two contradictory sources, I think the correct way to do it would be (using the example "E V Q L V E S G A E"):

  • Input: E V <extra_id_0> L <extra_id_1> E S G <extra_id_2> E </s>.
  • label: <pad> E V Q L V E S G A E </s>.

Is that how the model was trained? If yes, it would be very helpful to put this on the Hugging Face Hub page.
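
For concreteness, here is how I would build one training example under that reading (just a sketch; the masked positions are arbitrary, and I am not sure whether the <pad> prefix belongs in the labels at all, since transformers shifts the labels right and prepends the decoder start token itself):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")

# Encoder input: spans replaced by <extra_id_*> sentinels.
# Decoder target: the full original sequence.
# This is my guess at the pre-training format, not something I found documented.
src = "E V <extra_id_0> L <extra_id_1> E S G <extra_id_2> E"
tgt = "E V Q L V E S G A E"

# The tokenizer appends </s> by default (add_special_tokens=True).
model_inputs = tokenizer(src, return_tensors="pt")
labels = tokenizer(tgt, return_tensors="pt").input_ids

# When passed as labels to T5ForConditionalGeneration, the decoder input
# (<pad> E V Q ...) is created internally by shifting the labels right,
# so I would not prepend <pad> manually here.
model_inputs["labels"] = labels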

edit: Another question: does the tokenizer include a post-processor? It seems not:

(Pdb) tokenizer.post_processor
*** AttributeError: 'T5Tokenizer' object has no attribute 'post_processor'

Does that mean all those extra tokens need to be added manually, before calling tokenizer()?
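
If there really is no post-processor, I assume I have to splice the sentinels into the string myself before calling the tokenizer, along these lines (hypothetical helper, single-residue spans only, just to illustrate what I mean by "manually"):

import random

def mask_single_residues(residues, mask_prob=0.15, max_sentinels=100):
    # Toy masking: replace random single residues with <extra_id_*> sentinels.
    out, sentinel = [], 0
    for aa in residues:
        if sentinel < max_sentinels and random.random() < mask_prob:
            out.append(f"<extra_id_{sentinel}>")
            sentinel += 1
        else:
            out.append(aa)
    return " ".join(out)

masked_src = mask_single_residues("E V Q L V E S G A E".split())
# e.g. "E V <extra_id_0> L V E S G <extra_id_1> E", which would then go through tokenizer(masked_src)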

tombosc, Nov 01 '24 12:11