                        T5 fine-tuning special tokens
Hello,
First of all, thank you all for your work.
I am struggling to understand how to fine-tune T5.
In #113, it is mentioned that there are two eos tokens (one for the encoder, one for the decoder). However, I can only see one eos token:
(Pdb) tokenizer
T5Tokenizer(name_or_path='Rostlab/prot_t5_xl_uniref50', vocab_size=28, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': [...]
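For reference, this is roughly how I inspected it (nothing beyond the standard transformers API, so please correct me if this is not the intended way to load the tokenizer):

from transformers import T5Tokenizer

# Load the tokenizer and list its special tokens; I only see a single </s> eos
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")
print(tokenizer.special_tokens_map)
print(tokenizer.eos_token_id, tokenizer.pad_token_id)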
#113 also references another answer in #137, which is strange:
- no pad token (problem, because then the first token is not modelled)
- no eos token at all (problem in the decoder, because end of sequence token is not modelled)
- the masked token embeddings have the same ID
There are many other T5 fine-tuning questions in the GitHub issues, I think because the instructions are not clear.
Combining these two contradictory sources, I think the correct way to do it would be (using the example "E V Q L V E S G A E"):
- Input: E V <extra_id_0> L <extra_id_1> E S G <extra_id_2> E </s>.
- label: <pad> E V Q L V E S G A E </s>.
Is that how the model was trained? If yes, it would be very helpful to put this on the Hugging Face Hub page.
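For concreteness, this is roughly how I would build the tensors if my guess above is right (the sentinel placement is my own assumption, not something confirmed anywhere):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")

# Corrupted input with sentinel tokens inserted by hand (my guess, not confirmed)
src = "E V <extra_id_0> L <extra_id_1> E S G <extra_id_2> E"
tgt = "E V Q L V E S G A E"

enc = tokenizer(src, return_tensors="pt")                # should append </s>
labels = tokenizer(tgt, return_tensors="pt").input_ids   # should also end in </s>

# If I pass only `labels`, T5ForConditionalGeneration shifts them right itself,
# using <pad> as the decoder start token, so the leading <pad> would not be
# something I add by hand. Is that consistent with how the model was pre-trained?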
Edit: another question: does the tokenizer include a post-processor? It seems not:
(Pdb) tokenizer.post_processor
*** AttributeError: 'T5Tokenizer' object has no attribute 'post_processor'
Does that mean all those extra tokens need to be added manually, before calling tokenizer()?
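For what it's worth, this is the check I would run to see whether </s> is still appended automatically even though the slow tokenizer has no post_processor attribute (again just the standard tokenizer call, nothing ProtTrans-specific):

from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")

# With add_special_tokens=True (the default), does encode() already end in </s>?
ids = tok("E V Q L").input_ids
print(tok.convert_ids_to_tokens(ids))
# If the last token is '</s>', only the sentinel / <pad> handling would be manual.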