FlagEmbedding
Tokenizer question about the reranker training code
https://github.com/FlagOpen/FlagEmbedding/blob/bd38bd350054d0dba39ea8d602afac1fab141b35/FlagEmbedding/reranker/data.py#L42
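The code sets padding=False:
    item = self.tokenizer.encode_plus(
        qry_encoding,
        doc_encoding,
        truncation=True,
        max_length=self.args.max_len,
        padding=False,
    )
But the help text for the max_len argument says that shorter sequences will be padded. So is padding actually applied during training?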
    max_len: int = field(
        default=512,
        metadata={
            "help": "The maximum total input sequence length after tokenization for input text. Sequences longer "
                    "than this will be truncated, sequences shorter will be padded."
        },
    )
https://github.com/FlagOpen/FlagEmbedding/blob/bd38bd350054d0dba39ea8d602afac1fab141b35/FlagEmbedding/reranker/data.py#L68 Padding is applied in GroupCollator, so batches are padded at collation time rather than in the tokenizer call.
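To make the split between tokenization and padding concrete, here is a minimal pure-Python sketch of batch-time (dynamic) padding. It is a simplified stand-in, not the repository's GroupCollator: the helper names `flatten_groups` and `pad_batch`, the pad token id of 0, and the sample token ids are all assumptions chosen for illustration.

```python
from typing import Dict, List

Features = Dict[str, List[int]]


def flatten_groups(groups: List[List[Features]]) -> List[Features]:
    """Flatten per-query groups (e.g. one positive plus negatives) into one list."""
    return [item for group in groups for item in group]


def pad_batch(features: List[Features], pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in this batch (hypothetical helper)."""
    longest = max(len(f["input_ids"]) for f in features)
    input_ids, attention_mask = [], []
    for f in features:
        ids = f["input_ids"]
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding positions.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded encodings, as a dataset would return them when padding=False.
groups = [
    [{"input_ids": [101, 7, 8, 102]}, {"input_ids": [101, 7, 102]}],
    [{"input_ids": [101, 9, 10, 11, 12, 102]}],
]
batch = pad_batch(flatten_groups(groups))
```

Padding at this stage means each batch is only as wide as its longest sequence, while max_len still caps the length through truncation in the tokenizer call.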