
Tokenizer question about the reranker model training code

Open · NLPJCL opened this issue 1 year ago · 1 comment

https://github.com/FlagOpen/FlagEmbedding/blob/bd38bd350054d0dba39ea8d602afac1fab141b35/FlagEmbedding/reranker/data.py#L42

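The code sets padding=False:

item = self.tokenizer.encode_plus(
    qry_encoding,
    doc_encoding,
    truncation=True,
    max_length=self.args.max_len,
    padding=False,
)

However, the help text for the max_len argument says that shorter sequences will be padded. So is padding actually applied during training?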

max_len: int = field(
    default=512,
    metadata={
        "help": "The maximum total input sequence length after tokenization for input text. Sequences longer "
                "than this will be truncated, sequences shorter will be padded."
    },
)

NLPJCL · Feb 22 '24 02:02

https://github.com/FlagOpen/FlagEmbedding/blob/bd38bd350054d0dba39ea8d602afac1fab141b35/FlagEmbedding/reranker/data.py#L68 Padding is performed in GroupCollator.

staoxiao · Feb 22 '24 09:02
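
To make the answer concrete: tokenizing with padding=False and padding later in the collator means each batch is padded only to its own longest sequence, not to max_len. The sketch below illustrates that idea in plain Python. It is not the repository's GroupCollator; the function name pad_batch, the pad token id 0, and the example token ids are all assumptions made for illustration.

```python
from typing import Dict, List


def pad_batch(
    features: List[Dict[str, List[int]]],
    pad_token_id: int = 0,  # assumed pad id; the real value comes from the tokenizer
) -> Dict[str, List[List[int]]]:
    """Hypothetical helper: pad unpadded examples to the longest one in the batch."""
    longest = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        n_pad = longest - len(ids)
        # Right-pad the token ids and mark padded positions with 0 in the mask.
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch


# Two examples of different lengths, as the dataset would return with padding=False.
features = [
    {"input_ids": [101, 7, 8, 102]},
    {"input_ids": [101, 9, 102, 10, 11, 12, 102]},
]
batch = pad_batch(features)
print(batch["attention_mask"])
```

Under this scheme, truncation to max_len still happens at tokenization time, while padding is deferred until examples are grouped into a batch, which avoids padding every example to the full 512 tokens.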