docTTTTTquery

input template to T5 in V2

Open yixuan-qiao opened this issue 2 years ago • 0 comments

Hi,

I'm curious about the input template you use when generating queries in V2. For V1, I found it in convert_msmarco_doc_to_t5_format.py:

segment = doc_title + ' ' + ' '.join(sentences[i:i + args.max_length])

For V2, my guess is that it looks something like this:

segment = doc_title + '\n' + doc_headings + '\n' + ' '.join(sentences[i:i + args.max_length])

When training doc2query-T5, we only use the qrels, where each passage carries no extra information such as doc_title or doc_headings. In the query generation stage, however, we concatenate all of this information for each passage. Does this distribution mismatch between the training and generation inputs affect the final performance? Or is it actually better to include the additional information?
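
To make the comparison concrete, here is a minimal sketch of the two templates as I understand them. The V1 line is the one quoted above; the V2 function is only my guess. The function names, the `stride` parameter, the early `break`, and the default values are illustrative assumptions, not code taken from the repository.

```python
from typing import List


def build_segments_v1(doc_title: str, sentences: List[str],
                      max_length: int = 10, stride: int = 5) -> List[str]:
    """V1 template: title, a space, then a window of sentences.

    `max_length` counts sentences per window; `stride` (assumed here)
    is how many sentences the window advances each step.
    """
    segments = []
    for i in range(0, len(sentences), stride):
        segments.append(doc_title + ' ' + ' '.join(sentences[i:i + max_length]))
        # Stop once the window has reached the end of the document.
        if i + max_length >= len(sentences):
            break
    return segments


def build_segments_v2_guess(doc_title: str, doc_headings: str,
                            sentences: List[str],
                            max_length: int = 10, stride: int = 5) -> List[str]:
    """Guessed V2 template: title, headings, and sentence window,
    separated by newlines. This is a hypothesis, not confirmed behavior.
    """
    segments = []
    for i in range(0, len(sentences), stride):
        window = ' '.join(sentences[i:i + max_length])
        segments.append(doc_title + '\n' + doc_headings + '\n' + window)
        if i + max_length >= len(sentences):
            break
    return segments
```

With the guessed V2 template, every generated segment starts with the title and headings, while the training passages from the qrels never do. That difference is the mismatch I am asking about.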

yixuan-qiao · Jun 01 '22 03:06