docTTTTTquery
Input template to T5 in V2
Hi,
I'm curious about the input template you use when generating the queries in V2. In V1, I found it in convert_msmarco_doc_to_t5_format.py:
segment = doc_title + ' ' + ' '.join(sentences[i:i + args.max_length])
For V2, I'm guessing it looks something like the following:
segment = doc_title + '\n' + doc_headings + '\n' + ' '.join(sentences[i:i + args.max_length])
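To make the comparison concrete, here is a minimal sketch of the two templates as I understand them. The function name `build_segments`, the sliding-window loop, and the default `max_length` and `stride` values are my own assumptions for illustration, and the V2 branch is only my guess, not something I have confirmed in the repo:

```python
from typing import List, Optional


def build_segments(doc_title: str,
                   sentences: List[str],
                   max_length: int = 10,
                   stride: int = 5,
                   doc_headings: Optional[str] = None) -> List[str]:
    """Hypothetical helper: slide a window over the sentences and
    prepend the document-level fields to every segment."""
    segments = []
    for i in range(0, len(sentences), stride):
        body = ' '.join(sentences[i:i + max_length])
        if doc_headings is None:
            # V1 template, as in convert_msmarco_doc_to_t5_format.py
            segment = doc_title + ' ' + body
        else:
            # Guessed V2 template (unconfirmed): title, headings, body
            segment = doc_title + '\n' + doc_headings + '\n' + body
        segments.append(segment)
        # Stop once the window has reached the last sentence
        if i + max_length >= len(sentences):
            break
    return segments
```

Calling `build_segments(title, sentences)` gives the V1-style input, while passing `doc_headings=...` gives the V2-style input I am asking about.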
When training doc2query-T5, we only use the qrels, where each passage has no extra information such as doc_title or doc_headings. At query generation time, however, we concatenate all of this information for each passage. Does this distribution mismatch between training and generation hurt the final performance? Or is it actually better to include the additional information?