DSI-QG
NQ320k preprocessing?
Hi, I tried to reproduce the results of your experiments on the NQ320k dataset, as reported in the table in your paper.
To do this, I used the script from your old repository, but I ran into a problem: after simply changing NUM_TRAIN=307000 and NUM_EVAL=7000, the script terminates midway (it stops at ~107000), probably because of repeated titles:
for ind in rand_inds:
    title = data[ind]['document']['title']  # we use title as the doc identifier to prevent two docs have the same text
    if title not in title_set:
        title_set.add(title)
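To illustrate why I suspect the repeated titles, here is a minimal sketch of how the number of unique titles could be checked. It assumes `data` is a list of records with the same `['document']['title']` layout as in the snippet above; the function name `count_unique_titles` and the toy records are hypothetical and not part of the original script.

```python
from collections import Counter


def count_unique_titles(data):
    """Return (number of unique titles, number of records whose title repeats an earlier one)."""
    counts = Counter(record['document']['title'] for record in data)
    unique = len(counts)
    repeated = sum(counts.values()) - unique
    return unique, repeated


# Toy records standing in for the real dataset (hypothetical values).
data = [
    {'document': {'title': 'A'}},
    {'document': {'title': 'B'}},
    {'document': {'title': 'A'}},
]
unique, repeated = count_unique_titles(data)
print(unique, repeated)
```

If the number of unique titles is smaller than the number of documents requested, a loop that skips repeated titles would run out of candidates before reaching the target, which would match the early stop I see.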
Hence my question: which script or settings (train/val split) did you use to process NQ320k?