NQ320k preprocessing?

Open kisozinov opened this issue 10 months ago • 4 comments

Hi, I tried to reproduce the results of your experiments on the NQ320k dataset, as reported in the table from your paper.

To do this, I referred to the script from your old repository, but I ran into a problem: after changing only NUM_TRAIN=307000 and NUM_EVAL=7000, the script terminates midway (it stops at ~107000), probably because of repeated titles.

for ind in rand_inds:
    title = data[ind]['document']['title']  # we use title as the doc identifier to prevent two docs have the same text
    if title not in title_set:
        title_set.add(title)
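To make my guess concrete, here is a minimal sketch of the same title-deduplication loop. It fails loudly when there are fewer unique titles than NUM_TRAIN + NUM_EVAL instead of stopping silently. The function name `split_unique_titles`, the `seed` parameter, and the error message are my own illustration and are not taken from your script. I am only assuming that `data` is a list of records shaped like `data[ind]['document']['title']`, as in the snippet above.

```python
import random


def split_unique_titles(data, num_train, num_eval, seed=0):
    """Pick num_train + num_eval record indices with distinct titles.

    Hypothetical helper: raises ValueError if the data does not contain
    enough unique titles to fill both splits.
    """
    rng = random.Random(seed)
    rand_inds = list(range(len(data)))
    rng.shuffle(rand_inds)

    needed = num_train + num_eval
    title_set = set()
    picked = []
    for ind in rand_inds:
        title = data[ind]['document']['title']
        if title not in title_set:  # skip records whose title was already seen
            title_set.add(title)
            picked.append(ind)
            if len(picked) == needed:
                break

    if len(picked) < needed:
        raise ValueError(
            f"only {len(picked)} unique titles available, "
            f"but {needed} were requested"
        )
    # First num_train indices form the train split, the rest the eval split.
    return picked[:num_train], picked[num_train:]
```

If the split really does run out of unique titles, a check like this would report how many are available, which would confirm or rule out my explanation for the early stop.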

Hence my question: what script or settings (train/val split) did you use to process NQ320k?

kisozinov, Apr 22 '24 07:04