COIL
Did you remove punctuation before computing the document score?
ColBERT removes punctuation from documents because its authors consider it uninformative. Did you also remove punctuation when computing the overlapping tokens between query and document?
BTW, I think keeping punctuation in both the query and the document would result in very long posting lists.
The current code does not treat punctuation specially.
In the current evaluation query sets, queries typically contain no punctuation, so keeping punctuation in documents has little empirical effect on scores or processing speed: the inverted lists for punctuation tokens are rarely traversed.
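To illustrate why unqueried posting lists cost nothing at search time, here is a toy sketch of exact-match scoring over an inverted index. It is not the repository's code: the function names (`build_index`, `score`), the 2-dimensional vectors, and the sum-of-max-dot-product scoring are assumptions made for illustration only.

```python
from collections import defaultdict


def build_index(docs):
    """Map each token to its posting list of (doc_id, vector) pairs."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for token, vec in tokens:
            index[token].append((doc_id, vec))
    return index


def score(query, index):
    """For each query token, add the best dot product among the
    matching document tokens. Only posting lists of tokens that
    occur in the query are ever read."""
    scores = defaultdict(float)
    traversed = []
    for q_token, q_vec in query:
        traversed.append(q_token)
        best = {}
        # index.get avoids creating empty lists for unseen tokens.
        for doc_id, d_vec in index.get(q_token, []):
            sim = sum(a * b for a, b in zip(q_vec, d_vec))
            if doc_id not in best or sim > best[doc_id]:
                best[doc_id] = sim
        for doc_id, sim in best.items():
            scores[doc_id] += sim
    return dict(scores), traversed


# Hypothetical toy data: both documents contain punctuation tokens.
docs = {
    "d1": [("coil", [1.0, 0.0]), (".", [0.5, 0.5])],
    "d2": [("coil", [0.0, 1.0]), (",", [1.0, 1.0])],
}
index = build_index(docs)

# The query has no punctuation, so the "." and "," lists are never touched.
query = [("coil", [1.0, 0.0])]
scores, traversed = score(query, index)
print(scores, traversed)
```

In this sketch the punctuation posting lists still take up index space, but they add no work at query time unless the query itself contains a punctuation token.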
OK, thank you. I also wonder how you get your 7 negative samples: are they just randomly sampled from the negatives collected from the triple file?