Did you remove punctuation before computing the document score?

Open namespace-Pt opened this issue 3 years ago • 3 comments

ColBERT removes punctuation from documents because its authors consider it useless. Did you also remove punctuation when computing the overlapping tokens between query and document?

namespace-Pt avatar Dec 23 '21 11:12 namespace-Pt

BTW, I think keeping punctuation in both the query and the document would result in overly long posting lists.

namespace-Pt avatar Dec 23 '21 11:12 namespace-Pt

The current code does not give punctuation any special treatment.

In the current evaluation query sets, queries typically contain no punctuation, so keeping punctuation in documents has little empirical effect on scores or processing speed: the inverted lists for punctuation tokens are rarely traversed.
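To illustrate the point about inverted lists, here is a toy sketch of exact-match scoring over a token-level inverted index. It is not the repository's code: `build_index`, `score`, the two-dimensional vectors, and the sample documents are all hypothetical, and the max-over-matching-tokens scoring is only assumed to resemble the real implementation. It shows that a posting list is touched only when its token appears in the query, so punctuation lists can exist in the index without being read.

```python
from collections import defaultdict


def build_index(docs):
    """Map each token to a posting list of (doc_id, vector) pairs."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for token, vec in tokens:
            index[token].append((doc_id, vec))
    return index


def score(query, index):
    """Score documents by exact token match.

    For each query token, take the best dot product among the document's
    occurrences of that same token, then sum over query tokens.
    Also returns the set of tokens whose posting lists were traversed.
    """
    scores = defaultdict(float)
    traversed = set()
    for q_token, q_vec in query:
        postings = index.get(q_token)
        if postings is None:
            continue  # token never occurs in the collection
        traversed.add(q_token)
        best = {}
        for doc_id, d_vec in postings:
            sim = sum(a * b for a, b in zip(q_vec, d_vec))
            if doc_id not in best or sim > best[doc_id]:
                best[doc_id] = sim
        for doc_id, sim in best.items():
            scores[doc_id] += sim
    return dict(scores), traversed


# Hypothetical documents: punctuation tokens are indexed like any other token.
docs = {
    "d1": [("coil", (1.0, 0.0)), (",", (0.5, 0.5)), ("retrieval", (0.0, 1.0))],
    "d2": [("retrieval", (1.0, 1.0)), (".", (0.2, 0.2))],
}
# Hypothetical query without punctuation.
query = [("coil", (1.0, 0.0)), ("retrieval", (0.0, 2.0))]

index = build_index(docs)
scores, traversed = score(query, index)
print(scores)
print(sorted(traversed))
```

In this sketch the lists for `,` and `.` take up space in the index, but a punctuation-free query never reads them, so they affect neither the scores nor the traversal cost.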

luyug avatar Jan 02 '22 07:01 luyug

OK, thank you. I also wonder how you get your 7 negative samples: are they just randomly sampled from the negatives collected from the triple file?

namespace-Pt avatar Jan 02 '22 15:01 namespace-Pt