cannot reproduce leaderboard result
Hello Niklas, I have a question regarding reproducing SGPT's results. On the MTEB leaderboard, the 125M-weightedmean-msmarco-specb-bitfit model achieves 12.21 NDCG@10 on SCIDOCS. However, I wasn't able to reproduce that result following the instructions here. In my benchmarking, I got a very low number (0.00085). I think the instructions are a bit off.
My second question is that I couldn't really understand the idea behind these blocks. Looking at how you tokenize queries and corpus, it seems much more natural to me to simply wrap query text in [ ] and corpus text in { } before tokenizing them, as in the sketch below. Preprocessing SCIDOCS this way, I got an NDCG@10 of 11.09, which is much closer to the number reported on the leaderboard.
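For concreteness, the naive preprocessing I mean is roughly the following (the gpt2 tokenizer is just a stand-in for the model's actual tokenizer):

```python
# Rough sketch of the naive "wrap before tokenizing" preprocessing described above.
# The gpt2 tokenizer is only a stand-in; any GPT-style BPE tokenizer behaves similarly.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def naive_encode(text, is_query):
    # Wrap the raw string in brackets first, then tokenize the whole thing
    wrapped = f"[{text}]" if is_query else f"{{{text}}}"
    return tokenizer(wrapped, truncation=True)["input_ids"]

# Inspect how the brackets end up tokenized
print(tokenizer.convert_ids_to_tokens(naive_encode("This is a sentence", is_query=True)))
```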
> Hello Niklas, I have a question regarding reproducing SGPT's results. On the MTEB leaderboard, the 125M-weightedmean-msmarco-specb-bitfit model achieves 12.21 NDCG@10 on SCIDOCS. However, I wasn't able to reproduce that result following the instructions here. In my benchmarking, I got a very low number (0.00085). I think the instructions are a bit off.
I made a small mistake uploading the script when I was trying to combine this model & this model. I updated it & here's a Colab that reproduces the 12.21 NDCG@10 exactly.
> My second question is that I couldn't really understand the idea behind these blocks. Looking at how you tokenize queries and corpus, it seems much more natural to me to simply wrap query text in [ ] and corpus text in { } before tokenizing them. Preprocessing SCIDOCS this way, I got an NDCG@10 of 11.09, which is much closer to the number reported on the leaderboard.
Yes, you can do that, but it will produce slightly worse scores, like this model. This is because the brackets [ ] and { } may get intermingled with other tokens upon tokenization. For example, [This is a sentence] might be tokenized as "[This", " is", " a", " sent", "ence", "]". But we would like the special brackets to always be separate tokens that do not interfere with the text, i.e. "[", "This", ... Thus, the script uses special tokens (SOS) that are added to the vocabulary and will hence be tokenized separately. Prior to feeding the tokens to the model, these are then replaced with the actual bracket tokens here. A sketch of the idea follows below.
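For illustration, here is a minimal sketch of that idea, not the exact benchmark script: it skips the intermediate SOS placeholders and inserts the bracket token ids directly after tokenization, which gives the same end result of the brackets always being standalone tokens. The checkpoint id and the helper name `tokenize_with_specb` are just examples.

```python
# Minimal sketch of the separate-bracket-token idea (not the exact benchmark script).
# The checkpoint id and helper name are examples; any GPT-style BPE tokenizer works the same way.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers have no pad token by default

# Ids of the literal bracket tokens the model expects
QUE_BOS = tokenizer.encode("[", add_special_tokens=False)[0]
QUE_EOS = tokenizer.encode("]", add_special_tokens=False)[0]
DOC_BOS = tokenizer.encode("{", add_special_tokens=False)[0]
DOC_EOS = tokenizer.encode("}", add_special_tokens=False)[0]

def tokenize_with_specb(texts, is_query):
    # Tokenize the raw text first, without any brackets
    batch = tokenizer(texts, padding=False, truncation=True)
    bos, eos = (QUE_BOS, QUE_EOS) if is_query else (DOC_BOS, DOC_EOS)
    # Insert the bracket ids at the id level, so BPE can never merge them with the text
    for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
        ids.insert(0, bos)
        ids.append(eos)
        mask.insert(0, 1)
        mask.append(1)
    return tokenizer.pad(batch, padding=True, return_tensors="pt")

queries = tokenize_with_specb(["What is the impact of citations?"], is_query=True)
docs = tokenize_with_specb(["Citations measure scholarly impact ..."], is_query=False)
```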