
Step 4: Evaluating Models with kNN: incorrect perplexity (ppl)

Rubin-Wei opened this issue 1 year ago • 8 comments

When I reproduce Step 4 (Evaluating Models), the perplexity (ppl) I get from running kNN-LM is around 17, well above the reported value. Could you please explain why this might be the case? I would greatly appreciate a response.
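For context, the perplexity reported in this kind of evaluation is simply the exponential of the mean per-token negative log-likelihood, so small shifts in average loss move PPL noticeably. A minimal illustrative sketch (not the repository's actual evaluation code):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy example: an average NLL of 2.5 nats corresponds to a
# perplexity of about 12.18.
print(round(perplexity([2.5, 2.6, 2.4]), 2))  # 12.18
```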

Rubin-Wei avatar Dec 04 '24 19:12 Rubin-Wei

Hi Rubin, Thank you for your interest in our work.

Does it still happen when you use our datastore and our index?

Best, Uri


urialon avatar Dec 05 '24 13:12 urialon

Dear author,

Thank you very much for your response! I am using the neulab/gpt2-finetuned-wikitext103 model, and the dataset is Wikitext-103. The index and vals files I am using are gpt2/index_gpt2_116988150_768.indexed and gpt2/dstore_gpt2_116988150_768_vals.npy, respectively, from the link https://knn-transformers.s3.amazonaws.com/index.html.

However, when using the --knn option, the perplexity (PPL) of GPT-2 is 17.34, which is significantly higher than the 12.57 you report. Do you know what might be causing this discrepancy? [screenshot of evaluation output]

Another question: in your paper, RetoMaton is compared in terms of FoSS, and according to your figure, a smaller FoSS value indicates a lower PPL and better performance. However, for kNN-LM there seems to be no FoSS-related hyperparameter in the code. [screenshot of the figure]

If I could receive your reply, it would be greatly appreciated.
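As background for where such a PPL gap can come from: kNN-LM forms its next-token distribution by interpolating the base LM with a distribution over retrieved neighbors, so any change in which neighbors are retrieved shifts the per-token loss. A minimal sketch of the standard interpolation (variable names here are illustrative, not the repository's exact API):

```python
import numpy as np

def interpolate(p_lm, p_knn, lmbda=0.25):
    """Standard kNN-LM mixture: p = lmbda * p_knn + (1 - lmbda) * p_lm."""
    p_lm, p_knn = np.asarray(p_lm), np.asarray(p_knn)
    return lmbda * p_knn + (1 - lmbda) * p_lm

# If the retrieval distribution changes (e.g. different neighbors
# returned on GPU), the mixture, and hence the loss and PPL, changes.
p = interpolate([0.1, 0.9], [0.5, 0.5])
print(p)  # [0.2 0.8]
```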

Rubin-Wei avatar Dec 05 '24 13:12 Rubin-Wei

With the parameter knn_gpu set to False I get a perplexity of 12.5734; with knn_gpu set to True, the perplexity becomes 17.3421.

Zhuang16 avatar Sep 01 '25 07:09 Zhuang16

Hi, I am also having the same issue - I get a perplexity score of 17 with knn-lm on the wikipedia dataset.

RetoMaton lands in the ballpark of 12 to 13, improving on the fine-tuned baseline, which is consistent with the paper as I vary the min_knns parameter. Can I please clarify this too? I am using knn_gpu=True, and I built the wiki datastore using the code in the repository.

I suspect that the loss calculation differs between kNN-LM and RetoMaton and that this affects perplexity. When I compare generated outputs of the two, I get almost identical ROUGE scores, but the perplexity of the kNN-LM model is still significantly higher.

Kindly seeking help and I hope this question is reasonable. I am not a domain expert so if I'm doing something wrong please let me know.

purswaninuri avatar Nov 10 '25 02:11 purswaninuri

I set the parameter knn_gpu to False with a perplexity value of 12.5734, and when I set knn_gpu to True, the perplexity value becomes 17.3421.

Is my issue a GPU vs CPU setup consideration then?

purswaninuri avatar Nov 10 '25 02:11 purswaninuri

Hi folks, Thank you for your interest in our work.

Unfortunately, this codebase is 4 years old. I don't have the capacity to investigate why KNN-GPU gives different results from KNN-CPU, and I no longer have access to the same servers. Many things have probably changed in the Faiss library, which was unstable to begin with.

If KNN-CPU works, I recommend checking the Faiss documentation to see whether there is anything that can make the GPU version behave equivalently.

Best, Uri
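One common source of CPU/GPU divergence in nearest-neighbor search is reduced-precision storage or arithmetic on the GPU. Whether that is the cause here is only a hypothesis, but a small NumPy illustration (not Faiss itself) shows that merely rounding stored keys to float16 already perturbs the distances used to rank neighbors:

```python
import numpy as np

rng = np.random.default_rng(0)
keys = rng.normal(size=(1000, 64)).astype(np.float32)  # stand-in datastore keys
query = rng.normal(size=(64,)).astype(np.float32)

def sq_dists(keys, q):
    """Squared L2 distance from each key to the query."""
    return ((keys - q) ** 2).sum(axis=1)

d32 = sq_dists(keys, query)
d16 = sq_dists(keys.astype(np.float16).astype(np.float32), query)

# Distances are close but not identical, so neighbor ranking (and hence
# the kNN distribution and the final PPL) can shift between backends.
print(bool(np.abs(d32 - d16).max() > 0))  # True
```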

urialon avatar Nov 10 '25 03:11 urialon

Thanks for your help and prompt response. Noted on this.

purswaninuri avatar Nov 10 '25 03:11 purswaninuri

Hi, I was able to reproduce your perplexity score of 12.57 for the fine-tuned GPT-2 model from the Hugging Face page using CPU settings for kNN-LM (with default wrapper parameters).

purswaninuri avatar Nov 13 '25 02:11 purswaninuri