chameleon-llm
Discrepancy in accuracy on minitest set for gpt-3.5-turbo
Hi @lupantech, thank you for your excellent work.
I observed inconsistent accuracies on the minitest set. Specifically, I got acc_average values of 49.29 for gpt-3.5-turbo and 46.93 for Llama-2-7b, far below the 79.93 test-set accuracy reported for gpt-3.5.
Upon analyzing the "true_false" values in chameleon_chatgpt_test_cache.jsonl for entries whose pids appear in the minitest set, I calculated an accuracy of 0.7948.
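For reference, this is roughly how I computed that number from the cache file. This is a minimal sketch: I am assuming each line of the JSONL cache is an object carrying "pid" and "true_false" fields, and that accuracy is the fraction of true flags among the matching pids.

```python
import json

def minitest_accuracy(cache_path, minitest_pids):
    """Average the cached "true_false" flags over the minitest pids.

    Assumes each cache line is a JSON object with "pid" and
    "true_false" fields (field names inferred from the cache file).
    """
    results = []
    with open(cache_path) as f:
        for line in f:
            record = json.loads(line)
            if record["pid"] in minitest_pids:
                results.append(bool(record["true_false"]))
    # Fraction of correct predictions among matched pids.
    return sum(results) / len(results) if results else 0.0
```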
Could you help clarify this discrepancy, or share your minitest evaluation results if available?