chameleon-llm
Discrepancy in accuracy on minitest set for gpt-3.5-turbo
Hi @lupantech, thank you for your excellent work.
I observed inconsistent accuracies on the minitest set. Specifically, I got acc_average values of 49.29 for gpt-3.5-turbo and 46.93 for Llama-2-7b, far below the 79.93 test-set accuracy reported for gpt-3.5.
Upon analyzing the "true_false" values in chameleon_chatgpt_test_cache.jsonl for entries whose pids appear in the minitest set, I calculated an accuracy of 0.7948.
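For reference, this is roughly how I computed that number from the cache file. This is a minimal sketch: I am assuming each line of the JSONL cache is an object carrying "pid" and "true_false" fields, and that accuracy is the fraction of true flags among the matching pids.

```python
import json

def minitest_accuracy(cache_path, minitest_pids):
    """Average the cached "true_false" flags over the minitest pids.

    Assumes each cache line is a JSON object with "pid" and
    "true_false" fields (field names inferred from the cache file).
    """
    results = []
    with open(cache_path) as f:
        for line in f:
            record = json.loads(line)
            if record["pid"] in minitest_pids:
                results.append(bool(record["true_false"]))
    # Fraction of correct predictions among matched pids.
    return sum(results) / len(results) if results else 0.0
```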
Could you help clarify this discrepancy, or share your minitest evaluation results if available?