beyoung
[llama-2-7b-80k-result.zip](https://github.com/FranxYao/Long-Context-Data-Engineering/files/14353636/llama-2-7b-80k-result.zip) This is the output we got yesterday. We are now trying the new branch and the new prompt.
> Here is what I get from the new branch.
> The model behavior is quite interesting though. If you could confirm you can get similar results, I'll merge it...
We've run the new branch and observed improved results; it is indeed significantly better. However, the overall score is 0.848, which still shows a slight discrepancy compared to the results...
I did a quick check: after concatenating the original test set, searching for "? " matches 1259/1319 items, while the open-source MOCK_GSM8K_TEST only matches 379/1415. With that ratio, doesn't the similarity seem rather limited?
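(For reference, a minimal sketch of one way such counts could be reproduced. The GSM8K Hugging Face config, the local file name `mock_gsm8k_test.jsonl`, and the `question` field are assumptions, not taken from the original report.)

```python
# Sketch of reproducing the "? " counts; dataset names, the local JSONL path,
# and the "question" field below are assumptions.
import json
from datasets import load_dataset

def count_marker(texts, marker="? "):
    """Return (number of items containing the marker, total items)."""
    return sum(marker in t for t in texts), len(texts)

# Original GSM8K test split (1319 items).
gsm8k = load_dataset("gsm8k", "main", split="test")
print(count_marker(gsm8k["question"]))

# Open-sourced mock test set, assumed to be a local JSONL with a "question" field.
with open("mock_gsm8k_test.jsonl") as f:
    mock_questions = [json.loads(line)["question"] for line in f]
print(count_marker(mock_questions))
```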
What about the problems from 264 to 387? As far as I can see, evalplus also includes those problems in its evaluation.
Will a 0.8.3 release fixing this issue be coming out soon? It's urgent.