beyoung

Results: 8 comments by beyoung

[llama-2-7b-80k-result.zip](https://github.com/FranxYao/Long-Context-Data-Engineering/files/14353636/llama-2-7b-80k-result.zip) This is the output we got yesterday. We are now trying the new branch and the new prompt.

> Here is what I get from the new branch.
> The model behavior is quite interesting though. If you could confirm you can get similar results I'll merge it...

![企业微信截图_3a89a116-7dd2-41dc-a94f-4a879f65e210](https://github.com/FranxYao/Long-Context-Data-Engineering/assets/34389681/34a009bc-ce9e-43a4-a7a7-1efc1d9295d1) We've run the new branch and observed improved results; it is indeed significantly better. However, our overall score is 0.848, which still shows a slight discrepancy compared to the results...

I did a quick check: after concatenating the original test set, searching for "? " yields 1259/1319 matches, while the open-sourced MOCK_GSM8K_TEST only has 379/1415. Doesn't this ratio suggest the similarity between the two sets is rather limited?
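
Below is a minimal sketch of that check, assuming the original set is GSM8K's test split from the Hugging Face hub and that MOCK_GSM8K_TEST is a local JSON list of `{"question": ...}` records (the file name and field names here are assumptions; adjust them to the actual release).

```python
import json
from datasets import load_dataset

def count_with_marker(questions, marker="? "):
    """Count how many questions contain the given substring."""
    hits = sum(1 for q in questions if marker in q)
    return hits, len(questions)

# Original GSM8K test split (1319 questions).
gsm8k_test = load_dataset("gsm8k", "main", split="test")
print("gsm8k test:", count_with_marker(gsm8k_test["question"]))

# Hypothetical path to the open-sourced mock set (1415 questions).
with open("MOCK_GSM8K_TEST.json") as f:
    mock = json.load(f)
print("mock set:", count_with_marker([item["question"] for item in mock]))
```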

What about the problems from 264 to 387? As far as I can tell, evalplus also includes those problems in its evaluation.
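
If it helps, here is a hedged sketch for verifying which ids in that range evalplus actually evaluates. It assumes the indices refer to MBPP+ task ids of the form `Mbpp/<n>`; if they refer to a different split, swap in the corresponding loader.

```python
from evalplus.data import get_mbpp_plus

# Dict keyed by task_id, e.g. "Mbpp/2".
problems = get_mbpp_plus()
ids = sorted(int(task_id.split("/")[1]) for task_id in problems)

# Hypothetical range from the discussion above.
in_range = [i for i in ids if 264 <= i <= 387]
print(f"{len(in_range)} ids between 264 and 387 are present in the evalplus problem set")
```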

Will 0.8.3 be released soon to fix this issue? It's urgent.