beyoung

Results: 8 comments by beyoung

[llama-2-7b-80k-result.zip](https://github.com/FranxYao/Long-Context-Data-Engineering/files/14353636/llama-2-7b-80k-result.zip) This is the output we got yesterday. We are now trying the new branch and the new prompt.

> Here is what I get from the new branch.
> The model behavior is quite interesting though. If you could confirm you can get similar results I'll merge it...

![企业微信截图_3a89a116-7dd2-41dc-a94f-4a879f65e210](https://github.com/FranxYao/Long-Context-Data-Engineering/assets/34389681/34a009bc-ce9e-43a4-a7a7-1efc1d9295d1) We've run the new branch and observed improved results; it is indeed significantly better. However, our overall score is 0.848, which still shows a slight discrepancy compared to the results...

I did a quick check: after concatenating the original test set, searching for "? " yields 1259/1319 matches, while the open-sourced MOCK_GSM8K_TEST only has 379/1415. Doesn't this ratio suggest the similarity between the two sets is rather limited?
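
Below is a minimal sketch of that check, assuming the original set is GSM8K's test split from the Hugging Face hub and that MOCK_GSM8K_TEST is a local JSON list of `{"question": ...}` records (the file name and field names here are assumptions; adjust them to the actual release).

```python
import json
from datasets import load_dataset

def count_with_marker(questions, marker="? "):
    """Count how many questions contain the given substring."""
    hits = sum(1 for q in questions if marker in q)
    return hits, len(questions)

# Original GSM8K test split (1319 questions).
gsm8k_test = load_dataset("gsm8k", "main", split="test")
print("gsm8k test:", count_with_marker(gsm8k_test["question"]))

# Hypothetical path to the open-sourced mock set (1415 questions).
with open("MOCK_GSM8K_TEST.json") as f:
    mock = json.load(f)
print("mock set:", count_with_marker([item["question"] for item in mock]))
```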

What about the problems from 264 to 387? As far as I can tell, evalplus also includes those problems in its evaluation.
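
If it helps, here is a hedged sketch for verifying which ids in that range evalplus actually evaluates. It assumes the indices refer to MBPP+ task ids of the form `Mbpp/<n>`; if they refer to a different split, swap in the corresponding loader.

```python
from evalplus.data import get_mbpp_plus

# Dict keyed by task_id, e.g. "Mbpp/2".
problems = get_mbpp_plus()
ids = sorted(int(task_id.split("/")[1]) for task_id in problems)

# Hypothetical range from the discussion above.
in_range = [i for i in ids if 264 <= i <= 387]
print(f"{len(in_range)} ids between 264 and 387 are present in the evalplus problem set")
```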

Will 0.8.3 be released soon to fix this issue? It's urgent.