opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) on 100+ datasets.

Results 261 opencompass issues

### Describe the feature Only 7b, 14b, and 72b are available. What should be done about this? ### Will you implement it? - [ ] I would like to implement this feature and create a PR!

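### Describe the feature The code currently appears to depend heavily on torch.cuda. I hope the interface can be changed so that it also works with NPU cards, i.e. is compatible with torch_npu. ### Will you implement it? - [ ] I would like to implement this feature and create a PR!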
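
The request above could be approached with a small device-selection helper instead of direct `torch.cuda` calls. The following is only a minimal sketch, not OpenCompass code: `pick_device` is a hypothetical name, and the sketch assumes that importing `torch_npu` registers a `torch.npu` namespace exposing `is_available()`. It falls back to `"cpu"` when no accelerator backend is usable.

```python
import importlib
import importlib.util


def pick_device(preferred=("cuda", "npu", "cpu")):
    """Return the first usable backend name from `preferred` (hypothetical helper)."""
    # Without PyTorch installed there is nothing to probe.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    # Assumption: importing torch_npu is what makes `torch.npu` available.
    if importlib.util.find_spec("torch_npu") is not None:
        try:
            importlib.import_module("torch_npu")
        except ImportError:
            pass  # plugin present but unusable; keep probing other backends

    for name in preferred:
        if name == "cpu":
            return "cpu"
        backend = getattr(torch, name, None)
        is_available = getattr(backend, "is_available", None)
        if callable(is_available) and is_available():
            return name
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```

Code that currently hard-codes `"cuda"` could then pass the returned string to `tensor.to(...)` or `model.to(...)`, which keeps the backend choice in one place.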

### Prerequisite - [X] I have searched [Issues](https://github.com/open-compass/opencompass/issues/) and [Discussions](https://github.com/open-compass/opencompass/discussions) but cannot get the expected help. - [X] The bug has not been fixed in the [latest version](https://github.com/open-compass/opencompass). ### Type...

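### Describe the feature `python run.py --models hf_llama2_7b --custom-dataset-path xxx/test_qa.jsonl --custom-dataset-data-type qa --custom-dataset-infer-method gen` The score produced by this command defaults to accuracy. Does that mean the output must match the reference exactly to count as correct? How can I switch to a different evaluation metric? Doing it by adding a new config file has a rather steep learning curve... ### Will you implement it? - [ ] I would...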
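
To illustrate the distinction the issue above asks about, here is a minimal sketch comparing strict exact-match accuracy with two more lenient scoring rules. It is plain Python and does not use or describe OpenCompass's evaluator API; the function names and the sample predictions are hypothetical.

```python
def exact_match(pred: str, ref: str) -> bool:
    """Strict rule: the prediction must equal the reference character for character."""
    return pred == ref


def normalized_match(pred: str, ref: str) -> bool:
    """Lenient rule: ignore case and surrounding whitespace."""
    return pred.strip().lower() == ref.strip().lower()


def contains_match(pred: str, ref: str) -> bool:
    """More lenient rule: the normalized reference appears anywhere in the prediction."""
    return ref.strip().lower() in pred.strip().lower()


def score(preds, refs, metric) -> float:
    """Percentage of (prediction, reference) pairs accepted by `metric`."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not refs:
        return 0.0
    hits = sum(metric(p, r) for p, r in zip(preds, refs))
    return 100.0 * hits / len(refs)


if __name__ == "__main__":
    preds = ["Paris", " paris ", "The answer is Paris."]
    refs = ["Paris", "Paris", "Paris"]
    for metric in (exact_match, normalized_match, contains_match):
        print(metric.__name__, round(score(preds, refs, metric), 2))
```

On free-form generated answers, the strict rule rejects outputs that differ only in casing or extra wording, which is why the choice of metric matters for a `qa`-type dataset.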

### Describe the feature Is there any plan to support PromptCBLUE, a Chinese medical LLM evaluation benchmark? https://github.com/michael-wzhu/PromptCBLUE ### Will you implement it? - [ ] I would like to...

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it get feedback more easily. If you do not understand...

### Describe the feature Add documentation and example for NumWorkersPartitioner ### Will you implement it? - [ ] I would like to implement this feature and create a PR!

### Describe the feature When I evaluated the vicuna-7b-v1.5 model using the mbpp_gen script, the score was 0 and most answers were reported as failed. Perhaps the evaluate script did not properly...

SubjectiveSummarizer is not defined; change to AlignmentBenchSummarizer.

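### LiveCodeBench [Github](https://github.com/LiveCodeBench/LiveCodeBench) [HomePage](https://livecodebench.github.io/) Advantages of the dataset: 1. The humaneval and mbpp problems are too basic; this dataset is harder. 2. It is drawn from recent coding competitions, so data contamination is much less of a concern. 3. Besides **code writing**, it also covers **result prediction**, **code repair**, and **code execution**, giving a more comprehensive measure of a model's coding ability. ### Would you like to implement this feature yourself? - [ ] I would like to implement this feature myself and contribute the code to OpenCompass!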