
💡 [REQUEST] - Proposal to Evaluate on the HumanEval-V Benchmark for Enhanced Visual Reasoning and Code Generation

Open · zfj1998 opened this issue 1 year ago

Start Date

No response

Implementation PR

No response

Reference Issues

No response

Summary

Congratulations on the impressive work!

I would like to suggest extending the visual-reasoning evaluation to the HumanEval-V benchmark. It poses a more demanding set of tasks by pairing complex diagrams with coding problems. Unlike traditional visual reasoning tasks that ask for multiple-choice or short answers, HumanEval-V requires models to generate code from visual input, which better tests both instruction following and open-ended generation.

Key points for consideration:

  • HumanEval-V broadens the range of reasoning scenarios with complex diagrams, pushing the limits of visual understanding.
  • The task format is tailored to code generation, making it a suitable benchmark for testing MLLMs’ ability to handle more structured, generative tasks.
  • Evaluating on it would show how well MiniCPM-V handles visual reasoning combined with coding, and correctness can be scored automatically (and even used as a reward signal) through execution feedback; see the sketch after this list.
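
To make the execution-feedback point concrete, here is a minimal sketch of how generated solutions could be scored by actually running them against each task's unit tests. This is not the benchmark's official harness; `candidate_code` and `test_code` are hypothetical stand-ins for a sampled solution and the task's tests.

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(candidate_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run a generated solution together with its unit tests in a
    subprocess and report whether the tests exit cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        # Concatenate the model's code with the task's tests into one script.
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def pass_at_1(samples: list[tuple[str, str]]) -> float:
    """pass@1 over (candidate_code, test_code) pairs: the fraction of
    tasks whose single sampled solution passes its tests."""
    if not samples:
        return 0.0
    return sum(passes_tests(code, tests) for code, tests in samples) / len(samples)
```

Because every task yields an unambiguous pass/fail result, the same signal could double as an execution-based reward for fine-tuning, as noted above.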

Basic Example

You can find more information about the benchmark here: HumanEval-V Homepage.
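
As a rough illustration of the task format (all field names below are hypothetical; the actual schema is documented on the homepage), each item pairs a diagram with a Python function signature that the model must complete:

```python
# Illustrative only: a HumanEval-V-style item couples an image with a
# function signature to complete. Field names here are hypothetical.
task = {
    "image_path": "diagram.png",  # the visual context the model must interpret
    "signature": "def solve(grid: list[list[int]]) -> int:",
    "instruction": "Complete the function so it implements the rule shown in the diagram.",
}

# The model is prompted with the image plus the text below and must return a
# complete function body, which is then verified by executing the task's tests.
prompt = f"{task['instruction']}\n\n{task['signature']}\n    ..."
```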

Drawbacks

NA

Unresolved questions

No response

zfj1998 · Feb 25, 2025