
Incorrect Answer Sudoku

Open sebastianherreramonterrosa opened this issue 10 months ago • 0 comments

A few days ago, I created a request for a Sudoku image.

Image

The evaluation is here: https://visioncheckup.com/assessments/sudoku-puzzle-extraction/

In addition, the image is not displayed on the website, and I think the evaluation is incorrect. Here is my evaluation:

| Model | Exact board? | Correct cells | Accuracy | Extracted grid (rows separated by spaces) |
| --- | --- | --- | --- | --- |
| ChatGPT-4o | No | 46 / 81 | 56.8 % | `...4.8... .6.7..1.. 72.9.5.4. .97.4..3. ..7.5.8.. .896.5... 941.7.8.. 7..6..4.. ..2.7....` |
| Claude 3.7 Sonnet | No | 63 / 81 | 77.8 % | `....48... .6...7..1 7.2.9.5.4 .9.74.3.. ..7.5.8.. .8.96.5.. 9.4.1.7.8 .7...6.4. ....27...` |
| Claude 4 Opus | Yes | 81 / 81 | 100 % | `...4.8... .6..7..1. 7.2.9.5.4 .9.7.4.3. ..7.5.8.. .8.9.6.5. 9.4.1.7.8 .7..6..4. ...2.7...` |
| Claude 4 Sonnet | No | 58 / 81 | 71.6 % | `...4.8... .6..7...1 7.2.9.5.4 .9.7.4..3 ..7.5.8.. ..8.9.6.5 9..4.1.7.8 .7...6..4 ...2.7...` |
| GPT-4.1 | No | 38 / 81 | 46.9 % | `..4.8.... .6.7..1.. 72.9.5.4. 9.7.4..3. ..7.5.8.. 8.9.6..5. 941.7.8.. 7..6..4.. ..2.7....` |
| GPT-4.1 Mini | No | 53 / 81 | 65.4 % | `..4.8.... .6..7..1. 7.29.5.4. .9.7.4.3. .7.5.8... 8.9.6.5.. 9.4.1.7.8 .7..6..4. ..2.7....` |
| Gemini 2.0 Flash | No | 80 / 81 | 98.8 % | `...4.8.. .6..7..1. 7.2.9.5.4 .9.7.4.3. ..7.5.8.. .8.9.6.5. 9.4.1.7.8 .7..6..4. ...2.7...` |
| Gemini 2.0 Flash Lite | No | 43 / 81 | 53.1 % | `....4.8.. ..6..7.1. 7.2.9.5.4 ..9.7.4.3 ...7.5.8. ..8.9.6.5 9.4.1.7.8 ..7..6.4. ....2.7..` |
| Gemini 2.5 Pro | Yes | 81 / 81 | 100 % | `...4.8... .6..7..1. 7.2.9.5.4 .9.7.4.3. ..7.5.8.. .8.9.6.5. 9.4.1.7.8 .7..6..4. ...2.7...` |
| OpenAI O1 | No | 40 / 81 | 49.4 % | `...4..8.. 6..7....1 72.9.5..4 97..4...3 7...5.8.. 89...6.5. 94178.... 764...... .......27` |
| OpenAI O4 Mini | No | 57 / 81 | 70.4 % | `...4.8... .6..7..1. 72..9.5.4 .9.7.4.3. .7..5..8. ..8.9.6.5. 941.7..8. ..7.6..4. ....27...` |
| Qwen 2.5 VL 7B | No | 12 / 81 | 14.8 % | `. . 4 8 . . . 6 7 . . 1 . 7 2 9 5 4 . 9 7 4 3 . . 7 5 8 . 8 9 6 5 . 9 4 1 7 8 . 7 . 6 4 . 2 . 7 .` |

Correct answer: ...4.8... .6..7..1. 7.2.9.5.4 .9.7.4.3. ..7.5.8.. .8.9.6.5. 9.4.1.7.8 .7..6..4. ...2.7...
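For anyone who wants to re-check the cell counts, here is a minimal sketch of one way to score an extracted grid against the correct answer. It assumes a position-by-position comparison within each row, where missing or extra cells never count as correct. The function name `count_correct_cells` is hypothetical and is not part of the site's own evaluation. It also assumes rows are separated by whitespace, so an output with a space between every cell (like the Qwen one) would need to be regrouped into rows first.

```python
# Hypothetical helper, not the site's scoring code: compares two grids
# written as whitespace-separated rows, cell by cell.
def count_correct_cells(extracted: str, expected: str) -> int:
    extracted_rows = extracted.split()
    expected_rows = expected.split()
    correct = 0
    for i, expected_row in enumerate(expected_rows):
        # A missing row contributes no correct cells.
        row = extracted_rows[i] if i < len(extracted_rows) else ""
        # zip stops at the shorter row, so missing or extra cells never count.
        correct += sum(a == b for a, b in zip(row, expected_row))
    return correct


CORRECT = ("...4.8... .6..7..1. 7.2.9.5.4 .9.7.4.3. ..7.5.8.. "
           ".8.9.6.5. 9.4.1.7.8 .7..6..4. ...2.7...")
GPT_4O = ("...4.8... .6.7..1.. 72.9.5.4. .97.4..3. ..7.5.8.. "
          ".896.5... 941.7.8.. 7..6..4.. ..2.7....")

cells = count_correct_cells(GPT_4O, CORRECT)
print(cells, "/ 81")                      # 46 / 81
print(round(100 * cells / 81, 1), "%")    # 56.8 %
```

Under this assumption, a single shifted digit makes every later cell in that row count as wrong, so a row that is off by one position scores much lower than it looks at first glance.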

Only Claude 4 Opus and Gemini 2.5 Pro completed the task correctly.