
Added DotsOCR results of olmOCR-bench

Open aman-17 opened this issue 2 months ago • 5 comments

  1. Added DotsOCR results of olmOCR-bench.
Final Summary with 95% Confidence Intervals:
dotsocr              : Average Score: 69.3% ± 1.1% (average of per-JSONL scores)
    absent  : 79.6% average pass rate over 823 tests
    baseline: 97.7% average pass rate over 1403 tests
    math    : 65.8% average pass rate over 3385 tests
    order   : 65.6% average pass rate over 1061 tests
    present : 41.3% average pass rate over 721 tests
    table   : 84.8% average pass rate over 1020 tests
    Results by JSONL file:
        arxiv_math.jsonl              : 65.2% (1909/2927 tests)
        baseline                      : 97.8% (1363/1394 tests)
        headers_footers.jsonl         : 79.5% (604/760 tests)
        long_tiny_text.jsonl          : 46.2% (204/442 tests)
        multi_column.jsonl            : 72.9% (644/884 tests)
        old_scans.jsonl               : 38.6% (203/526 tests)
        old_scans_math.jsonl          : 69.7% (319/458 tests)
        table_tests.jsonl             : 84.8% (867/1022 tests)
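
For reference, the headline number can be re-derived from the per-file counts above. This is a minimal sketch that assumes "average of per-JSONL scores" means an unweighted mean of each file's passed/total ratio; the function names are illustrative and are not part of olmocr.bench.

```python
# Pass/total counts copied from the "Results by JSONL file" listing above.
results = {
    "arxiv_math.jsonl": (1909, 2927),
    "baseline": (1363, 1394),
    "headers_footers.jsonl": (604, 760),
    "long_tiny_text.jsonl": (204, 442),
    "multi_column.jsonl": (644, 884),
    "old_scans.jsonl": (203, 526),
    "old_scans_math.jsonl": (319, 458),
    "table_tests.jsonl": (867, 1022),
}


def average_of_per_file_scores(counts):
    """Unweighted mean of per-file pass rates, in percent (assumed definition)."""
    rates = [passed / total for passed, total in counts.values()]
    return 100.0 * sum(rates) / len(rates)


def pooled_pass_rate(counts):
    """Pass rate over all tests pooled together, in percent."""
    passed = sum(p for p, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return 100.0 * passed / total


print(f"average of per-file scores: {average_of_per_file_scores(results):.1f}%")
print(f"pooled pass rate:           {pooled_pass_rate(results):.1f}%")
```

Under that assumption the unweighted mean reproduces the reported 69.3%, while pooling all tests gives a higher figure because the large arxiv_math and baseline files carry more weight.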

aman-17 avatar Sep 26 '25 19:09 aman-17

Their own published test results are better than olmocr's: https://github.com/rednote-hilab/dots.ocr?tab=readme-ov-file#3-olmocr-bench

dots.ocr 82.1 64.2 88.3 40.9 94.1 82.4 81.2 99.5 79.1 ± 1.0
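
To see where the two runs diverge most, the published row can be lined up against the per-file scores from the first comment. This is only a sketch: the column order of the published row is not stated in this thread, so the order used below (ArXiv, Old Scans Math, Tables, Old Scans, Headers and Footers, Multi column, Long tiny text, Base) is an assumption that should be checked against the linked README.

```python
# ASSUMPTION: column order of the published dots.ocr row; verify against the README.
columns = [
    "arxiv_math", "old_scans_math", "table_tests", "old_scans",
    "headers_footers", "multi_column", "long_tiny_text", "baseline",
]
published = [82.1, 64.2, 88.3, 40.9, 94.1, 82.4, 81.2, 99.5]

# Per-file scores from the olmOCR-bench run posted at the top of this issue.
reproduced = {
    "arxiv_math": 65.2, "baseline": 97.8, "headers_footers": 79.5,
    "long_tiny_text": 46.2, "multi_column": 72.9, "old_scans": 38.6,
    "old_scans_math": 69.7, "table_tests": 84.8,
}

# Gap = published minus reproduced, largest gap first.
gaps = sorted(
    ((name, round(pub - reproduced[name], 1)) for name, pub in zip(columns, published)),
    key=lambda item: item[1],
    reverse=True,
)

for name, gap in gaps:
    print(f"{name:16s} {gap:+.1f}")
```

If the assumed order is right, the biggest shortfall is in long_tiny_text, followed by arxiv_math and headers_footers, which may be a useful place to start looking for differences between the two setups.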

montvid avatar Oct 16 '25 08:10 montvid

Yeah, that's why I don't exactly want to merge this. I am not sure what else is different between this code and theirs which is causing lower scores.

jakep-allenai avatar Oct 16 '25 15:10 jakep-allenai

Open to contributions btw :D

jakep-allenai avatar Oct 16 '25 15:10 jakep-allenai

I am trying to use your bench suite by following this manual: https://github.com/allenai/olmocr/tree/main/olmocr/bench It seems to need an update: I don't have sglang, but I do have vllm. Also, how do I run the dots.ocr benchmark? I got as far as:

python -m olmocr.bench.convert olmocr_pipeline --dir ./olmOCR-bench/bench_data
Traceback (most recent call last):
  File "", line 198, in _run_module_as_main
  File "", line 88, in _run_code
  File "/home/kvb/olmocr/olmocr/bench/convert.py", line 246, in <module>
    module = importlib.import_module(module_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "", line 1206, in _gcd_import
  File "", line 1178, in _find_and_load
  File "", line 1149, in _find_and_load_unlocked
  File "", line 690, in _load_unlocked
  File "", line 940, in exec_module
  File "", line 241, in _call_with_frames_removed
  File "/home/kvb/olmocr/olmocr/bench/runners/run_olmocr_pipeline.py", line 7, in <module>
    from olmocr.pipeline import (
ImportError: cannot import name 'sglang_server_host' from 'olmocr.pipeline' (/home/kvb/olmocr/olmocr/pipeline.py)
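
An ImportError like this means the runner asks for a name that the installed olmocr.pipeline no longer defines. A quick way to confirm which names are missing, without running the whole benchmark, is sketched below. The helper missing_names is hypothetical (not part of olmocr), and the demo call uses the standard-library json module so that it runs anywhere.

```python
import importlib


def missing_names(module_name, names):
    """Return the entries of `names` that `module_name` does not define.

    Hypothetical helper for diagnosing "cannot import name" errors.
    """
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


# Demo against the standard library: json has dumps but no sglang_server_host.
print(missing_names("json", ["dumps", "sglang_server_host"]))

# With olmocr installed, the same check could be pointed at the failing import:
# print(missing_names("olmocr.pipeline", ["sglang_server_host"]))
```

A non-empty result for olmocr.pipeline would suggest that the benchmark runner and the installed pipeline module come from different versions.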

montvid avatar Oct 20 '25 14:10 montvid

Yeah, sadly some bitrot has occurred...

In the latest release from yesterday, the command python -m olmocr.bench.convert olmocr_pipeline --dir ./olmOCR-bench/bench_data should run for the olmOCR case again.

jakep-allenai avatar Oct 23 '25 18:10 jakep-allenai