Added DotsOCR results of olmOCR-bench
- Added DotsOCR results of olmOCR-bench.
Final Summary with 95% Confidence Intervals:
dotsocr : Average Score: 69.3% ± 1.1% (average of per-JSONL scores)
absent : 79.6% average pass rate over 823 tests
baseline: 97.7% average pass rate over 1403 tests
math : 65.8% average pass rate over 3385 tests
order : 65.6% average pass rate over 1061 tests
present : 41.3% average pass rate over 721 tests
table : 84.8% average pass rate over 1020 tests
Results by JSONL file:
arxiv_math.jsonl : 65.2% (1909/2927 tests)
baseline : 97.8% (1363/1394 tests)
headers_footers.jsonl : 79.5% (604/760 tests)
long_tiny_text.jsonl : 46.2% (204/442 tests)
multi_column.jsonl : 72.9% (644/884 tests)
old_scans.jsonl : 38.6% (203/526 tests)
old_scans_math.jsonl : 69.7% (319/458 tests)
table_tests.jsonl : 84.8% (867/1022 tests)
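For anyone checking the numbers: the headline score says it is an average of per-JSONL scores, so each file counts equally regardless of how many tests it holds. Here is a minimal sketch that recomputes it from the pass/total counts above. It assumes each per-file score is simply passed/total, which matches the printed percentages but is not confirmed against the bench code. The variable and function names are made up for illustration, and the ± 1.1% confidence interval is not reproduced.

```python
# (passed, total) per JSONL file, copied from the "Results by JSONL file" output.
results = {
    "arxiv_math.jsonl": (1909, 2927),
    "baseline": (1363, 1394),
    "headers_footers.jsonl": (604, 760),
    "long_tiny_text.jsonl": (204, 442),
    "multi_column.jsonl": (644, 884),
    "old_scans.jsonl": (203, 526),
    "old_scans_math.jsonl": (319, 458),
    "table_tests.jsonl": (867, 1022),
}


def macro_average(counts):
    """Mean of per-file pass rates: every file has the same weight."""
    rates = [passed / total for passed, total in counts.values()]
    return sum(rates) / len(rates)


def micro_average(counts):
    """Pooled pass rate: every test has the same weight."""
    passed = sum(p for p, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return passed / total


macro = macro_average(results)
micro = micro_average(results)
print(f"macro (per-JSONL) average: {macro:.1%}")  # 69.3%, same as the summary
print(f"micro (pooled) average:    {micro:.1%}")
```

The pooled figure comes out higher because the large, easy baseline file dominates it, so the two should not be compared directly.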
Their own test results are better than olmocr; they are published here: https://github.com/rednote-hilab/dots.ocr?tab=readme-ov-file#3-olmocr-bench
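dots.ocr: ArXiv 82.1 | Old Scans Math 64.2 | Tables 88.3 | Old Scans 40.9 | Headers and Footers 94.1 | Multi column 82.4 | Long tiny text 81.2 | Base 99.5 | Overall 79.1 ± 1.0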
Yeah, that's why I don't exactly want to merge this. I am not sure what else is different between this code and theirs which is causing lower scores.
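To narrow down where the scores diverge, a rough per-category comparison may help. This sketch assumes the published row follows the usual olmOCR-bench column order (ArXiv, Old Scans Math, Tables, Old Scans, Headers and Footers, Multi column, Long tiny text, Base); that mapping is an assumption, since the row above carries no headers. The dictionary keys are illustrative labels, not names from either codebase.

```python
# Scores from this PR's run (per-JSONL results above).
local = {
    "arxiv_math": 65.2,
    "old_scans_math": 69.7,
    "table_tests": 84.8,
    "old_scans": 38.6,
    "headers_footers": 79.5,
    "multi_column": 72.9,
    "long_tiny_text": 46.2,
    "baseline": 97.8,
}

# Scores from the dots.ocr README row, mapped by ASSUMED column order.
published = {
    "arxiv_math": 82.1,
    "old_scans_math": 64.2,
    "table_tests": 88.3,
    "old_scans": 40.9,
    "headers_footers": 94.1,
    "multi_column": 82.4,
    "long_tiny_text": 81.2,
    "baseline": 99.5,
}

# Negative gap means this run scored lower than the published number.
gaps = {name: round(local[name] - published[name], 1) for name in local}

# Sort so the largest shortfall comes first.
ranked = sorted(gaps.items(), key=lambda item: item[1])
for name, gap in ranked:
    print(f"{name:16s} {gap:+.1f}")
```

Under that mapping the biggest shortfall is long_tiny_text, followed by arxiv_math and headers_footers, so those categories look like the first places to check for differences in prompts, rendering resolution, or post-processing.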
Open to contributions btw :D
I am trying to use your bench suite by following the manual at https://github.com/allenai/olmocr/tree/main/olmocr/bench and it seems to need an update. I don't have sglang, but I do have vllm. Also, how do I run the dots.ocr benchmark? I got as far as:
python -m olmocr.bench.convert olmocr_pipeline --dir ./olmOCR-bench/bench_data
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/kvb/olmocr/olmocr/bench/convert.py", line 246, in <module>
    module = importlib.import_module(module_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/kvb/olmocr/olmocr/bench/runners/run_olmocr_pipeline.py", line 7, in <module>
    from olmocr.pipeline import (
ImportError: cannot import name 'sglang_server_host' from 'olmocr.pipeline' (/home/kvb/olmocr/olmocr/pipeline.py)
Yeah, sadly some bitrot has occurred...
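A quick way to spot this kind of bitrot before running a whole benchmark is to check whether a module still exposes the names a runner expects. The helper below is a generic sketch using only the standard library; `missing_names` is a hypothetical name, not part of olmocr, and the demo uses the stdlib `json` module so it runs anywhere.

```python
import importlib


def missing_names(module_name, names):
    """Return the entries of `names` that `module_name` does not define."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


if __name__ == "__main__":
    # Demo against a stdlib module; "not_a_real_name" is deliberately bogus.
    print(missing_names("json", ["loads", "not_a_real_name"]))
    # prints ['not_a_real_name']
```

With olmocr installed, calling `missing_names("olmocr.pipeline", ["sglang_server_host"])` would be expected to report that name as missing on the checkout from the traceback above. Any other names the runner imports would need to be read off run_olmocr_pipeline.py.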
In the latest release (published yesterday), the command python -m olmocr.bench.convert olmocr_pipeline --dir ./olmOCR-bench/bench_data should work again for the olmOCR case.