
Significant difference of performance between online demo and local inference

Open caolonghao opened this issue 7 months ago • 5 comments

🐛 Describe the bug

There is a significant difference in performance between the online demo and local inference, but I can't find the reason. For example, the online demo processed this page as expected. However, the locally deployed version fails with a JSON decode error. I checked the raw output of the model: it keeps repeating some content, and its table recognition is also worse than the online demo's.

Do you have any suggestions? Attached: error_page.pdf
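For reference, a rough sketch of how the failure mode can be checked from a saved raw model output; the file name and the repetition threshold are illustrative placeholders, not anything from the olmocr pipeline:

```python
# Rough diagnostic over a saved raw model output: confirm the JSON decode
# failure and look for repeated content near the end. File name and the
# repetition threshold are illustrative placeholders.
import json
from collections import Counter

with open("raw_model_output.txt", encoding="utf-8") as f:
    raw_text = f.read()

try:
    json.loads(raw_text)
    print("Output parses as valid JSON")
except json.JSONDecodeError as exc:
    print(f"JSON decode error: {exc}")

# Crude repetition check: count duplicate non-empty lines in the tail.
tail_lines = [line for line in raw_text.splitlines()[-200:] if line.strip()]
if tail_lines:
    line, count = Counter(tail_lines).most_common(1)[0]
    if count > 5:
        print(f"Line repeated {count} times near the end: {line[:80]!r}")
```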

[image attachment]

Versions

version 0.1.61, installed from source.

caolonghao avatar Apr 16 '25 10:04 caolonghao

Hmm, the online demo runs vllm vs. sglang, but otherwise they should be identical. There is some randomness in sampling, e.g. have you tried running the PDF through several times each way? Is it always wrong in the local version?

jakep-allenai avatar Apr 16 '25 19:04 jakep-allenai

> Hmm, the online demo runs vllm vs. sglang, but otherwise they should be identical. There is some randomness in sampling, e.g. have you tried running the PDF through several times each way? Is it always wrong in the local version?

Yes, I have run it several times (both by setting a larger max_try_num and by manually running the code several times), but the problem persists.

caolonghao avatar Apr 17 '25 01:04 caolonghao

Closing this issue for now, please feel free to reopen if you want to discuss further.

aman-17 avatar Jul 17 '25 23:07 aman-17

@jakep-allenai What parameters (temperature, etc.) are used in the online demo vs. the ones provided in pipeline.py? I am also noticing differences, and I'm pretty sure something is different in the online demo compared to the repo. Can we access the code for the web demo?

Atharva-Phatak avatar Jul 28 '25 12:07 Atharva-Phatak

Hey, I don't have the full demo code to share, but I will share what I easily can right now on the inference side:

https://gist.github.com/jakep-allenai/15c713545062ef458b7efa2101d69c06

It only has 3 retries (at 0.1, 0.4, and 0.8 temperature), compared to a slower ramp-up on the local inference side.
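For illustration, a minimal sketch of that retry ladder against a local vllm/sglang server exposing an OpenAI-compatible endpoint; the endpoint URL, model name, prompt, and max_tokens here are assumptions for the example, not the demo's actual code (see the gist above for that):

```python
# Sketch of a 3-attempt temperature ladder (0.1, 0.4, 0.8) that keeps the
# first response parsing as valid JSON. Endpoint URL, model name, and
# max_tokens are placeholder assumptions, not the demo's real settings.
import json
import requests

TEMPERATURES = [0.1, 0.4, 0.8]

def ocr_page(prompt: str, page_png_b64: str,
             url: str = "http://localhost:8000/v1/chat/completions",
             model: str = "olmocr-model") -> dict:
    last_error = None
    for temperature in TEMPERATURES:
        payload = {
            "model": model,
            "temperature": temperature,
            "max_tokens": 4096,
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{page_png_b64}"}},
                ],
            }],
        }
        response = requests.post(url, json=payload, timeout=300).json()
        text = response["choices"][0]["message"]["content"]
        try:
            return json.loads(text)      # first valid JSON wins
        except json.JSONDecodeError as exc:
            last_error = exc             # bump temperature and try again
    raise RuntimeError(f"all retries failed to produce valid JSON: {last_error}")
```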

The demo is served with vllm 0.9.2 on an A100-80GB, but without FlashInfer installed in the container, which is a little different as well.

Do you have any English-language files where you see a clear difference that we can look at?

jakep-allenai avatar Jul 28 '25 21:07 jakep-allenai