
Is it possible to run on only a CPU?

DrewThomasson opened this issue 9 months ago • 11 comments

I know in the readme it says

" Requirements:

  • Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100)
  • 30GB of free disk space "

But it also says

"Install sglang with flashinfer if you want to run inference on GPU."

Does that imply that it can be run on a CPU only (albeit a bit slowly)?

Thanks! :)

DrewThomasson avatar Feb 26 '25 21:02 DrewThomasson

At the moment, it's not possible via pipeline.py, but you can do it if you just infer the model directly.

See: https://huggingface.co/allenai/olmOCR-7B-0225-preview

The model card has a code sample on how to call the model, which will work (slowly) on CPU. But you lose the advantages of the pipeline.py method like retries and output verification etc.
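
For illustration, here is a minimal sketch of what a do-it-yourself retry-and-validation loop around a direct model call might look like; run_page is a hypothetical callable standing in for the model-card inference code, not an olmocr API:

import json

def generate_with_retries(run_page, max_retries=3):
    # Hypothetical sketch: run_page() returns the raw generated string.
    # Retry until the model emits parseable JSON with a natural_text field.
    for attempt in range(max_retries):
        raw = run_page()
        try:
            page = json.loads(raw)
            if page.get("natural_text"):
                return page
        except json.JSONDecodeError:
            pass
        print(f"Attempt {attempt + 1} failed validation, retrying...")
    raise RuntimeError("Model never produced valid JSON output")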

jakep-allenai avatar Feb 27 '25 04:02 jakep-allenai

hm

Then I shall Wait

Thx for the detailed response! :)

Edit: testing this running on CPU only on my Mac M1 Pro with 16 GB RAM right now

DrewThomasson avatar Feb 27 '25 04:02 DrewThomasson

Confirmed to work on CPU through the script you pointed me to! :D

(took a while tho lol)

Sadly, the output appears truncated, so something may have gone wrong, looking at it...

(base) drew@wmughal-CN4D09397T test % python test.py      
Loading checkpoint shards: 100%|████████████████████████████| 4/4 [00:00<00:00,  6.16it/s]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
['{"primary_language":"en","is_rotation_valid":true,"rotation_correction":0,"is_table":false,"is_diagram":false,"natural_text":"Molmo and PixMo:\\nOpen Weights and Open Data\\nfor State-of-the']
(base) drew@wmughal-CN4D09397T test % 

DrewThomasson avatar Feb 27 '25 05:02 DrewThomasson

Running this modified script:

import torch
import base64
import urllib.request
import json
import time
from io import BytesIO
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

from olmocr.data.renderpdf import render_pdf_to_base64png
from olmocr.prompts import build_finetuning_prompt
from olmocr.prompts.anchor import get_anchor_text

# Start time tracking
start_time = time.time()

# Initialize the model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "allenai/olmOCR-7B-0225-preview", torch_dtype=torch.bfloat16
).eval()
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Grab a sample PDF
pdf_path = "./paper.pdf"
urllib.request.urlretrieve("https://molmo.allenai.org/paper.pdf", pdf_path)

# Render page 1 to an image
image_base64 = render_pdf_to_base64png(pdf_path, 1, target_longest_image_dim=1024)

# Build the prompt using document metadata
anchor_text = get_anchor_text(pdf_path, 1, pdf_engine="pdfreport", target_length=4000)
prompt = build_finetuning_prompt(anchor_text)

# Build the full prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
        ],
    }
]

# Apply the chat template and processor
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
main_image = Image.open(BytesIO(base64.b64decode(image_base64)))

# Prepare inputs for model
inputs = processor(
    text=[text],
    images=[main_image],
    padding=True,
    return_tensors="pt",
)
inputs = {key: value.to(device) for (key, value) in inputs.items()}

# Generate the output
output = model.generate(
    **inputs,
    temperature=0.8,
    max_new_tokens=200,  # Increased to avoid truncation
    num_return_sequences=1,
    do_sample=True,
)

# Decode the output
prompt_length = inputs["input_ids"].shape[1]
new_tokens = output[:, prompt_length:]
text_output = processor.tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

# End time tracking
end_time = time.time()
processing_time = end_time - start_time  # Time taken for execution

# Save output to text file
output_text_path = "output.txt"
with open(output_text_path, "w", encoding="utf-8") as f:
    f.write(text_output[0])  # Save the first element as text

# Try saving output as JSON if possible
output_json_path = "output.json"
try:
    parsed_output = json.loads(text_output[0])  # Try parsing as JSON
    with open(output_json_path, "w", encoding="utf-8") as f:
        json.dump(parsed_output, f, indent=4)
    print(f"Output successfully saved as JSON: {output_json_path}")
except json.JSONDecodeError:
    print("Output is not valid JSON, saved as plain text.")

# Print output & processing time
print("\nGenerated Output:\n", text_output[0])
print(f"\nProcessing Time: {processing_time:.2f} seconds")

# Confirm file saving
print(f"\nOutput saved to {output_text_path} and {output_json_path}")

DrewThomasson avatar Feb 27 '25 05:02 DrewThomasson

Testing Result

(base) drew@wmughal-CN4D09397T test % python test.py
Loading checkpoint shards: 100%|████████████████████████████| 4/4 [00:00<00:00,  5.95it/s]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Output is not valid JSON, saved as plain text.

Generated Output:
 {"primary_language":"en","is_rotation_valid":true,"rotation_correction":0,"is_table":false,"is_diagram":false,"natural_text":"Molmo and PixMo:\nOpen Weights and Open Data\nfor State-of-the-Art Multimodal Models\n\nMatt Deitke∗†ψ Christopher Clark∗† Sangho Lee† Rohun Tripathi† Yue Yang†\nJae Sung Parkψ Mohammadreza Salehiψ Niklas Muennighoff† Kyle Lo† Luca Soldaini†\nJiasen Lu† Taira Anderson† Erin Bransom† Kiana Ehsani† Huong Ngo†\nYenSung Chen† Ajay Patel† Mark Yatskar† Chris Callison-Burch† Andrew Head†\nRose Hendrix† Favyen Bastani† Eli VanderBilt† Nathan Lambert† Yvonne Chou†\nArnavi Chheda† Jenna Sparks† Sam

Processing Time: 3249.81 seconds

Output saved to output.txt and output.json
(base) drew@wmughal-CN4D09397T test % 

DrewThomasson avatar Feb 27 '25 15:02 DrewThomasson

Almost an hour to process a page, yikes!

jakep-allenai avatar Feb 27 '25 20:02 jakep-allenai

yup, and it didn't even generate the text of the full thing, only got like a paragraph out of the model

Perhaps it can be quantized or something and run with llama.cpp, but I don't know if it's a vision model or not, so 🤷

DrewThomasson avatar Feb 27 '25 22:02 DrewThomasson

> yup, and it didn't even generate the text of the full thing, only got like a paragraph out of the model
>
> Perhaps it can be quantized or something and run with llama.cpp, but I don't know if it's a vision model or not, so 🤷

Hey @DrewThomasson, we have a GGUF version on HF if you want to try. Link: https://huggingface.co/allenai/olmOCR-7B-0225-preview-GGUF

aman-17 avatar Mar 04 '25 00:03 aman-17

> yup, and it didn't even generate the text of the full thing, only got like a paragraph out of the model
>
> Perhaps it can be quantized or something and run with llama.cpp, but I don't know if it's a vision model or not, so 🤷

Same here, I only got part of the data and the JSON file was not generated. The good thing is that it now took advantage of system memory, and the out-of-GPU-memory error was gone.

[Edit:] I tried increasing max_new_tokens from 200 to 500 and was able to get a lot more data in the output. Total CPU+GPU memory consumption stayed unchanged, but generation took correspondingly longer, and I hit a CUDA out-of-memory error when I set max_new_tokens to 1500. [Edit:] I changed max_new_tokens to 800 and was able to squeeze out even more data, but processing time nearly doubled.
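
As a rough sketch of how one might detect that kind of truncation programmatically (reusing the output, inputs, and processor objects from the script above; this is not part of olmocr):

def hit_token_budget(output, inputs, processor):
    # If the last generated token is not EOS, generation was stopped by
    # max_new_tokens and the page text is almost certainly truncated.
    generated = output[0, inputs["input_ids"].shape[1]:]
    return generated[-1].item() != processor.tokenizer.eos_token_id

# e.g. after model.generate(...):
# if hit_token_budget(output, inputs, processor):
#     print("Output truncated; retry with a larger max_new_tokens.")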

lovebaby avatar Mar 10 '25 10:03 lovebaby

> yup, and it didn't even generate the text of the full thing, only got like a paragraph out of the model. Perhaps it can be quantized or something and run with llama.cpp, but I don't know if it's a vision model or not, so 🤷
>
> Hey @DrewThomasson, we have a GGUF version on HF if you want to try. Link: https://huggingface.co/allenai/olmOCR-7B-0225-preview-GGUF

How are we supposed to use that? I think it won't work on Ollama, right?

lovebaby avatar Mar 10 '25 10:03 lovebaby

> Testing Result
>
> [CPU test output quoted from the comment above]
>
> Processing Time: 3249.81 seconds

Hi, this is the speed of your code running on a single RTX 4090 with 24 GB of VRAM; it is much faster:

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 21.23it/s]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Output is not valid JSON, saved as plain text.

Generated Output:
 {"primary_language":"en","is_rotation_valid":true,"rotation_correction":0,"is_table":false,"is_diagram":false,"natural_text":"Molmo and PixMo:\nOpen Weights and Open Data\nfor State-of-the-Art Multimodal Models\n\nMatt Deitke†ψ, Christopher Clark†ψ, Sangho Lee†, Rohun Tripathi†, Yue Yang†\nJae Sung Parkψ, Mohammadreza Salehiψ, Niklas Muennighoff†, Kyle Lo†, Luca Soldaini†\nJiasen Lu†, Taira Anderson†, Erin Bransom†, Kiana Ehsani†, Huong Ngo†\nYenSung Chen†, Ajay Patel†, Mark Yatskar†, Chris Callison-Burch†, Andrew Head†\nRose Hendrix†, Favyen Bastani†, Eli VanderBilt†, Nathan Lambert†,

Processing Time: 11.24 seconds

Output saved to output.txt and output.json 

Jacexm avatar Mar 10 '25 13:03 Jacexm

Closing this issue for now.

aman-17 avatar Jul 16 '25 18:07 aman-17