olmocr
Is it possible to run on only a CPU?
I know the README says:
"Requirements:
- Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100)
- 30GB of free disk space"
But it also says:
"Install sglang with flashinfer if you want to run inference on GPU."
Does that imply it can also be run on a CPU only (albeit a bit slowly)?
Thanks! :)
At the moment it's not possible via pipeline.py, but you can do it if you run inference on the model directly.
See: https://huggingface.co/allenai/olmOCR-7B-0225-preview
The model card has a code sample showing how to call the model, which will work (slowly) on CPU. But you lose the advantages of the pipeline.py method, like retries and output verification.
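For reference, the direct route boils down to loading the model with plain transformers and never moving it to CUDA. A minimal sketch (prompt building and generation then follow the model card):

import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load in bfloat16 and keep everything on CPU (no .to("cuda") call).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "allenai/olmOCR-7B-0225-preview", torch_dtype=torch.bfloat16
).eval()
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# From here, build the prompt and call model.generate() exactly as in the model card.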
Hm, then I shall wait. Thanks for the detailed response! :)
Edit: testing this running on CPU only on my Mac M1 Pro with 16 GB RAM right now.
Confirmed to work on CPU through the script you pointed me to! :D
(took a while tho lol)
Sadly, the output appears truncated, so something may have gone wrong, looking at it...
(base) drew@wmughal-CN4D09397T test % python test.py
Loading checkpoint shards: 100%|████████████████████████████| 4/4 [00:00<00:00, 6.16it/s]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
['{"primary_language":"en","is_rotation_valid":true,"rotation_correction":0,"is_table":false,"is_diagram":false,"natural_text":"Molmo and PixMo:\\nOpen Weights and Open Data\\nfor State-of-the']
(base) drew@wmughal-CN4D09397T test %
Running this modified script:
import torch
import base64
import urllib.request
import json
import time
from io import BytesIO
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from olmocr.data.renderpdf import render_pdf_to_base64png
from olmocr.prompts import build_finetuning_prompt
from olmocr.prompts.anchor import get_anchor_text
# Start time tracking
start_time = time.time()
# Initialize the model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "allenai/olmOCR-7B-0225-preview", torch_dtype=torch.bfloat16
).eval()
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Grab a sample PDF
pdf_path = "./paper.pdf"
urllib.request.urlretrieve("https://molmo.allenai.org/paper.pdf", pdf_path)
# Render page 1 to an image
image_base64 = render_pdf_to_base64png(pdf_path, 1, target_longest_image_dim=1024)
# Build the prompt using document metadata
anchor_text = get_anchor_text(pdf_path, 1, pdf_engine="pdfreport", target_length=4000)
prompt = build_finetuning_prompt(anchor_text)
# Build the full prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
        ],
    }
]
# Apply the chat template and processor
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
main_image = Image.open(BytesIO(base64.b64decode(image_base64)))
# Prepare inputs for model
inputs = processor(
    text=[text],
    images=[main_image],
    padding=True,
    return_tensors="pt",
)
inputs = {key: value.to(device) for (key, value) in inputs.items()}
# Generate the output
output = model.generate(
    **inputs,
    temperature=0.8,
    max_new_tokens=200,  # Increased to avoid truncation
    num_return_sequences=1,
    do_sample=True,
)
# Decode the output
prompt_length = inputs["input_ids"].shape[1]
new_tokens = output[:, prompt_length:]
text_output = processor.tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
# End time tracking
end_time = time.time()
processing_time = end_time - start_time # Time taken for execution
# Save output to text file
output_text_path = "output.txt"
with open(output_text_path, "w", encoding="utf-8") as f:
    f.write(text_output[0])  # Save the first element as text
# Try saving output as JSON if possible
output_json_path = "output.json"
try:
    parsed_output = json.loads(text_output[0])  # Try parsing as JSON
    with open(output_json_path, "w", encoding="utf-8") as f:
        json.dump(parsed_output, f, indent=4)
    print(f"Output successfully saved as JSON: {output_json_path}")
except json.JSONDecodeError:
    print("Output is not valid JSON, saved as plain text.")
# Print output & processing time
print("\nGenerated Output:\n", text_output[0])
print(f"\nProcessing Time: {processing_time:.2f} seconds")
# Confirm file saving
print(f"\nOutput saved to {output_text_path} and {output_json_path}")
Testing Result
(base) drew@wmughal-CN4D09397T test % python test.py
Loading checkpoint shards: 100%|████████████████████████████| 4/4 [00:00<00:00, 5.95it/s]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Output is not valid JSON, saved as plain text.
Generated Output:
{"primary_language":"en","is_rotation_valid":true,"rotation_correction":0,"is_table":false,"is_diagram":false,"natural_text":"Molmo and PixMo:\nOpen Weights and Open Data\nfor State-of-the-Art Multimodal Models\n\nMatt Deitke∗†ψ Christopher Clark∗† Sangho Lee† Rohun Tripathi† Yue Yang†\nJae Sung Parkψ Mohammadreza Salehiψ Niklas Muennighoff† Kyle Lo† Luca Soldaini†\nJiasen Lu† Taira Anderson† Erin Bransom† Kiana Ehsani† Huong Ngo†\nYenSung Chen† Ajay Patel† Mark Yatskar† Chris Callison-Burch† Andrew Head†\nRose Hendrix† Favyen Bastani† Eli VanderBilt† Nathan Lambert† Yvonne Chou†\nArnavi Chheda† Jenna Sparks† Sam
Processing Time: 3249.81 seconds
Output saved to output.txt and output.json
(base) drew@wmughal-CN4D09397T test %
Almost an hour to process a page, yikes!
Yup, and it didn't even generate the text of the full thing; I only got like a paragraph out of the model.
Perhaps it can be quantized or something and run with llama.cpp, but I don't know if it's a vision model or not, so 🤷
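If transformers-style quantization works for it, maybe something like a 4-bit bitsandbytes load would at least shrink it on a GPU. A totally untested sketch (bitsandbytes needs a CUDA GPU, so this wouldn't help a pure-CPU setup):

import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2VLForConditionalGeneration

# Hypothetical 4-bit load; requires a CUDA GPU and the bitsandbytes package.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "allenai/olmOCR-7B-0225-preview",
    quantization_config=bnb_config,
    device_map="auto",
).eval()
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")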
Hey @DrewThomasson, we have gguf version on HF if you want to try. Link: https://huggingface.co/allenai/olmOCR-7B-0225-preview-GGUF
Same here, I only got part of the data and the JSON file was not generated. The good thing is that it now takes advantage of system memory, and the out-of-GPU-memory error is gone.
[Edit] I tried increasing max_new_tokens from 200 to 500 and was able to get a lot more data in the output. Total CPU+GPU memory consumption stayed unchanged, but I got a CUDA out-of-memory error when I set max_new_tokens to 1500, and generation also took more time. [Edit] I changed max_new_tokens to 800 and was able to squeeze out even more data, but processing time nearly doubled.
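For anyone who wants to measure that trade-off on their own hardware, a quick untested sketch that reuses the model and inputs from the script above and times generation at a few token budgets:

import time

# Time generation at several token budgets to see the speed/completeness trade-off.
for budget in (200, 500, 800):
    t0 = time.time()
    out = model.generate(
        **inputs,
        temperature=0.8,
        max_new_tokens=budget,
        do_sample=True,
    )
    n_new = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"max_new_tokens={budget}: {n_new} new tokens in {time.time() - t0:.1f}s")

(Watch memory as you raise the budget; as noted above, 1500 already caused an out-of-memory error here.)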
How are we supposed to use that GGUF version? I think it won't work on Ollama, right?
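Maybe llama-cpp-python could load it? A totally untested sketch that at least does a text-only smoke test (the file name below is a placeholder; check the actual GGUF file names in the HF repo, and note that image input needs llama.cpp multimodal support, which may not cover Qwen2-VL-based models yet):

from llama_cpp import Llama

# Placeholder file name; use the actual GGUF file from the HF repo.
llm = Llama(
    model_path="./olmOCR-7B-0225-preview-Q4_K_M.gguf",
    n_ctx=8192,
)
# Text-only smoke test; vision input would additionally need an mmproj file
# and a llama.cpp build with multimodal support for this architecture.
out = llm("Say hello in one short sentence.", max_tokens=32)
print(out["choices"][0]["text"])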
Hi, this is the speed of your code running on a single 4090 with 24 GB of VRAM. Much faster:
Loading checkpoint shards: 100%|████████████████████████████| 4/4 [00:00<00:00, 21.23it/s]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Output is not valid JSON, saved as plain text.
Generated Output:
{"primary_language":"en","is_rotation_valid":true,"rotation_correction":0,"is_table":false,"is_diagram":false,"natural_text":"Molmo and PixMo:\nOpen Weights and Open Data\nfor State-of-the-Art Multimodal Models\n\nMatt Deitke†ψ, Christopher Clark†ψ, Sangho Lee†, Rohun Tripathi†, Yue Yang†\nJae Sung Parkψ, Mohammadreza Salehiψ, Niklas Muennighoff†, Kyle Lo†, Luca Soldaini†\nJiasen Lu†, Taira Anderson†, Erin Bransom†, Kiana Ehsani†, Huong Ngo†\nYenSung Chen†, Ajay Patel†, Mark Yatskar†, Chris Callison-Burch†, Andrew Head†\nRose Hendrix†, Favyen Bastani†, Eli VanderBilt†, Nathan Lambert†,
Processing Time: 11.24 seconds
Output saved to output.txt and output.json
Closing this issue for now.