olmocr Training command and process

It would be great if the commands and process for training the models were made available. I guess some of us are trying to train smaller Qwen VL variants, which could be really useful for resource-constrained devices. However, the process for training the model is not clear.

Great work with the entire project.

Mar 13 '25 05:03 sovit-123

Hey, yeah, sorry. To package up the dataset on Hugging Face we put it into a different format, which isn't well supported by the tools in this repo yet. You can download everything from https://huggingface.co/datasets/allenai/olmOCR-mix-0225 , extract all the tarballed PDFs, and then do the opposite of what https://github.com/allenai/olmocr/blob/main/olmocr/train/hf/convertjsontoparquet.py is doing to get the parquets back to JSON. Then you can use the training code.

Or, if you have custom data to fine-tune on, you can use https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py to build queries for OpenAI's API, then run them.

After olmocr-bench is done, we will train a new model and update the training code to be simpler. Likely we will have some nice new features too.
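
The extraction step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not part of the olmocr tooling: it assumes the tarballs have already been downloaded to a local folder and end in `.tar.gz`, and the function name `extract_pdf_tarballs` is hypothetical.

```python
import tarfile
from pathlib import Path


def extract_pdf_tarballs(tarball_dir: str, out_dir: str) -> int:
    """Extract every *.tar.gz in tarball_dir into out_dir.

    Returns the number of PDF files present under out_dir afterwards.
    The .tar.gz suffix is an assumption about how the tarballs are named.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for tarball in sorted(Path(tarball_dir).glob("*.tar.gz")):
        with tarfile.open(tarball, "r:gz") as tar:
            # Use the safer "data" extraction filter where this Python has it.
            if hasattr(tarfile, "data_filter"):
                tar.extractall(out, filter="data")
            else:
                tar.extractall(out)
    return sum(1 for _ in out.rglob("*.pdf"))
```

Calling `extract_pdf_tarballs("pdf_tarballs", "pdfs")` (both folder names are placeholders) would unpack everything into one directory tree, which is what the later matching step needs.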

Mar 13 '25 16:03 jakep-allenai

Thanks. Yes, I guessed the format is somewhat different from that given in the HF dataset, as I saw that the YAML files and the datasets are probably getting loaded from S3.

Just to be sure, is this the training command?

python -m olmocr.train.train --model.name_or_path Qwen/Qwen2-VL-2B --model.arch causal --train_data.sources allenai/olmOCR-mix-0225

If so, could you please confirm here which paths should go in --train_data.sources?

Mar 14 '25 01:03 sovit-123

Once you unpack the PDFs and convert the parquet back to JSON (sorry, I don't have a guide for this step yet), you can see the commands used in https://github.com/allenai/olmocr/blob/main/scripts/qwen2vl-7b-gantry.sh for an example.

The core of it is here: python -m olmocr.train.train -c olmocr/train/config/qwen2vl-7b.yaml --num_proc 64

But if you want 8 GPUs, you'll need to use accelerate as in the shell script.

And then edit your config file https://github.com/allenai/olmocr/blob/main/olmocr/train/config/qwen2vl-7b.yaml

Mar 14 '25 04:03 jakep-allenai

Thanks. Will try it out.

Mar 14 '25 04:03 sovit-123

Hello @jakep-allenai, are there any instructions for finetuning? What will be the steps if I need to fine-tune the model with custom data?

Apr 22 '25 15:04 Dyopala-Sushil

Sorry, the team has been busy with the olmocr-benchmark and other things. We will have easy training commands available once the benchmark is done!

Apr 22 '25 17:04 jakep-allenai

@jakep-allenai Thanks for responding. Just to confirm, it is possible to fine-tune with custom data using the available scripts, right? If yes, could you provide an overview of how to do this?

Apr 23 '25 09:04 Dyopala-Sushil

It is possible using https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py, but you'll have to prepare your data in the way the dataloaders in there expect. You are welcome to reproduce what we did to train the first preview model, but we aren't going to be supporting/documenting this pathway until the next version is out.

Apr 23 '25 16:04 jakep-allenai

After converting Parquet back to JSON, the resulting JSON only contains the fields "url", "page_number", and "response". It seems that mapping the fields requires a SQLite file. Is there any more detailed documentation or explanation about how this mapping works and how the SQLite file should be used?

May 16 '25 10:05 MIracleyin

Hey, yeah, sorry, the train code is not in a happy state right now, but now that our benchmark is done, we hope to improve things. Basically, olmocr-mix was made by prompting GPT-4o through the OpenAI batch API. So what the train code requires right now is data in the format that the OpenAI batch API returns. (For the launch, we repackaged this into the Hugging Face format.)

So, in the train config, when you see response_glob_path: s3://ai2-oe-data/jakep/pdfdata/openai_batch_done_v5_1_eval/*.json, that is looking for JSON result files that look like:

{"id": "batch_req_66fdb8bd11f081908686975300feed21", "custom_id": "[s3 path to pdf file]-[pagenum]", "response": {"status_code": 200, "request_id": "d701000a72b92b55159f167408429a2b", "body": {"id": "chatcmpl-AE0iDk9HIb2hrRHNKv0QyIdOT4KvO", "object": "chat.completion", "created": 1727902693, "model": "gpt-4o-2024-08-06", "choices": [{"index": 0, "message": {"role": "assistant", "content": "{\"primary_language\":\"en\",\"is_rotation_valid\":true,\"rotation_correction\":0,\"is_table\":false,\"is_diagram\":false,\"natural_text\":\"SECTION 8: EXPOSURE CONTROLS/PERSONAL PROTECTION\\n\\nEye Protection:\\nWear eye protection with side shields or goggles. Wear indirect-vent, impact and splash resistant goggles when working with liquids. If additional protection is needed for entire face, use in combination with a face shield.\\n\\nSkin Protection:\\nUse of gloves approved to relevant standards made from the following materials may provide suitable chemical protection: PVC, neoprene or nitrile rubber gloves. Suitability and durability of a glove is dependent on usage, e.g. frequency and duration of contact, chemical resistance of glove material, glove thickness, dexterity. Always seek advice from glove suppliers. Contaminated gloves should be replaced. Use of an apron and over-boots of chemically impervious materials such as neoprene or nitrile rubber is recommended to avoid skin sensitization. The type of protective equipment must be selected according to the concentration and amount of the dangerous substance at the specific workplace. Launder soiled clothes or properly disposed of contaminated material, which cannot be decontaminated.\\n\\nRespiratory Protection:\\nIf engineering controls do not maintain airborne concentrations to a level which is adequate to protect worker, a respiratory protection program that meets or is equivalent to OSHA 29 CFR 1910.134 and ANSI Z88.2 should be followed. Check with respiratory protective equipment suppliers.\\n\\nAppropriate Eng...

The model is then trained on your prompt (which looks up the PDF from the "custom_id" path so that it can generate the DOCUMENT-ANCHORING prompt) plus the JSON output shown here, e.g. {"primary_language"...... And of course it needs access to the PDFs in the right places, which is why it's all so complicated, unfortunately.
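
To make the record layout concrete, here is a minimal sketch of pulling the three pieces out of one batch-result line: the PDF path, the page number, and the target JSON. The field names come from the sample record above. `PageTarget` and `parse_batch_line` are hypothetical names used for illustration, and this is not the repo's actual dataloader.

```python
import json
from typing import NamedTuple


class PageTarget(NamedTuple):
    pdf_path: str   # path portion of custom_id
    page_num: int   # page number portion of custom_id
    target: dict    # parsed JSON the model is trained to emit


def parse_batch_line(line: str) -> PageTarget:
    """Parse one line of an OpenAI-batch-style result file."""
    record = json.loads(line)
    # custom_id is "[path to pdf]-[pagenum]". Split on the LAST hyphen,
    # because the path itself may contain hyphens.
    pdf_path, page = record["custom_id"].rsplit("-", 1)
    # The assistant message content is itself a JSON-encoded string.
    content = record["response"]["body"]["choices"][0]["message"]["content"]
    return PageTarget(pdf_path, int(page), json.loads(content))


# Made-up example record with the same nesting as the sample above.
example = json.dumps({
    "custom_id": "s3://some-bucket/my-doc.pdf-52",
    "response": {"status_code": 200, "body": {"choices": [{
        "index": 0,
        "message": {"role": "assistant",
                    "content": json.dumps({"primary_language": "en",
                                           "natural_text": "hello"})},
    }]}},
})
parsed = parse_batch_line(example)
print(parsed.pdf_path, parsed.page_num, parsed.target["primary_language"])
```

Splitting from the right matters because both the S3 paths and the file names in this dataset contain hyphens of their own.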

Though all the data IS in the Hugging Face train mix, it's just painful to unpack it back to this OpenAI batch format.

What we need to do is clean up the trainer to use the Hugging Face format directly, and expand things so it's easy to fine-tune your own models.

May 20 '25 20:05 jakep-allenai

Yes, the core issue at the moment is how to establish the mapping between the JSON and the PDFs. I noticed that in the parquet files you saved on Hugging Face, the id is composed of a hash-id, and the PDFs in pdf_tarballs are also named using the hash-id. Does this mean that the hash-id serves as the mapping between the PDFs and the JSON? I would be happy to help improve the trainer to make it compatible with the Hugging Face format. Thank you for your excellent work!

May 21 '25 00:05 MIracleyin

Yes, the id field is the mapping between the PDFs in the tarballs and the JSON in the Hugging Face dataset. Would be happy to have a community contribution here.
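
As a rough sketch of that mapping, the snippet below indexes the extracted PDFs by file name and looks each dataset id up in the index. It assumes each PDF is stored as `<id>.pdf` somewhere under a local extraction folder. The helper names `index_pdfs` and `match_ids` are hypothetical and not part of olmocr.

```python
from pathlib import Path


def index_pdfs(pdf_root: str) -> dict[str, Path]:
    """Map each PDF's stem (file name minus the final .pdf) to its path."""
    return {p.stem: p for p in Path(pdf_root).rglob("*.pdf")}


def match_ids(ids: list[str], pdf_index: dict[str, Path]):
    """Split ids into those with a matching PDF and those without."""
    matched = {i: pdf_index[i] for i in ids if i in pdf_index}
    missing = [i for i in ids if i not in pdf_index]
    return matched, missing
```

Note that `Path.stem` strips only the last suffix, so a file whose name already contains `.pdf` before the page number still yields the full id once the trailing `.pdf` is removed. Checking the `missing` list is a quick way to see whether any subset of the data failed to extract.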

May 21 '25 15:05 jakep-allenai

After checking the datasets, I found that the IDs in the s2pdf data (example: 8e99d7b772b3a792a33f0de6849a5c14fb252767-4) are correct and can be matched, but the iabook data cannot be matched (example: direct-put-2024-03-05-12-50-00-f1a4b703-2f8c-4b3b-bc79-742f3c311821.pdf-52). Does this mean that the iabook data is not open-sourced? Will this affect the performance of the model?

May 22 '25 03:05 MIracleyin

Hmm, the iabook data should be there. I see direct-put-2024-03-05-12-50-00-f1a4b703-2f8c-4b3b-bc79-742f3c311821.pdf-52.pdf in the local folder where I gzipped everything from. I will check again tomorrow to see whether, for example, it didn't get uploaded.

May 22 '25 03:05 jakep-allenai

Just redownloaded and extracted the full set from Hugging Face, and direct-put-2024-03-05-12-50-00-f1a4b703-2f8c-4b3b-bc79-742f3c311821.pdf-52.pdf is in there too, so it should be there!

May 22 '25 16:05 jakep-allenai

I found it in the extracted files, thank you for your response. I will try to reproduce your work as soon as possible.

May 23 '25 06:05 MIracleyin

@jakep-allenai @MIracleyin Hello, are there any updates on a stable, official training script? Thanks!

Jun 23 '25 04:06 kimvutht

Being worked on: https://github.com/allenai/olmocr/tree/jakep/new_trainer

Jun 23 '25 16:06 jakep-allenai

I encountered the "Only batch size 1 is supported for now" error even though my batch_size is 1 (I have 8 GPUs). How should I resolve that? Thanks.

Jun 24 '25 04:06 kimvutht

Gotta wait a bit more, jammed with other work this week, sorry. But lots of good things coming soon.

Jun 24 '25 16:06 jakep-allenai

It's finally merged into master! Check out https://github.com/allenai/olmocr/tree/main/olmocr/train

I'll add more instructions soon. In general, call prepare_olmocrmix.py to grab the data from HF, then call train.py with a config. The latest 0.2.0 models were trained with exactly this pipeline.

Jul 23 '25 15:07 jakep-allenai

More docs just added as well.

Jul 24 '25 18:07 jakep-allenai

Thanks a lot. I think I will be able to take a look at it this weekend.

Jul 25 '25 00:07 sovit-123