data-prep-kit [Feature] Integrate IBM's SmolDocling as another OCR option in pdf2parquet

Search before asking

[x] I searched the issues and found no similar issues.

Component

transforms/pdf2parquet

Feature

Hi, I have been working with pdf2parquet for a couple of months and have tested it on multiple documents for my internship project at IBM. I have seen that traditional OCR fails on handwritten documents and on documents with other issues such as multicolored text and varying font sizes, which is a classic weakness of traditional OCR engines. For my project I used a VLM, but IBM's Docling team recently released SmolDocling, which is pretty impressive. I would like to contribute to this project by integrating it into the pdf2parquet pipeline! This will enable support for PDFs, and maybe images too if an image2parquet comes around in the future.

Are you willing to submit a PR?

[x] Yes I am willing to submit a PR!

Mar 19 '25 17:03 ShiroYasha18

@ShiroYasha18 Thank you. I would like to tag @dolfim-ibm for his thoughts on this. Also, do you happen to have a specific example of sample data that shows where the gap is in today's release of the code and how the proposed enhancement will address it?

Mar 20 '25 19:03 touma-I

You basically have to upgrade the Docling version and reproduce this PR, which exposes the option in the Docling CLI: https://github.com/docling-project/docling/pull/1199/files#diff-bab084dbf6f5d7e3159fc059293894cf3cf58bf0fa70bd154382e03d9ba0184b.

In practice:

Upgrade Docling
Define the new pipeline option
Modify the pdf2parquet code to set the pipeline_options, similar to the PR linked above.
This might also require updating the test results.
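Since the new pipeline option only exists after the Docling upgrade, one way to keep the transform safe is to check the installed version before exposing the option. The following is a minimal sketch under stated assumptions: the helper names and the `MIN_DOCLING` value are hypothetical placeholders, not taken from pdf2parquet or Docling, and the real minimum version would have to be read from the Docling release that contains the PR.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.28.0' into (2, 28, 0); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def supports_vlm(installed: Optional[str], minimum: str) -> bool:
    """True when the installed version is at least the required minimum."""
    if installed is None:
        return False
    have, need = parse_version(installed), parse_version(minimum)
    # Pad with zeros so that '2.28' and '2.28.0' compare as equal.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Placeholder only: NOT the actual release that added the VLM option.
    MIN_DOCLING = "0.0.0"
    print(supports_vlm(installed_version("docling"), MIN_DOCLING))
```

A check like this would let the transform fall back to the existing OCR path, or fail with a clear message, when an older Docling is installed.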

Mar 21 '25 04:03 dolfim-ibm

@ShiroYasha18 Did the above steps help?

Mar 25 '25 04:03 agoyal26

Hello, sorry for the late reply!

@touma-I Thank you for assigning this issue to me. I will submit the PR for it. The main sample data I have tested this on is handwritten answer sheets from educational institutions. Basic OCR does not perform well on such data, so it was expected that the text would not be extracted. Here is a sample image of what I am talking about: [image]

Since SmolDocling is a vision-based model, it improves OCR capabilities by also taking the layout into account, which is a major problem for traditional OCR engines like easyocr. It extracted a good amount of text from this image. As far as I have tested, it works pretty well on handwritten documents such as medical invoices. @dolfim-ibm Thank you for the steps and the guidance. I will look into these and let you know if there are any other problems!

Mar 25 '25 05:03 ShiroYasha18

@ShiroYasha18 I just went through this issue and understood what this is about and how it can be fixed (using the steps by @dolfim-ibm). The reason @touma-I assigned it to me is just to follow-up with you and when you have submitted a PR, review your PR. When do you think you will be able to submit a PR?

Apr 01 '25 22:04 shahrokhDaijavad

Hi @dolfim-ibm, @shahrokhDaijavad,

I wanted to provide a quick update on this issue. First, my apologies for the delayed follow-up. I’ve been working through the implementation details and testing SmolDocling locally to ensure a smooth integration. As I’m relatively new to the codebase, I’ve been taking time to thoroughly understand the workflow (especially referencing PR #1199) to avoid missteps.

That said, I’m treating this as high priority and will submit a PR by next Wednesday (9/04/2025) at the latest. If there are any specific considerations or potential roadblocks I should be aware of, please let me know! I’ll also share incremental updates if that helps.

Thank you for your patience—I’m committed to seeing this through and will make sure it’s done right.

Apr 02 '25 07:04 ShiroYasha18

Thanks, @ShiroYasha18. Sounds good. What is PR #1199 that you are referring to? You must mean a different PR.

Apr 02 '25 14:04 shahrokhDaijavad

The one @dolfim-ibm gave me to look into: https://github.com/docling-project/docling/pull/1199/files#diff-bab084dbf6f5d7e3159fc059293894cf3cf58bf0fa70bd154382e03d9ba0184b

My apologies for the confusion, it's from the docling repo.

Apr 02 '25 15:04 ShiroYasha18

Thanks for the clarification, @ShiroYasha18 !

Apr 02 '25 16:04 shahrokhDaijavad

@shahrokhDaijavad can you please help me with this?

@dolfim-ibm Quick update: I read the code in both repos and looked at the PR you mentioned. I understand that SmolDocling is already integrated in docling, and that pdf2parquet in DPK uses docling directly, without keeping a copy of it in this repo. So technically, following your steps, upgrading the docling version unlocks SmolDocling support. From what I understand, there is no pipeline_options file in this repo because docling is referenced directly. The PR was merged, so I am assuming the pipeline options were updated and I can import the SmolDocling pipeline options in the pdf2parquet code. Where I am stuck is what happens after the import: how are those pipeline options actually used? I can also see that the code in pipeline_options contains the function to call SmolDocling, but how do I use it in the pdf2parquet code?

Apr 04 '25 20:04 ShiroYasha18

Hi, @ShiroYasha18. Sorry that I haven't responded so far. Now that pdf2parquet => docling2parquet transition has completed, can you please experiment with the do_ocr parameter set to true on your example file and see what result you get?

Apr 28 '25 22:04 shahrokhDaijavad

Sorry, @ShiroYasha18. I had not seen PR #1235 !

Apr 29 '25 15:04 shahrokhDaijavad

Hello @shahrokhDaijavad, here is a video of testing the handwritten PDF with OCR using docling2parquet: https://drive.google.com/file/d/1FMtiwgUXkCVuNsTHKF0oHgupSiPyLWnz/view?usp=drive_link

A few observations:

do_ocr does not work on its own; it has to be prefixed with docling2parquet_ to actually run without errors.
OCR (easyocr/tesseract) is not made for handwritten text, so the output parquet is not up to the mark, as seen in the demo video linked above.
OCR is working, but there is an issue with the default OCR: whether do_ocr is set to True or False does not matter. I have described this in detail in #1235 and also in a separate issue, #1239.
I am not sure whether this is a concern, but docling2parquet takes somewhat more resources to run. I am on an M2 MacBook with 8 GB of RAM, and I observed a spike in the IDE's memory consumption that put pressure on memory and made background apps hang. I noticed this because a YouTube music video in Chrome started hanging when the model-initialised log message appeared. This was for just one 2-page PDF; I do not know what the memory consumption would be for PDFs with hundreds of pages, but I think we need to check that!
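To put a number on the memory observation above instead of relying on background apps hanging, the peak resident memory of the process can be read before and after a conversion. This is a minimal sketch using only the Python standard library (Unix-like systems only). The `run_and_report` helper is a hypothetical name, and the workload in the example is a stand-in allocation, not the actual docling2parquet call.

```python
import resource
import sys
from typing import Any, Callable, Tuple


def peak_rss_mib() -> float:
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_and_report(task: Callable[[], Any]) -> Tuple[Any, float, float]:
    """Run task() and return (result, peak MiB before, peak MiB after)."""
    before = peak_rss_mib()
    result = task()
    after = peak_rss_mib()
    return result, before, after


if __name__ == "__main__":
    # Stand-in workload; in practice this would wrap the PDF conversion call.
    _, before, after = run_and_report(lambda: bytearray(50 * 1024 * 1024))
    print(f"peak RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
```

Because this value is a high-water mark, it never goes down, so the difference between the two readings only shows how much the task raised the peak. Running it on a 2-page PDF and then on a much longer one would show whether memory grows with page count.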

So overall, we need a VLM to reach the required OCR quality, but we would also need to manage the time taken for the whole conversion.

Do let me know if you need me to test it on some other PDF too!

Apr 29 '25 16:04 ShiroYasha18

@shahrokhDaijavad, hello. I have been reading the code for some time now, switching continuously between docling and DPK. As the team aims to sync DPK with all the changes in the latest version of docling, one of the significant differences is the addition of VLMs. I was figuring out the backend that maps to the VLMs (including SmolDocling). As far as I understand Docling's code, VLMs are mapped to a completely separate backend pipeline that is used when we opt for a VLM, so options like those in the [image]

and in the current docling2parquet won't be used in the VLM pipeline. Instead, the pages are converted to images and passed to the VLM, which extracts whatever is in them. This is a classic approach and makes sense, as it uses the actual power of the VLM.

From my understanding of the docling repo, they aim to add many more models alongside SmolDocling. Right now I think they also support the Granite Vision model (the 2B version), although only via API or via Ollama (locally). Here is a link to a PR that backs the claim that they are going to add many models in the near future: https://github.com/docling-project/docling/pull/1570 UPDATE: It is already merged; however, they have not written CLI options for it, so as of now (writing on 4/06/2025) we cannot call the other models merged in the above PR from DPK.

Another thing I keep seeing about this VLM pipeline is that, because it requires a lot of compute, they "recommend" using MLX, the Metal-based inference for Apple silicon chips, which increases the speed significantly. They also support CUDA and CPU, and I think everything so far is set to CPU by default, which explains the speeds.

To get things updated, we basically need to define the current pipeline as the standard pipeline for the CLI and add a new VLM pipeline for the CLI.
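The standard/VLM split described above could look roughly like this on the command line. This is only a sketch under assumptions: the `--pipeline` and `--do_ocr` flag names, the `Pipeline` enum, and the `resolve` helper are illustrative and are not the actual docling2parquet or Docling CLI definitions. It encodes the point that OCR options belong to the standard pipeline and are not used when the VLM pipeline is selected.

```python
import argparse
from enum import Enum
from typing import Any, Dict, List, Optional


class Pipeline(str, Enum):
    STANDARD = "standard"  # the current OCR-based conversion path
    VLM = "vlm"            # page images passed to a vision-language model


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="pipeline selection sketch")
    parser.add_argument(
        "--pipeline",
        choices=[p.value for p in Pipeline],
        default=Pipeline.STANDARD.value,
    )
    parser.add_argument("--do_ocr", choices=["true", "false"], default="true")
    return parser


def resolve(argv: Optional[List[str]] = None) -> Dict[str, Any]:
    """Parse arguments and return the settings that apply to the chosen pipeline."""
    args = build_parser().parse_args(argv)
    pipeline = Pipeline(args.pipeline)
    if pipeline is Pipeline.VLM:
        # OCR options are not used by the VLM pipeline, so they are dropped here.
        return {"pipeline": pipeline, "do_ocr": None}
    return {"pipeline": pipeline, "do_ocr": args.do_ocr == "true"}


if __name__ == "__main__":
    print(resolve(["--pipeline", "vlm"]))
```

Keeping the standard pipeline as the default means existing invocations would behave as before, and the VLM path would only be taken when explicitly requested.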

Jun 04 '25 05:06 ShiroYasha18

Thank you for your investigation of this, @ShiroYasha18. This is useful. I think, as DPK moves in the direction of supporting some multimedia transforms (there is an outstanding PR for that), it would be great to have the support for the VLM pipeline that SmolDocling has. @dolfim-ibm has already defined the steps needed to do this (see above). I know there are some tangential discussions in the PR you submitted (#1235), which were reviewed by @dolfim-ibm. Can you please go back to that PR and 1) take into account the comments, and 2) make the required changes that would transition the current docling2parquet to the version that supports VLMs via the latest Docling that has SmolDocling integrated?

Jun 04 '25 18:06 shahrokhDaijavad