[Feature] Docling2parquet needs to handle PDF files with mathematical formulas

Open shahrokhDaijavad opened this issue 4 months ago • 5 comments

Search before asking

  • [x] I searched the issues and found no similar issues.

Component

transforms/docling2parquet

Feature

When testing the native docling package on the attached PDF file with the enrichment features (https://docling-project.github.io/docling/usage/enrichments/), we found that it didn't handle the equations properly. We reported this to the docling team and they fixed it today. We need to bring the --enrich-formula option to docling2parquet!

Newton2.pdf

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

shahrokhDaijavad avatar Jul 22 '25 15:07 shahrokhDaijavad

@ShiroYasha18 Is this something you can handle? We now have a real use case, in which we need this feature of docling in DPK.

shahrokhDaijavad avatar Jul 22 '25 15:07 shahrokhDaijavad

Sure @shahrokhDaijavad, thanks for tagging me. I'll look into this issue and see how best to handle it.

ShiroYasha18 avatar Jul 22 '25 19:07 ShiroYasha18

Hello, a quick update:

I tested --enrich-formula in docling and tried to bridge it to DPK. The bridge is done via PdfPipelineOptions. However, I have observed that for these PDFs the formulas are not really getting decoded. So I also tested --enrich-formula directly in docling, and the results were the same. For example:

Newton2 (2).pdf is the original PDF.

Newton2_enhanced_formulas.docx is the output after using --enrich-formula. Clearly, all the formulas are left out or not recognised.

To confirm this, I tested it with some more PDFs:

2501.12948v1.pdf

deepseek.docx

In this one the formulas do come through, but they are not organised properly.
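For reference, a minimal sketch of what a bridge through PdfPipelineOptions could look like. This is not the code from the pending PR: `Docling2ParquetFormulaConfig`, `pipeline_kwargs` and `build_converter` are hypothetical names used only for illustration. The one docling-specific assumption is that `do_formula_enrichment` is the pipeline option behind --enrich-formula, as described on the enrichments page linked in the issue.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Docling2ParquetFormulaConfig:
    """Hypothetical DPK-side config; the field name is an assumption."""

    enrich_formula: bool = False


def pipeline_kwargs(config: Docling2ParquetFormulaConfig) -> dict[str, Any]:
    # Map the DPK-side flag onto the docling pipeline option that
    # (per the docling enrichment docs) backs --enrich-formula.
    return {"do_formula_enrichment": config.enrich_formula}


def build_converter(config: Docling2ParquetFormulaConfig):
    # Imported lazily so this module still loads where docling is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions(**pipeline_kwargs(config))
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
```

Keeping the flag-to-option mapping in its own small function makes it easy to unit test without loading any docling models.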

cc: @shahrokhDaijavad

ShiroYasha18 avatar Jul 28 '25 18:07 ShiroYasha18

Thanks for the update, @ShiroYasha18. If I apply --enrich-formula to the above PDF file and create an HTML file, with just that option, I get an HTML output file that has the formulas, but not the images. I assume that by adding more options, I can get the images too, because @dolfim-ibm created such an output HTML file. It looks like GitHub doesn't allow me to attach HTML files, but in any case, the formulas are supported when using PDF as input and HTML as output.
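A rough sketch of the PDF-to-HTML path through the Python API, for anyone who wants to reproduce this. The formula part uses `do_formula_enrichment`. The image part is only a guess at the "more options" mentioned above: `generate_picture_images` together with an embedded image mode is an assumption, not the confirmed set of options behind the HTML file @dolfim-ibm produced. `html_path_for` and `pdf_to_html` are hypothetical helper names.

```python
from pathlib import Path


def html_path_for(pdf_path: str) -> Path:
    # Write the HTML next to the input, swapping only the file extension.
    return Path(pdf_path).with_suffix(".html")


def pdf_to_html(pdf_path: str) -> Path:
    # Imported lazily so this module still loads where docling is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling_core.types.doc import ImageRefMode

    options = PdfPipelineOptions()
    options.do_formula_enrichment = True    # Python-side counterpart of --enrich-formula
    options.generate_picture_images = True  # assumption: keeps picture crops for export

    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
    result = converter.convert(pdf_path)

    out_path = html_path_for(pdf_path)
    # Embed images inline so the HTML is a single self-contained file.
    result.document.save_as_html(out_path, image_mode=ImageRefMode.EMBEDDED)
    return out_path
```

Calling `pdf_to_html("Newton2.pdf")` would write `Newton2.html` beside the input; whether the images actually appear depends on the assumption above holding for the installed docling version.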

shahrokhDaijavad avatar Jul 28 '25 20:07 shahrokhDaijavad

So basically, adding a few more options and changing the output type to HTML will make --enrich-formula work, so it is working just fine, right? We just need a defined set of options to use. If I am understanding things correctly, I will open the PR for the bridge, which I have already completed. If I can get the exact options for PDF to HTML such that the output file contains formulas, I can also push a README/example/notebook for it, so that it becomes an easy reference for the future.

You got it. Perfect!

ShiroYasha18 avatar Jul 29 '25 00:07 ShiroYasha18