[Feature] Docling2parquet needs to handle PDF files with mathematical formulas
Search before asking
- [x] I searched the issues and found no similar issues.
Component
transforms/docling2parquet
Feature
When testing the native docling package with the pdf file attached and using the enrichment features: https://docling-project.github.io/docling/usage/enrichments/, we found that it didn't handle the equations properly. We reported this to the docling team and they fixed it today.
We need to bring the --enrich-formula option to docling2parquet!
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
@ShiroYasha18 Is this something you can handle? We now have a real use case, in which we need this feature of docling in DPK.
Sure @shahrokhDaijavad, thanks for tagging me. I'll look into this issue and see how best it can be handled.
Hello, a quick update:
I tested --enrich-formula in docling and bridged it to DPK. The bridge is done via PdfPipelineOptions. However, I have observed that for these PDFs the formulas are not really being decoded. I also tested --enrich-formula directly in docling, and the results were the same. For example:
Newton2 (2).pdf is the original PDF.
Newton2_enhanced_formulas.docx is the output after using --enrich-formula; clearly all the formulas are left out / not recognised.
To confirm this, I tested it with some more PDFs.
In those, the formulas are extracted, but they are not organised properly.
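For reference, a minimal sketch of what such a bridge could look like. The `do_formula_enrichment` field on `PdfPipelineOptions` is the one the docling enrichment docs describe; the `--enrich-formula` flag on the transform side and the helper names `build_arg_parser` and `make_converter` are assumptions for illustration, not the actual docling2parquet code.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Parser with a hypothetical transform-side flag mirroring docling's --enrich-formula."""
    parser = argparse.ArgumentParser(description="formula-enrichment bridge sketch")
    parser.add_argument(
        "--enrich-formula",
        action="store_true",
        help="enable docling formula enrichment (assumed flag name)",
    )
    return parser


def make_converter(enrich_formula: bool):
    """Build a docling DocumentConverter with formula enrichment switched on or off."""
    # Imported lazily so the argument parsing above works without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_formula_enrichment = enrich_formula

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

With this shape, the transform would call `make_converter(args.enrich_formula)` once and reuse the converter for every PDF it processes.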
cc : @shahrokhDaijavad
Thanks for the update, @ShiroYasha18. If I apply --enrich-formula to the above PDF file and create an HTML file with just that option, I get an HTML output file that has the formulas, but not the images. I assume that by adding more options I can get the images too, because @dolfim-ibm created such an output HTML file. It looks like GitHub doesn't allow me to attach HTML files, but in any case, the formulas are supported when using PDF as input and HTML as output.
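A possible Python equivalent of the PDF-to-HTML run, as a sketch only. The exact options used to produce the HTML file with images are not stated in this thread; enabling `generate_picture_images` and exporting with `ImageRefMode.EMBEDDED` is an assumption about what would bring the images in. The helper names `html_output_path` and `pdf_to_html_with_formulas` are hypothetical.

```python
from pathlib import Path


def html_output_path(pdf_path: str, out_dir: str = ".") -> Path:
    """Derive the HTML output path from the PDF file name."""
    return Path(out_dir) / (Path(pdf_path).stem + ".html")


def pdf_to_html_with_formulas(pdf_path: str, out_dir: str = ".") -> Path:
    """Convert one PDF to HTML with formula enrichment enabled."""
    # Imported lazily so the path helper above works without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling_core.types.doc import ImageRefMode

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_formula_enrichment = True
    # Assumption: picture images must be generated for them to appear in the HTML.
    pipeline_options.generate_picture_images = True

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    result = converter.convert(pdf_path)

    out_path = html_output_path(pdf_path, out_dir)
    # Embed images directly in the HTML file rather than referencing external files.
    result.document.save_as_html(out_path, image_mode=ImageRefMode.EMBEDDED)
    return out_path
```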
So basically, adding a few more options and changing the output type to HTML makes --enrich-formula work, so it is working just fine, right? We just need a defined set of options to use.
If I am understanding things correctly, I will open the PR for the bridge, which I have already completed. If I can get the exact options for PDF to HTML such that the output file contains formulas, I can also push a README/example/notebook for it so that it becomes an easy reference for the future.
You got it. Perfect!