Docling2parquet support for xlsm input files (an extension of the current support for xlsx)
Search before asking
- [x] I searched the issues and found no similar issues.
Component
transforms/pdf2parquet
Feature
The client engineering team is asking for this feature.
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
@dolfim-ibm For your attention.
Can I open a PR for this on docling repo?
The major difference between xlsx and xlsm is that xlsm can contain VBA, which is essentially an automation script. The problem is that Docling's ingestion is tightly bound to the DoclingDocument template for its output format, so even if you extract the VBA from an xlsm file, you cannot store it because of that template. One would need to modify DoclingDocument in the docling_core repo. A simple, temporary workaround is to ingest xlsm files by mapping them to the same backend as xlsx. Macros/VBA won't be extracted or stored, but xlsm ingestion will work in the meantime, and we can then decide properly which VBA details need to be extracted so that the format becomes fully supported. This needs to be discussed with the maintainers of the core docling repo before changing anything in DoclingDocument.
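To make the xlsx/xlsm difference concrete: both are ZIP-based packages, and a macro-enabled workbook conventionally carries its VBA project as a part named xl/vbaProject.bin. Below is a minimal sketch, using only the Python standard library, of checking for that part. The function name is made up for illustration and this is not Docling code.

```python
import zipfile

# Conventional location of the VBA project inside a macro-enabled workbook.
VBA_PART = "xl/vbaProject.bin"


def has_vba_project(source) -> bool:
    """Return True if the workbook package contains a VBA project part.

    `source` may be a file path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(source) as package:
        return VBA_PART in package.namelist()
```

A check like this could tell us how many of the client's xlsm files actually carry macros before we decide whether VBA extraction is needed at all.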
TL;DR: even if you extract VBA features from xlsm, you cannot store them in Docling's format, because the DoclingDocument template lives in docling_core, which is another repo. So the quickest way to bring this feature to life is to treat xlsm as xlsx and map the input to the same backend, ingesting it without VBA features.
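As a rough sketch of the idea (not Docling's actual code): the workaround amounts to one extra entry in an extension-to-format table, so that .xlsm resolves to the same backend as .xlsx. The table and function names below are hypothetical, chosen only to illustrate the mapping.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension table; names are illustrative, not Docling's API.
EXTENSION_TO_FORMAT = {
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".xlsm": "xlsx",  # macro-enabled workbook routed to the xlsx backend; VBA is ignored
}


def resolve_format(filename: str) -> Optional[str]:
    """Return the format key for a file name, or None if it is unsupported."""
    suffix = Path(filename).suffix.lower()
    return EXTENSION_TO_FORMAT.get(suffix)
```

With this, resolve_format("budget.xlsm") and resolve_format("budget.xlsx") give the same key, so both files would go through the same ingestion path.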
@shahrokhDaijavad Just out of curiosity, do they really need VBA/macros for the ingestion part, or do they just want those files to be ingested? I think the latter might be easier and quicker.
Thanks, @ShiroYasha18. I don't know whether the IBM team asking for xlsm wants to use VBA features or not, but the workaround you are proposing (without VBA support) sounds useful to me.
Thanks a lot for your comment, sir @shahrokhDaijavad! I have opened a PR for this on docling's repo describing the change. The PR link is: https://github.com/docling-project/docling/pull/1520
Also, when you added support for pptx and csv as we moved from pdf2parquet to docling2parquet, can you please tell me which PR that was? I will try to reproduce the xlsx support, which I pointed out earlier is missing, and if my PR gets merged I will write the bridge support for xlsm too. Also, are we using the latest version of docling?
@ShiroYasha18 When we moved from pdf2parquet to docling2parquet (PR #1233), we didn't change anything other than the names of files and classes, so no new features were added. Although the README doesn't mention xlsx support, my assumption was that it was an oversight and since it supported pptx and docx, it would have supported xlsx. It would be great if you could verify that. Also, we are not using the latest version of docling! The version we use is here: https://github.com/data-prep-kit/data-prep-kit/blob/dev/transforms/language/docling2parquet/requirements.txt and is quite a few versions behind: https://github.com/docling-project/docling/blob/main/CHANGELOG.md @dolfim-ibm What is your recommendation for us in updating Docling2parquet in DPK to the latest?
@shahrokhDaijavad Sir, I tested the existing code with xlsx; it is supported just like the other Docling formats. Do I need to open a PR for this, or can you update the README directly? Opening PRs for such small changes might not be preferred by repo maintainers, I think, but it can be done if directed.
For other patches, like the one in my open PR #1235, we would need to update the docling version used here so that I can test out DL_PARSE v4 and other tools. Similarly, for the import of SmolDocling in #1145 we might need the latest docling version in DPK. And yes, the xlsm support depends on my PR being merged in the docling repo.
Thank you for verifying xlsx, @ShiroYasha18. I added it to the README as part of another open PR (#1248). I will wait for @dolfim-ibm's input on updating the docling version.
Yes, we should definitely upgrade the Docling version and expose the pipeline options for selecting the VLM pipeline using SmolDocling.
Regarding the change of values, I'm still suggesting to
- evaluate and report your performance finding as a Docling issue, such that it can be resolved for everybody upstream
- keep alignment with the default Docling options, otherwise we will get into issues like "in DPK it does X and in plain Docling it does Y" (but happy to discuss it).
@dolfim-ibm Thank you so much for the clarification, sir. Understood: I will open an issue in docling first so that it gets updated in DPK later on. Give me some time and I will evaluate, document, and report the PDF backend speeds properly in a new Docling issue. Apologies for my misunderstanding; I completely understand your point 2 now, as having different behavior in DPK and Docling would not be in the interest of simplicity, or of IBM itself, I think.
Meanwhile, I would be really grateful if you could have a look at this: 1. https://github.com/docling-project/docling/pull/1520 This is the PR I submitted in the docling repo itself; it is the fix I mentioned to ingest xlsm for now. I have discussed it in this issue as well as in the PR description. Please do have a look, as I cannot add reviewers to PRs in the docling repo either.
- I think I am a bit confused about this whole docling version issue. As far as I understand, the fact that "requirements.txt" contains an older version does not necessarily mean that the docling2parquet code in DPK uses the old version, since it uses whatever version of docling is installed. So if we change the requirements.txt, would future users, and the people who already have this installed, then be using the latest docling version?
@ShiroYasha18 The users of DPK transforms either run make venv in the transform directory to use the transform locally, or use the packages that we have published on PyPI, which get installed via pip install, e.g., when using notebooks that use these transforms. In both cases, the requirements.txt is used to satisfy all the prerequisite packages (including their version numbers).
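For anyone who wants to check which version their environment actually has against a pinned requirement, here is a minimal sketch using only the standard library. It assumes a simple "name==version" line; the helper names are hypothetical, and version ranges or environment markers are deliberately not handled.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_pin(line: str) -> Optional[Tuple[str, str]]:
    """Parse a 'name==version' requirement line; return None for anything else."""
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    if "==" not in line:
        return None
    name, version = line.split("==", 1)
    return name.strip(), version.strip()


def installed_matches_pin(line: str) -> Optional[bool]:
    """Compare the installed version with the pin.

    Returns None if the line is not an exact pin or the package is not installed.
    """
    pin = parse_pin(line)
    if pin is None:
        return None
    name, pinned = pin
    try:
        return metadata.version(name) == pinned
    except metadata.PackageNotFoundError:
        return None
```

Running installed_matches_pin on the docling line copied from the transform's requirements.txt would show whether the local venv matches what DPK expects.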
Hello @shahrokhDaijavad! My PR for xlsm file support got merged in the docling repo. The link is https://github.com/docling-project/docling/pull/1520 and I think we can close this issue now. :)
That's great, @ShiroYasha18! Congratulations!
Now, we can focus on #1253, #1145 and your PR #1235.