data-prep-kit

Migration from pdf2parquet to docling2parquet

Open shahrokhDaijavad opened this issue 8 months ago • 4 comments

Search before asking

  • [x] I searched the issues and found no similar issues.

Component

transforms/pdf2parquet

Feature

The name pdf2parquet is no longer appropriate, since the underlying Docling package can handle many more input formats than PDF, e.g., DOCX, XLSX, PPTX, CSV, ...

In the first phase, we rename the transform itself. In the second phase, we update the notebooks that were using the transform under its old name.

Are you willing to submit a PR?

  • [x] Yes I am willing to submit a PR!

shahrokhDaijavad avatar Apr 02 '25 00:04 shahrokhDaijavad

@shahrokhDaijavad hello sir! Is the migration from pdf2parquet to docling2parquet happening? If so, I would like to contribute to the migration code, if possible, please!

ShiroYasha18 avatar Apr 24 '25 03:04 ShiroYasha18

Hello, @ShiroYasha18. Thank you for your interest. I have submitted a PR for this, but it is not passing a key CI/CD test. I am waiting for help from a key developer, but if you know enough and are interested in helping, your contribution is most welcome.

shahrokhDaijavad avatar Apr 24 '25 04:04 shahrokhDaijavad

Hello @shahrokhDaijavad sir, I was checking your PR. As far as I understand, some kfp test cases are failing, which unfortunately I don't know much about. However, most of the commits I saw there rename pdf2parquet to docling2parquet, but wasn't pdf2parquet mostly for PDFs only? docling2parquet is an awesome idea, but as far as I understand, sir, it might require defining a new pipeline that accepts not only PDFs but also other formats such as CSV and XLSX. Please correct me if I am wrong, but wouldn't it still accept only PDFs, just under a different name?

ShiroYasha18 avatar Apr 24 '25 15:04 ShiroYasha18

@ShiroYasha18 Thanks for your comment. We will fix the kfp tests. I was worried about a test-src failure last night, but it is now passing. We had already changed the code behind pdf2parquet so that it can handle all the other input formats that the Docling package supports (CSV, XLSX, etc.). We still needed to do the renaming, which turned out to be not as straightforward as I thought!

shahrokhDaijavad avatar Apr 24 '25 21:04 shahrokhDaijavad

integrated

swith005 avatar Jun 24 '25 19:06 swith005