data-prep-kit

[Feature] Provide an operator that loads files content to parquet

Open touma-I opened this issue 1 year ago • 2 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

We currently have three transforms: HTML2Parquet, PDF2Parquet and Code2Parquet. As a user, I want to be able to specify any file type (.txt, .image, .py, whatever) and have its content loaded to parquet.

Assumption: This stage does not try to understand the blob in the loaded file. It is assumed that other transforms in the next stage will understand the content type and process it appropriately; the first stage simply loads the content to parquet.
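To make the assumption concrete, here is a rough sketch of what such a loader could look like. It is not the project's implementation: `files_to_rows`, `write_parquet`, and the column names (`filename`, `extension`, `size`, `contents`) are all hypothetical, and the parquet step assumes `pyarrow` is installed. The point is only that every file is read as opaque bytes, one file per row, with no attempt to interpret it.

```python
from pathlib import Path


def files_to_rows(root, extensions=None):
    """Return one row per file under `root`, keeping contents as opaque bytes.

    Hypothetical helper: `extensions` is an optional set such as {".txt", ".py"};
    None means "load every file".
    """
    wanted = None if extensions is None else {e.lower() for e in extensions}
    rows = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if wanted is not None and path.suffix.lower() not in wanted:
            continue
        data = path.read_bytes()  # no decoding, no parsing: just the blob
        rows.append({
            "filename": path.relative_to(root).as_posix(),
            "extension": path.suffix.lower(),
            "size": len(data),
            "contents": data,
        })
    return rows


def write_parquet(rows, out_path):
    """Write the rows to a parquet file (assumes pyarrow is available)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), out_path)
```

A later transform could then dispatch on the `extension` column (or sniff `contents`) to decide how to process each row.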

Question: cc @nirmdesai If the file is an aggregate of multiple files (e.g. a .tar), do we want its content untarred, with each file in a separate row? If the file is compressed (e.g. a .zip), do we want it unzipped?

cc: @shahrokhDaijavad, please capture any additional information you have in a reply to this issue. I want to make sure all the points for discussion are captured here. Thanks.

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

touma-I, Aug 26 '24 15:08

Additional info: The zip2parquet implementation by Boris in PR #525 (not merged) is a superset of Code2Parquet. In its default mode it acts exactly like Code2Parquet (on a zip of code files), but when a command-line flag is set it can also handle a zip of .txt files.

shahrokhDaijavad, Aug 26 '24 17:08

So perhaps zip2parquet needs a configurable set of extensions that selects which files from the zip are imported, with one file per row and a column indicating the source file name within the zip. The default could just be all files, I suppose.
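Roughly what I have in mind, as a sketch only. This is not the code from PR #525; `zip_to_rows` and the column names are made up for illustration, and it uses nothing beyond the standard-library `zipfile` module. The resulting rows could then be written to parquet with whatever writer the transform already uses.

```python
import io
import zipfile
from pathlib import PurePosixPath


def zip_to_rows(zip_bytes, archive_name, extensions=None):
    """Return one row per file inside a zip archive.

    Hypothetical helper: `extensions` is an optional set such as {".txt", ".py"};
    None (the default) imports every file in the archive.
    """
    wanted = None if extensions is None else {e.lower() for e in extensions}
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            ext = PurePosixPath(info.filename).suffix.lower()
            if wanted is not None and ext not in wanted:
                continue
            rows.append({
                "archive": archive_name,     # which zip the row came from
                "filename": info.filename,   # source file name within the zip
                "extension": ext,
                "contents": zf.read(info),   # raw bytes, not interpreted
            })
    return rows
```

For example, `zip_to_rows(data, "sample.zip", {".txt"})` would keep only the .txt members, while omitting the third argument keeps everything.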

daw3rd, Aug 26 '24 23:08