data-prep-kit: refactoring code2parquet to zip2parquet

Why are these changes needed?

Modified code2parquet to support both code and language files, and moved it to universal.

Related issue number (if any).

https://github.com/IBM/data-prep-kit/issues/520

Aug 21 '24 12:08 blublinsky

@blublinsky Is this new version completely backward compatible with the old version of code2parquet?

Aug 21 '24 15:08 shahrokhDaijavad

Yes. By default it will work exactly the same as the current version (all the current tests pass). If you specify code_data == False, it will not try to determine which programming language a file is written in; it will just store the file content as is.
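
A minimal sketch of that behavior, not the actual transform code. The `code_data` flag comes from this thread; `build_row`, the column names, and the extension-to-language map are hypothetical and only illustrate the two modes.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-language map; the transform's real mapping is not shown in this thread.
LANGUAGE_BY_EXTENSION = {".py": "Python", ".java": "Java", ".go": "Go"}


def build_row(path: str, contents: bytes, code_data: bool = True) -> dict:
    """Build one output row for a file taken from an archive."""
    row = {
        "document": path,
        "contents": contents.decode("utf-8", errors="replace"),
    }
    if code_data:
        # Code mode (the default): also record a programming language for the file.
        extension = PurePosixPath(path).suffix.lower()
        row["programming_language"] = LANGUAGE_BY_EXTENSION.get(extension, "unknown")
    # With code_data=False the content is stored as is, with no language column.
    return row
```

With `code_data=False` the returned row carries only the path and the content, so non-code files pass through without any language detection.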

Aug 21 '24 15:08 blublinsky

Thanks, @blublinsky. Perfect! From a technical point of view, David needs to approve, but a note to myself that if this is approved, the example Jupyter notebook for code that Shivdeep has created should be modified with the new name and location of the transform.

Aug 21 '24 16:08 shahrokhDaijavad

I'm not sure I think this has to be backwards compatible with code2parquet. As such, I would suggest the following:

code is NOT the default
There are a lot of configuration keys that are code-specific. Maybe we should have a single key and have its value be a dictionary like we do for DataAccessFactory. For example {"programming_language_column" : "some name", }
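
To illustrate the single-key idea, here is a sketch under stated assumptions. Only `programming_language_column` appears in this thread; the `code_config` key, the `detect_language` option, and `resolve_code_config` are hypothetical names.

```python
from typing import Optional

# Hypothetical defaults for the code-specific options grouped under one key.
DEFAULT_CODE_CONFIG = {
    "programming_language_column": "programming_language",
    "detect_language": True,
}


def resolve_code_config(params: dict) -> Optional[dict]:
    """Return the merged code options, or None when code handling was not requested."""
    user_config = params.get("code_config")
    if user_config is None:
        # No code_config key: code handling is off, i.e. code is not the default.
        return None
    unknown = set(user_config) - set(DEFAULT_CODE_CONFIG)
    if unknown:
        raise ValueError(f"unknown code_config keys: {sorted(unknown)}")
    # User-supplied values override the defaults.
    return {**DEFAULT_CODE_CONFIG, **user_config}
```

Grouping the options this way means the presence of the one key both enables code handling and carries all of its settings, instead of spreading them over many top-level keys.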

Separately, we have discussed the ability to add configuration to specify the list of extensions to import - that is, filtering of sorts.

I would like to do more here than just generalize for code and .txt. Some more design seems needed.

Aug 27 '24 22:08 daw3rd

"I'm not sure I think this has to be backwards compatible with code2parquet." This was the initial requirement from @shahrokhDaijavad.

"As such, I would suggest the following:"

"code is NOT the default": Does not matter to me.

"There are a lot of configuration keys that are code-specific. Maybe we should have a single key and have its value be a dictionary like we do for DataAccessFactory. For example {"programming_language_column" : "some name", }": We need them for code support.

"Separately, we have discussed the ability to add configuration to specify the list of extensions to import - that is, filtering of sorts.": The last conversation with @nirmdesai was that we want all files.

Aug 28 '24 08:08 blublinsky

Why is this PR required in the first place? I would suggest to keep the code2parquet module as it is and add new modules as required. The code2parquet module is being used in many places and will break code flows. I would not support this PR.

Aug 28 '24 10:08 Bytes-Explorer

Why is this PR required in the first place? I would suggest to keep the code2parquet module as it is and add new modules as required. The code2parquet module is being used in many places and will break code flows. I would not support this PR.

@Bytes-Explorer please take it up with @nirmdesai . It was his request

Aug 28 '24 12:08 blublinsky

OK, will clarify. Let's not merge this PR until then.

Aug 28 '24 13:08 Bytes-Explorer

@Bytes-Explorer, @blublinsky, @touma-I: Team, since we have various notebooks and other artifacts that already depend on Code2Parquet, we cannot make breaking changes to this transform. It is on me that I did not explicitly clarify this earlier!

For now, it would be best to add Any2Parquet as a separate transform that can read any file content as binary and produce a Parquet file.
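
As a rough sketch of the reading half of such a transform, assuming the input arrives as a zip archive: `zip_to_rows` and the column names are hypothetical, and the Parquet writing step is left out to keep the example dependency-free.

```python
import hashlib
import io
import zipfile
from typing import Dict, List


def zip_to_rows(zip_bytes: bytes) -> List[Dict[str, object]]:
    """Read every file in a zip archive as raw bytes, one row per file."""
    rows: List[Dict[str, object]] = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            contents = archive.read(info.filename)  # raw bytes, never decoded
            rows.append(
                {
                    "filename": info.filename,
                    "contents": contents,
                    "size": len(contents),
                    "sha256": hashlib.sha256(contents).hexdigest(),
                }
            )
    return rows
```

Because the content is never decoded, binary and text files are handled the same way. The resulting rows could then be turned into a table and written out, for instance with pyarrow's `Table.from_pylist` and `pyarrow.parquet.write_table`.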

In the future, if all notebooks and users were using "pip install" to pin a specific stable version of DPK, we would be free to make breaking changes while developing the next release without affecting all the users. I know you all are moving in this direction already.

Aug 28 '24 15:08 nirmdesai

Thanks for the clarifications, @nirmdesai. Since we now want to ingest more than just text files, doing Any2Parquet as a separate module is the best path forward.

In defense of what was done, the changes made to Code2Parquet were completely backward compatible with the old version and would not have broken any notebook or artifact if the name of the module and the path to its directory had not changed. At the same time, the name Code2Parquet was not appropriate if the module was handling both code and non-code text.

Having said that, let's move towards the new Any2Parquet.

Aug 28 '24 16:08 shahrokhDaijavad

@touma-I The code that Boris developed under this PR is valuable because it is a "generalization" of what we have with the current code2parquet. However, in order not to run into issues of backward compatibility with the notebooks/examples that use the current code2parquet, the most useful thing to do is to make changes to this PR, so that it keeps the current code2parquet as is (under the Code directory) and the new modified version by Boris, which should be renamed any2parquet, will be a separate transform under the universal directory.

Oct 14 '24 15:10 shahrokhDaijavad