refactoring code to parquet to zip2parquet
Why are these changes needed?
modified code2 parquet to support both code and language files and moved it to universal
Related issue number (if any).
https://github.com/IBM/data-prep-kit/issues/520
@blublinsky Is this new version completely backward compatible with the old version of code2parquet?
Yes. By default it will work exactly the same as current (all the current tests pass). If you specify code_data == False, it will not try to decide which programming language it is - just store the file content as is
Thanks, @blublinsky. Perfect! From a technical point of view, David needs to approve, but a note to myself that if this is approved, the example Jupyter notebook for code that Shivdeep has created should be modified with the new name and location of the transform.
i'm not sure I think this has to be backwards compatibile with code2parquet. As such, I would suggest the following:
- code is NOT the default
- There are a lot of configuration keys that are code-specific. Maybe we should have a single key and have its value be a dictionary like we do for DataAccessFactory. For example
{"programming_language_column" : "some name", }
Separately, we have discussed the ability to add configuration to specify the list of extensions to import - that is, filtering of sorts.
I would like to do more here than just generalize for code and .txt. Some more design seems needed.
'm not sure I think this has to be backwards compatibile with code2parquet. This was initial requirement from @shahrokhDaijavad
As such, I would suggest the following:
- code is NOT the default Does not matter to me
- There are a lot of configuration keys that are code-specific. Maybe we should have a single key and have its value be a dictionary like we do for DataAccessFactory. For example
{"programming_language_column" : "some name", }We need them for code supportSeparately, we have discussed the ability to add configuration to specify the list of extensions to import - that is, filtering of sorts. The last conversation with @nirmdesai was that we want all files
Why is this PR required in the first place? I would suggest to keep the code2parquet module as it is and add new modules as required. The code2parquet module is being used in many places and will break code flows. I would not support this PR.
Why is this PR required in the first place? I would suggest to keep the code2parquet module as it is and add new modules as required. The code2parquet module is being used in many places and will break code flows. I would not support this PR.
@Bytes-Explorer please take it up with @nirmdesai . It was his request
Ok, will clarify. Lets not merge this PR till then.
@Bytes-Explorer , @blublinsky, @touma-I : Team, since we have various notebooks and other artifacts that already depend on Code2Parquet, we cannot make breaking changes to this transform. It is on me that I did not explicitly clarify this earlier!
For now, it would be best to add Any2Parquet as a separate transform that can read any file content as binary and produce a parquet.
In future, if all notebooks / users were using "pip install" to use a specific stable version of DPK, we would be free to make breaking changes in developing the next release without affecting all the users. I know you all are moving in this direction already.
Thanks for the clarifications, @nirmdesai. Since we now want to ingest more than just text files, doing Any2Parquet as a separate module is the best path forward.
In defense of what was done, the changes done to the Code2Parquet were completely backward-compatible with the old version and would not have broken any notebook or artifact, if the name of the module and the path to its directory had not changed. At the same time, the name Code2Parquet was not appropriate, if it was handling both code and non-code text.
Having said that, let's move towards the new Any2Parquet.
@touma-I The code that Boris developed under this PR is valuable because it is a "generalization" of what we have with the current code2parquet. However, in order not to run into issues of backward compatibility with the notebooks/examples that use the current code2parquet, the most useful thing to do is to make changes to this PR, so that it keeps the current code2parquet as is (under the Code directory) and the new modified version by Boris, which should be renamed any2parquet, will be a separate transform under the universal directory.