
[Feature] Filter both the parquet and arrow files and update the metadata simultaneously

Open Hajar-Emami opened this issue 10 months ago • 12 comments

Search before asking

  • [x] I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

Filter the corresponding arrow files simultaneously in the parquet-level filtering pipeline, so that a separate tokenization step is not needed after filtering. This also streamlines the preparation of ablation data by eliminating the need to re-tokenize data that has already been tokenized.

How: Filter both the parquet and arrow files and update the metadata simultaneously. Here is a link to code that already does this:

https://github.ibm.com/ai-models-data/data-prep-kit-inner/blob/hajar_readability_scores/transforms/language/Readabilty%20Scores/Parquet_Arrow_Filtering.py

I include only the filtering parts (the original code also includes the annotation). The main functions are:

  • write_to_arrow_doc_files: writes the arrow files and meta files
  • filter_lowQual_Rscore_pq: filters the parquet files
  • filter_Rscore_arw: filters the arrow files
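For readers without access to the inner repo, the idea in rough form is: filter the parquet table on the quality score, then keep only the matching rows of the companion .arrow file and report the new row counts as metadata. Below is a minimal, self-contained sketch of that idea (not the linked code); the column names (rscore, document_id), the threshold, the one-to-one parquet/arrow pairing, and the Arrow IPC file format are all assumptions for illustration.

# Minimal sketch of the requested behavior (not the linked internal code).
# Assumptions: each parquet file has a matching .arrow file (Arrow IPC format)
# holding the same document ids, the parquet has an "rscore" quality column,
# and "document_id" is the shared id column. All of these names are illustrative.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

def filter_parquet_and_arrow(parquet_path, arrow_path, out_parquet, out_arrow,
                             doc_id_column="document_id", min_score=0.5):
    # 1. Filter the parquet table on the quality score and write it out.
    table = pq.read_table(parquet_path)
    kept = table.filter(pc.greater_equal(table.column("rscore"), min_score))
    pq.write_table(kept, out_parquet)

    # 2. Keep only the rows of the tokenized .arrow file whose ids survived.
    with pa.OSFile(arrow_path, "rb") as f:
        arrow_table = pa.ipc.open_file(f).read_all()
    kept_ids = pa.array(kept.column(doc_id_column).to_pylist())
    mask = pc.is_in(arrow_table.column(doc_id_column), value_set=kept_ids)
    filtered_arrow = arrow_table.filter(mask)
    with pa.OSFile(out_arrow, "wb") as f:
        with pa.ipc.new_file(f, filtered_arrow.schema) as writer:
            writer.write_table(filtered_arrow)

    # 3. Row counts that a caller could fold into the transform's metadata.
    return {"rows_in": table.num_rows, "rows_out": kept.num_rows}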

Please let me know if you have any questions. The main part of the code is straightforward. I’d be happy to schedule a call to explain the details.

cc: @shahrokhDaijavad @touma-I

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Hajar-Emami avatar Feb 11 '25 22:02 Hajar-Emami

Thanks, @Hajar-Emami.

@swith005 Can you please help with this? @Hajar-Emami has already created the code for doing this internally, but an open DPK transform for this will be a great contribution to both GneissWeb and DPK. Please discuss this with @Hajar-Emami. Thank you.

shahrokhDaijavad avatar Feb 11 '25 22:02 shahrokhDaijavad

@shahrokhDaijavad @touma-I This feature requires a transform to read a parquet file from one input folder and produce a filtered parquet file in an output folder, a filtered .arrow file in a second output folder, and updated .docs and .docs.ids files in a third output folder. The current DPK design of one input folder to one output folder doesn't seem to support these requirements. @daw3rd @blublinsky Is my understanding in the ballpark?
CC: @Hajar-Emami @cmadam

klwuibm avatar Mar 04 '25 15:03 klwuibm

@klwuibm I am ok with changing how the current transforms work. The architecture will eventually have to evolve to meet the needs of a broader set of users. The key questions are: 1) How many use cases could benefit from this type of change? (cc: @sujee, who may be able to weigh in.) 2) What is the overall impact on the architecture (i.e., effort required, number of modules impacted, etc.)?

@agoyal26 @shahrokhDaijavad for your awareness. This one-input-folder-to-one-output-folder architecture per transform works well but is very limiting. Are you hearing anything related to this from other users?

touma-I avatar Mar 05 '25 15:03 touma-I

@touma-I One additional comment. This use case is based on a couple of implicit assumptions about one specific tokenization transform implementation and its output folder structure. The current tokenization transform in the outer repo doesn't produce the corresponding .arrow, meta/*.docs, and meta/*.docs.ids files.

klwuibm avatar Mar 05 '25 20:03 klwuibm

@klwuibm Can we address that by combining the tokenization and the tokenization2arrow transforms into one?

touma-I avatar Mar 05 '25 21:03 touma-I

@klwuibm I don't know if you are aware of the tokenization2arrow transform that is already in the outer repo as PR #1033 and is about to be merged. So, what @touma-I is suggesting is that a possible solution to this issue could be combining these two transforms. The reason @Hajar-Emami created this issue is that in her GneissWeb recipe notebook, to use the extreme-tokenized transform, she had to filter both the parquet files and their corresponding arrow files, so she created the Parquet_Arrow_Filtering.py code she is referring to.

shahrokhDaijavad avatar Mar 05 '25 23:03 shahrokhDaijavad

@shahrokhDaijavad thanks for the pointer to PR #1033. I will start working with @Hajar-Emami on this issue.

@touma-I We need to assume that people have created their input/output folders with the to-be-merged tokenization2arrow transform. With that as the structure, we will extend the current filter transform, because the filtering criteria will be specified based on the parquet files. We need three input folder paths: one specified via "input_folder" to this new transform, and the other two passed in via additional input parameters. We also need three output folder paths: one via "output_folder", and the other two either derived internally following the structure assumed by the tokenization2arrow transform, or explicitly specified via additional input parameters. There are pros and cons for each approach, and we will evolve and iterate.

klwuibm avatar Mar 06 '25 21:03 klwuibm

Assumptions made in the current implementation:

  1. The features requested in this issue will be added directly into the existing filter transform, since the filtering criteria can only be specified based on the input parquet.
  2. Three more configuration parameters are added: input_arrow_folder, output_arrow_folder, and doc_id_column_name. The meta folder is assumed to be inside the arrow folder. If no input_arrow_folder is specified, only the parquet files will be filtered. The doc_id_column_name must match the column name used in the parquet; otherwise, the code will exit with an assertion error.
  3. The arrow folder and meta folder are assumed to be accessible with the same data access credentials as the input/output parquet folders.
  4. The code will be inserted before the filtered table is returned by the transform() method, including (a) constructing the .arrow, .docs, and .docs.ids file paths, (b) reading in the arrow and meta file contents, (c) adjusting the arrow and meta file contents based on the filtered table, and (d) finally writing out the adjusted arrow and meta file contents. A rough sketch of step (a) is shown below.
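To make step (a) concrete, here is a small sketch of the path construction, assuming the folder layout produced by the tokenization2arrow transform (arrow files mirroring the parquet file names, with the meta files in a meta/ subfolder inside the arrow folder); the helper name and the exact layout are assumptions, not a documented contract.

import os

# Hypothetical helper for step (a): map an input parquet path onto its companion
# .arrow, .docs, and .docs.ids paths. The mirrored layout (same relative path,
# meta/ subfolder inside the arrow folder) is an assumption for illustration.
def derive_companion_paths(parquet_path, input_folder, input_arrow_folder):
    rel = os.path.relpath(parquet_path, input_folder)    # e.g. "part-0.parquet"
    base = os.path.splitext(rel)[0]                       # e.g. "part-0"
    arrow_path = os.path.join(input_arrow_folder, base + ".arrow")
    docs_path = os.path.join(input_arrow_folder, "meta", base + ".docs")
    doc_ids_path = os.path.join(input_arrow_folder, "meta", base + ".docs.ids")
    return arrow_path, docs_path, doc_ids_path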

klwuibm avatar Mar 14 '25 15:03 klwuibm

In testing my current implementation, I realized that one of the more challenging errors was caused by the filter transform returning an empty table, producing an empty parquet file in the output folder. I have created bug #1139 and will fix it as part of the pull request to be created for this feature. @shahrokhDaijavad @touma-I Please assign #1139 to me.

klwuibm avatar Mar 15 '25 17:03 klwuibm

Thank you, @klwuibm. Your assumptions above look good to me. Also, thank you for finding the bug above and fixing it with your upcoming PR. I assigned #1139 to you.

shahrokhDaijavad avatar Mar 15 '25 17:03 shahrokhDaijavad

@revit13 I am enhancing the filter transform to add new features. The code works when I do run-python-cli-sample. I plan to add additional tests and expected the old tests to pass, since I didn't change the code related to filtering the parquet files. However, make test failed, and it seems that no data access object is created in test_filter.py. I added the following to the __init__ of the FilterTransform class, and it fails because self.data_access is None.

self.data_access: DataAccess = config.get("data_access", None)
if self.data_access is None:
    self.logger.error("data_access is not provided.")
===================================================================== test session starts ======================================================================
platform darwin -- Python 3.10.9, pytest-8.3.5, pluggy-1.5.0
rootdir: /Users/kun-lungwu-m1/data-prep-kit/transforms
configfile: pyproject.toml
plugins: cov-6.0.0, anyio-4.8.0
collecting ... 09:15:20 ERROR - data_access is not provided.
collected 6 items / 1 error                                                                                                                                    

============================================================================ ERRORS ============================================================================
____________________________________________________ ERROR collecting universal/filter/test/test_filter.py _____________________________________________________
test_filter.py:73: in get_test_transform_fixtures
    self.create_filter_test_fixture(
test_filter.py:59: in create_filter_test_fixture
    return FilterTransform(config), [input_df], [expected_output_df], expected_metadata_list
../dpk_filter/transform.py:104: in __init__
    self.parquet_input_folder = self.data_access.get_input_folder()
E   AttributeError: 'NoneType' object has no attribute 'get_input_folder'
=================================================================== short test summary info ====================================================================
ERROR test_filter.py::TestFilterTransform - AttributeError: 'NoneType' object has no attribute 'get_input_folder'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
======================================================================= 1 error in 0.95s =======================================================================
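One possible way to keep the old fixtures working, which construct FilterTransform from a plain config dict without a data access object, would be to make the new folder handling conditional. A sketch of that idea (not necessarily the design the maintainers want):

# Sketch: only require data_access when the new arrow/meta filtering is requested,
# so existing tests that build the transform without it keep working.
self.data_access = config.get("data_access", None)
self.input_arrow_folder = config.get("input_arrow_folder", None)
if self.input_arrow_folder is not None and self.data_access is None:
    raise ValueError("input_arrow_folder was specified but no data_access was provided")
if self.data_access is not None:
    self.parquet_input_folder = self.data_access.get_input_folder()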

@touma-I @shahrokhDaijavad

klwuibm avatar Mar 17 '25 14:03 klwuibm

@touma-I @shahrokhDaijavad I have fixed the tests. The enhanced FilterTransform class passes the original tests, except test_filter_spark.py, which raises a question: are we supporting Spark? If yes, there is no config/spark_profile_local.yml in the repo, which causes the Spark launch to fail and test_filter_spark.py to fail. If not, should we comment out the Spark-related tests?

I have added a test case to test_support.py so that both the Python and Ray runtimes are tested against the new case. This new test case only verifies the expected parquet output, not the resulting arrow and meta files. I am not sure whether the test infrastructure can be used to check expected arrow and meta files.
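If the standard test fixtures cannot compare the arrow and meta outputs directly, a test could fall back to an explicit pyarrow comparison. A small sketch (the helper name is hypothetical, and it assumes the .arrow files are in Arrow IPC file format):

import pyarrow as pa

# Hypothetical helper: compare a produced .arrow file against an expected one.
def assert_arrow_files_equal(actual_path, expected_path):
    with pa.OSFile(actual_path, "rb") as f:
        actual = pa.ipc.open_file(f).read_all()
    with pa.OSFile(expected_path, "rb") as f:
        expected = pa.ipc.open_file(f).read_all()
    assert actual.schema.equals(expected.schema)
    assert actual.equals(expected)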

klwuibm avatar Mar 19 '25 12:03 klwuibm