Kun-Lung Wu
@shahrokhDaijavad @touma-I This feature requires a `transform` that reads a parquet file from an input folder and produces a filtered parquet file in an output folder, a filtered .arrow file...
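A minimal sketch of that flow, assuming pyarrow and a hypothetical `doc_quality` score column as the filtering criterion (the real criteria and folder layout would come from the transform's configuration):

```python
import os

import pyarrow.compute as pc
import pyarrow.feather as feather
import pyarrow.parquet as pq


def filter_one_file(input_path: str, output_dir: str, threshold: float = 0.5) -> None:
    table = pq.read_table(input_path)
    # Keep only rows whose (hypothetical) doc_quality score exceeds the threshold.
    filtered = table.filter(pc.greater(table.column("doc_quality"), threshold))
    base = os.path.splitext(os.path.basename(input_path))[0]
    os.makedirs(output_dir, exist_ok=True)
    # Write the filtered parquet file into the output folder ...
    pq.write_table(filtered, os.path.join(output_dir, base + ".parquet"))
    # ... and the same rows as a filtered .arrow (Arrow IPC / Feather v2) file.
    feather.write_feather(filtered, os.path.join(output_dir, base + ".arrow"))
```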
@touma-I One additional comment. This use case is based on a couple of implicit assumptions from one specific tokenization transform implementation and its output folder structure. In the current...
@shahrokhDaijavad thanks for the pointer to PR #1033. I will start working with @Hajar-Emami on this issue. @touma-I We need to assume that people have created...
Assumptions made in the current implementation:
1. The features requested in this issue will be added directly into the existing `filter` transform, since the filtering criteria can only be specified...
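For illustration only, here is one way SQL-style criteria strings could be applied to a pyarrow table with duckdb; the function and parameter names below are mine, not the transform's actual API:

```python
import duckdb
import pyarrow as pa


def apply_criteria(table: pa.Table, criteria: list[str], logical_op: str = "AND") -> pa.Table:
    # Combine the individual criteria strings into a single WHERE clause.
    where_clause = f" {logical_op} ".join(f"({c})" for c in criteria)
    con = duckdb.connect()
    con.register("input_table", table)  # query the pyarrow table as a view
    return con.execute(f"SELECT * FROM input_table WHERE {where_clause}").arrow()


# e.g. apply_criteria(tbl, ["lang_score > 0.8", "num_tokens < 4096"])
```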
In testing my current implementation, I realized that one of the more challenging errors was caused by the filter transform returning an empty table, which produced an empty parquet file in the...
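A minimal sketch of the guard I have in mind, assuming the framework allows a transform to return no tables for an input file (the names here are illustrative):

```python
import pyarrow as pa


def filter_table(table: pa.Table, mask) -> list[pa.Table]:
    filtered = table.filter(mask)
    if filtered.num_rows == 0:
        # Return no output rather than an empty table, so the framework
        # does not write a zero-row parquet file for this input.
        return []
    return [filtered]
```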
@revit13 I am enhancing the `filter` transform to add new features. The code works when I run `run-python-cli-sample`. I plan to add additional tests, and expect the old tests...
@touma-I @shahrokhDaijavad I have fixed the tests. The enhanced `FilterTransform` class passes all the original tests except `test_filter_spark.py`, which raises a question: are we supporting Spark? If yes, there is no...
A simple transform is available in `transforms/language/text_encoder`.
I have created a fork and branch to adapt the `text_encoder` to run on GPUs for at-scale operations: https://github.com/klwuibm/data-prep-kit/blob/text_encoder/transforms/language/text_encoder/dpk_text_encoder/transform.py. @ian-cho
Additional work:
1. `requirements.txt` needs `torch` (a CUDA-enabled PyTorch build) and `sentence_transformers`; a minimal GPU sketch follows this list.
2. If the GPU clusters can use a container image, then we need to ensure the Dockerfile has the proper packages installed.
3. ...
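A hedged sketch of the GPU path, assuming the transform wraps a `sentence_transformers` model; the model name and batch size below are placeholders, not what the transform currently uses:

```python
import torch
from sentence_transformers import SentenceTransformer

# Fall back to CPU when no GPU is visible, so the same code runs everywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=device)


def encode_batch(texts: list[str]) -> list[list[float]]:
    # encode() batches internally and runs on the selected device; a larger
    # batch_size generally improves GPU utilization.
    return model.encode(texts, batch_size=64, show_progress_bar=False).tolist()
```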