data-prep-kit [Feature] New DPK transform to get the distributions of quality metrics

Search before asking

[x] I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

It would be beneficial to have a new DPK transform to capture the distribution of any desired column, making it easier to analyze data and set thresholds. It would be great to include this visualization transform in our GneissWeb recipe notebook. This allows users to annotate their datasets and view the distribution of quality metrics. They can also change the filtering thresholds based on the distributions.

How: Define buckets, count the number of documents that fall into each bucket, and save the result as a CSV file. I pushed my code here:

https://github.ibm.com/ai-models-data/data-prep-kit-inner/blob/hajar_Extreme_tokenized/transforms/language/Extreme_tokenized_docs/Distibution_ray.py

Notebook: https://github.ibm.com/ai-models-data/data-prep-kit-inner/blob/hajar_Extreme_tokenized/transforms/language/Extreme_tokenized_docs/distributions_stats.ipynb

The main part of the transform is quite short and straightforward.
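As a rough illustration of the bucketing step described above (this is not the code in the linked branch), here is a minimal standard-library sketch. The function names `bucket_counts` and `write_distribution_csv`, the CSV column names, and the sample scores and bucket edges are all hypothetical, chosen only for this example.

```python
import csv
import io
from bisect import bisect_right


def bucket_counts(values, edges):
    """Count values per half-open bucket [edges[i], edges[i+1]).

    `edges` must be sorted in ascending order. Values outside
    [edges[0], edges[-1]) are skipped in this sketch.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_right(edges, v) - 1  # index of the bucket containing v
        if 0 <= i < len(counts):
            counts[i] += 1
    return [(edges[i], edges[i + 1], counts[i]) for i in range(len(counts))]


def write_distribution_csv(rows, fileobj):
    """Write (bucket_start, bucket_end, num_docs) rows to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(["bucket_start", "bucket_end", "num_docs"])  # hypothetical header
    writer.writerows(rows)


# Hypothetical per-document quality scores and bucket edges.
scores = [0.05, 0.12, 0.31, 0.33, 0.58, 0.91, 0.99]
edges = [0.0, 0.25, 0.5, 0.75, 1.0]

rows = bucket_counts(scores, edges)

# An in-memory buffer stands in for the output CSV file here.
buf = io.StringIO()
write_distribution_csv(rows, buf)
csv_text = buf.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")`. A real transform would compute the counts per input file and then merge them, which this sketch does not attempt.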

cc: @shahrokhDaijavad @touma-I

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

Feb 11 '25 22:02 Hajar-Emami

Thanks, @Hajar-Emami. Let me think about whom to ask for help with this.

Feb 11 '25 22:02 shahrokhDaijavad

Would you be interested, @Param-S? I know you were looking at quality stats for code data during the code data delivery.

Feb 12 '25 09:02 agoyal26

Can we think of this as an analytics transform with a set of supported functions? Column distribution is one; another is computing the correlation between two columns.

Let's say it supports two functions defined as:

distribution(['column1', 'column2'])
correlation('column1', 'column2')

In the future, we can extend the list of supported functions.
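To make the single-transform option concrete, here is a minimal sketch of one entry point that dispatches to a registry of supported functions. Everything in it is an assumption made for illustration: the names `SUPPORTED_FUNCTIONS` and `run_analytics`, the row-of-dicts input format, and the parameter names are hypothetical and are not part of the DPK API.

```python
from bisect import bisect_right
from math import sqrt


def distribution(rows, columns, edges):
    """Per-column counts of values in each half-open bucket [edges[i], edges[i+1])."""
    result = {}
    for col in columns:
        counts = [0] * (len(edges) - 1)
        for row in rows:
            i = bisect_right(edges, row[col]) - 1
            if 0 <= i < len(counts):  # out-of-range values are skipped
                counts[i] += 1
        result[col] = counts
    return result


def correlation(rows, column1, column2):
    """Pearson correlation between two numeric columns.

    This sketch does not handle constant columns (zero variance).
    """
    xs = [row[column1] for row in rows]
    ys = [row[column2] for row in rows]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical registry: supporting a new function means adding one entry.
SUPPORTED_FUNCTIONS = {
    "distribution": distribution,
    "correlation": correlation,
}


def run_analytics(rows, function, **params):
    """Look up the requested function by name and apply it to the rows."""
    if function not in SUPPORTED_FUNCTIONS:
        raise ValueError(f"Unsupported analytics function: {function!r}")
    return SUPPORTED_FUNCTIONS[function](rows, **params)


# Hypothetical annotated documents.
docs = [
    {"num_tokens": 10, "quality": 0.1},
    {"num_tokens": 20, "quality": 0.2},
    {"num_tokens": 30, "quality": 0.3},
    {"num_tokens": 40, "quality": 0.4},
]

dist = run_analytics(docs, "distribution", columns=["quality"], edges=[0.0, 0.25, 0.5])
corr = run_analytics(docs, "correlation", column1="num_tokens", column2="quality")
print(dist, corr)
```

With this shape, the added complexity is mostly confined to the dispatcher and the per-function parameter handling; each function itself stays as small as it would be in a standalone transform.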

The trade-off here is the complexity of one such transform versus having a separate transform for each function. Happy to hear your thoughts.

Mar 05 '25 05:03 Harmedox