[Feature] New DPK transform to get the distributions of quality metrics
Search before asking
- [x] I searched the issues and found no similar issues.
Component
Transforms/Other
Feature
It would be beneficial to have a new DPK transform that captures the distribution of any desired column, making it easier to analyze data and set thresholds. It would be great to include this visualization transform in our GneissWeb recipe notebook. This would allow users to annotate their datasets, view the distribution of quality metrics, and adjust the filtering thresholds based on those distributions.
How: Define buckets, count the number of docs per bucket, and save the result as a CSV file. I pushed my code here:
https://github.ibm.com/ai-models-data/data-prep-kit-inner/blob/hajar_Extreme_tokenized/transforms/language/Extreme_tokenized_docs/Distibution_ray.py
Notebook: https://github.ibm.com/ai-models-data/data-prep-kit-inner/blob/hajar_Extreme_tokenized/transforms/language/Extreme_tokenized_docs/distributions_stats.ipynb
The main part of the transform is quite short and straightforward.
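For readers without access to the linked code, the bucketing step described above could look roughly like the sketch below. This is not the code from the linked branch: the function names (`bucket_counts`, `write_distribution_csv`), the half-open bucket convention, and the CSV column names are all assumptions made for illustration, and it works on a plain list of values rather than the DPK table interface.

```python
import csv
from bisect import bisect_right


def bucket_counts(values, edges):
    """Count values per half-open bucket [edges[i], edges[i+1]).

    `edges` must be sorted in ascending order. Values outside
    [edges[0], edges[-1]) are tallied separately instead of being dropped.
    (Hypothetical helper, not part of DPK.)
    """
    counts = [0] * (len(edges) - 1)
    out_of_range = 0
    for v in values:
        # Index of the bucket whose left edge is the largest edge <= v.
        i = bisect_right(edges, v) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
        else:
            out_of_range += 1
    rows = [(edges[i], edges[i + 1], counts[i]) for i in range(len(counts))]
    return rows, out_of_range


def write_distribution_csv(rows, path):
    """Write one CSV row per bucket (column names are assumptions)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["bucket_start", "bucket_end", "num_docs"])
        writer.writerows(rows)


if __name__ == "__main__":
    # Toy quality scores standing in for one annotated column.
    scores = [0.1, 0.25, 0.5, 0.9]
    rows, skipped = bucket_counts(scores, [0.0, 0.25, 0.5, 0.75, 1.0])
    write_distribution_csv(rows, "distribution.csv")
```

Keeping the bucket edges as an explicit parameter means the same edges can later be reused as candidate filtering thresholds.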
cc: @shahrokhDaijavad @touma-I
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Thanks, @Hajar-Emami. Let me think about whom to ask for help with this.
Would you be interested, @Param-S? I know you were looking at quality stats for code data during code data delivery.
Can we think of this as an analytics transform with a set of supported functions? Column distribution is one; another is computing the correlation between two columns.
Let's say it supports two functions defined as:
- distribution['column1', 'column2']
- correlation(column1, column2)
In the future, we can extend the list of supported functions.
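To make the single-transform option concrete, here is a minimal sketch of a function registry that dispatches to `distribution` and `correlation`. Everything in it is an assumption for discussion: the registry name `SUPPORTED_FUNCTIONS`, the `run_analytics` entry point, equal-width buckets, and Pearson correlation are illustrative choices, and rows are plain dicts rather than the tables a real DPK transform would receive.

```python
from math import sqrt


def distribution(rows, columns, num_buckets=10):
    """Equal-width histogram per column: {column: [count per bucket]}."""
    result = {}
    for col in columns:
        values = [row[col] for row in rows]
        lo, hi = min(values), max(values)
        # Fall back to a width of 1.0 when the column is constant.
        width = (hi - lo) / num_buckets or 1.0
        counts = [0] * num_buckets
        for v in values:
            # Clamp so the maximum value lands in the last bucket.
            i = min(int((v - lo) / width), num_buckets - 1)
            counts[i] += 1
        result[col] = counts
    return result


def correlation(rows, column1, column2):
    """Pearson correlation between two numeric columns."""
    xs = [row[column1] for row in rows]
    ys = [row[column2] for row in rows]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant column
    return cov / sqrt(var_x * var_y)


# Hypothetical registry: supporting a new function means adding one entry.
SUPPORTED_FUNCTIONS = {
    "distribution": distribution,
    "correlation": correlation,
}


def run_analytics(rows, function, **kwargs):
    """Dispatch to one of the supported analytics functions by name."""
    if function not in SUPPORTED_FUNCTIONS:
        raise ValueError(f"unsupported function: {function!r}")
    return SUPPORTED_FUNCTIONS[function](rows, **kwargs)
```

A call would then look like `run_analytics(rows, "correlation", column1="a", column2="b")`. The registry keeps the transform itself small, but each function still needs its own parameter validation.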
The trade-off here is the complexity of a single multi-function transform versus having a separate transform for each function. Happy to hear your thoughts.