data-prep-kit New Transform Modules

New "enrichment" transform that computes a number of features that can be later used to estimate the data quality.

New "LM filter" transform that filters the data using conditions specified in a yaml file.

Mar 17 '25 17:03 cpendus

Thanks, @cpendus. Please provide more details: what are the new features in the enrichment transform, and have you made sure they are not already covered by existing transforms? Have you seen this work? https://github.com/data-prep-kit/data-prep-kit/issues/1045 Is your proposal along similar lines?

Mar 18 '25 05:03 agoyal26

@cpendus @agoyal26 I don't think #1045, which is about a tool to create distributions of quality metrics, overlaps with this. In fact, #1045 can use the output of "enrichment". However, we have another document quality transform, https://github.com/data-prep-kit/data-prep-kit/tree/dev/transforms/language/doc_quality, that also supports a few languages. So I think a paragraph about exactly what "enrichment" does would be very useful.

Mar 18 '25 16:03 shahrokhDaijavad

The features added are enumerated in the README for the module, with (hopefully) self-explanatory names.

Some of these features will probably overlap with other modules. For example, it may well be that another module already implements 'avg_word_length', but it is more efficient to have all related features together.

One thing to be noted is that all these features are computed per row, so the module requires only one pass.
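
To illustrate the per-row point, here is a minimal sketch, not the module's actual code. Only 'avg_word_length' is a feature name mentioned in this thread; 'num_words', the 'text' column, and the function names are hypothetical. Because each feature depends only on the row itself, a single pass over the data is enough.

```python
def enrich_row(text: str) -> dict:
    """Compute features that depend only on this row's text."""
    words = text.split()
    num_words = len(words)  # hypothetical feature name, for illustration
    # Guard against empty documents to avoid dividing by zero.
    avg_word_length = sum(len(w) for w in words) / num_words if num_words else 0.0
    return {"num_words": num_words, "avg_word_length": avg_word_length}


def enrich(rows: list[dict], text_column: str = "text") -> list[dict]:
    """Single pass: every row is enriched independently of all other rows."""
    return [{**row, **enrich_row(row[text_column])} for row in rows]
```

Since no feature needs a corpus-wide statistic, adding one more feature to 'enrich_row' does not add another pass.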

Mar 18 '25 18:03 cpendus

Thanks, @cpendus. OK, now I have read the README carefully and I understand all the columns that are being added. Your notebook makes it even clearer! Question about the language column: what are the possible languages? Later on, in your ml_filter transform, you specify one filter condition for English and another for French. Is there a list of all the possible languages that these two transforms support?

Mar 18 '25 19:03 shahrokhDaijavad

The language id is an arbitrary input parameter. We gave it special meaning only because the filter conditions are grouped by it, for readability. For example, 'avg_word_length' would be expected to be larger for German than for English. We also have catch-alls: 'default' for conditions that apply to all languages and 'other' for languages not explicitly mentioned.

So practically any language (or any other label for that matter) is supported.
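
A rough sketch of how that grouping could be resolved, under stated assumptions: the dict below stands in for the parsed yaml file, and its layout, the thresholds, the 'lang' and 'num_words' column names, and the function names are all hypothetical, not the transform's real schema. Only the 'default' and 'other' groups and 'avg_word_length' come from the description above.

```python
import operator

# Comparison operators a condition may use (assumed set, for illustration).
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# Stand-in for the parsed yaml file; every threshold here is made up.
CONDITIONS = {
    "default": [("num_words", ">=", 3)],        # applies to all languages
    "en": [("avg_word_length", "<=", 8.0)],
    "de": [("avg_word_length", "<=", 12.0)],    # German words tend to be longer
    "other": [("avg_word_length", "<=", 15.0)], # languages not listed above
}


def conditions_for(lang: str, conditions: dict) -> list:
    """'default' conditions plus the language's group, falling back to 'other'."""
    group = lang if lang in conditions else "other"
    return conditions.get("default", []) + conditions.get(group, [])


def keep_row(row: dict, conditions: dict, lang_column: str = "lang") -> bool:
    """A row is kept only if every applicable condition holds."""
    return all(
        OPS[op](row[column], threshold)
        for column, op, threshold in conditions_for(row[lang_column], conditions)
    )
```

With this layout the language id is just a dictionary key, which is why any label works: an unlisted label simply falls through to 'other'.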

Mar 18 '25 19:03 cpendus

filter and enrichment transforms available

Jun 24 '25 20:06 swith005