Develop the testing strategy for new NLP modules being added to the kit
Search before asking
- [X] I searched the issues and found no similar issues.
Component
Transforms/Other
Feature
We are adding document-quality and spoken-language-ID NLP modules, plus new code modules for HAP, license filtering, and PII, to the kit, and we need testing similar to (or better than!) what was done for the initial set of code modules.
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
The two new NLP modules, lang_id and doc_quality, are being merged. I have already tested lang_id as a unit test (3 test files on a local Mac). Both of these transforms are already being tested regularly on a large cluster as part of the inner repo team's Pipelines testing, so we do not need a cluster testing strategy. For local testing (and for inclusion in a new corresponding Notebook example), it would make sense to identify a small set of input files for which these transforms produce meaningfully observable output (see the sketch below for the flavor of such a check). I will work with Hamid and Dhiraj to identify such a set.
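As a rough illustration of the kind of local check intended here, a unit test could run a transform over a handful of small input files and assert that the output is meaningfully observable. This is only a minimal sketch: the import path, the `LangIdTransform` class, its `content_column_name` parameter, the sample file paths, and the expected `lang` output column are assumptions for illustration, not the kit's actual API.

```python
# Minimal sketch of a local unit test for an NLP transform.
# Names below (LangIdTransform, content_column_name, "lang", file paths)
# are illustrative assumptions, not the kit's actual API.
import pyarrow.parquet as pq
import pytest

from lang_id.transform import LangIdTransform  # hypothetical import path

TEST_FILES = [
    "test-data/input/sample_en.parquet",
    "test-data/input/sample_de.parquet",
    "test-data/input/sample_ja.parquet",
]

@pytest.mark.parametrize("path", TEST_FILES)
def test_lang_id_adds_language_column(path):
    table = pq.read_table(path)
    transform = LangIdTransform({"content_column_name": "contents"})
    out_tables, _stats = transform.transform(table)

    out = out_tables[0]
    # The output should gain a populated language column, i.e. the small
    # input set produces meaningfully observable output.
    assert "lang" in out.column_names
    assert out.num_rows == table.num_rows
    assert all(v is not None for v in out.column("lang").to_pylist())
```

The same pattern (tiny, curated inputs plus assertions on the added columns) could be reused for doc_quality and wrapped into the corresponding Notebook example.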
@shahrokhDaijavad we already have doc_quality and its tests; are they not sufficient? I'm not sure about the other transforms. If there is a problem with a specific transform's test data, maybe we need separate issues?
@daw3rd I think we should close this as an overarching issue. The language modules have been tested individually, and we are creating Notebooks similar to what we have for Code to run them sequentially, which can also be used for testing.
@shahrokhDaijavad OK to close?
Yes, we can close this, @agoyal26.