Develop the testing strategy for new NLP modules being added to the kit
Search before asking
- [X] I searched the issues and found no similar issues.
Component
Transforms/Other
Feature
We are adding document-quality and spoken-language-ID NLP modules, plus new code modules for HAP, license filtering, and PII, to the kit, and we need testing similar to (or better than!) what was done for the initial set of code modules.
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
The two new NLP modules, lang_id and doc_quality, are being merged. I have already tested lang_id as a unit test (3 test files on a local Mac). Both of these transforms are already being tested regularly on a large cluster as part of the inner repo team's Pipelines testing, so we do not need a cluster testing strategy. For local testing (and for inclusion in a new corresponding Notebook example), it would make sense to identify a small set of input files for which these transforms produce meaningfully observable output (see the sketch below for the flavor of such a check). I will work with Hamid and Dhiraj to identify such a set.
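As a rough illustration of the kind of local check intended here, a unit test could run a transform over a handful of small input files and assert that the output is meaningfully observable. This is only a minimal sketch: the import path, the `LangIdTransform` class, its `content_column_name` parameter, the sample file paths, and the expected `lang` output column are assumptions for illustration, not the kit's actual API.

```python
# Minimal sketch of a local unit test for an NLP transform.
# Names below (LangIdTransform, content_column_name, "lang", file paths)
# are illustrative assumptions, not the kit's actual API.
import pyarrow.parquet as pq
import pytest

from lang_id.transform import LangIdTransform  # hypothetical import path

TEST_FILES = [
    "test-data/input/sample_en.parquet",
    "test-data/input/sample_de.parquet",
    "test-data/input/sample_ja.parquet",
]

@pytest.mark.parametrize("path", TEST_FILES)
def test_lang_id_adds_language_column(path):
    table = pq.read_table(path)
    transform = LangIdTransform({"content_column_name": "contents"})
    out_tables, _stats = transform.transform(table)

    out = out_tables[0]
    # The output should gain a populated language column, i.e. the small
    # input set produces meaningfully observable output.
    assert "lang" in out.column_names
    assert out.num_rows == table.num_rows
    assert all(v is not None for v in out.column("lang").to_pylist())
```

The same pattern (tiny, curated inputs plus assertions on the added columns) could be reused for doc_quality and wrapped into the corresponding Notebook example.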
@shahrokhDaijavad we already have doc_quality and its tests; are they not sufficient? I'm not sure about the other transforms. If there is a problem with a specific transform's test data, maybe we need separate issues?
@daw3rd I think we should close this as an overarching issue. The language modules have been tested individually, and we are creating Notebooks similar to what we have for Code to run them sequentially, which can also be used for testing.
@shahrokhDaijavad OK to close?
Yes, we can close this, @agoyal26.