ExplainaBoard
More Comprehensive Test Suites
This is a draft doc that maintains the design of deeper tests for ExplainaBoard, so that potential hidden bugs can be caught when we do major refactoring.
Generally, from the task perspective, the following tests would be added gradually for each task (a rough sketch of such a per-task test follows the list):
- test that training-set features are automatically disabled when there is no training set
- test that training-set-dependent features are added automatically and that the bucketing values are correct
- test that the number of buckets is correct for a given bucket feature
- test that the CLI of each task works correctly
- test that user_defined_feature works
- test that user_defined_metadata works
- test that the specified metric works
- test that evaluation speed is normal (this is useful when we make significant architectural modifications)
- test supported metrics
- test error cases
  - for example, sometimes we don't need to print all cases (e.g., sequence labeling); otherwise, the analysis report would be too large.
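
As a rough illustration of what one of these per-task tests could look like, here is a minimal sketch in the style of ExplainaBoard's public API (`get_loader`, `get_processor`, `TaskType`, as in the README). The exact signatures, the data path, and the structure of the returned analysis are assumptions and may differ between versions; treat this as a sketch, not the actual test.

```python
import unittest

# A minimal sketch, not a definitive implementation: the import names follow
# ExplainaBoard's README, but exact signatures may differ between versions,
# and the system-output path below is a placeholder.
from explainaboard import TaskType, get_loader, get_processor


class TestTextClassification(unittest.TestCase):
    def test_end_to_end(self):
        # Load a small system-output file shipped with the test suite.
        loader = get_loader(
            TaskType.text_classification,
            data="./data/system_outputs/sst2/sst2-lstm-output.tsv",  # placeholder path
        )
        data = loader.load()
        self.assertGreater(len(data), 0)

        # Run the processor without a training set: training-set-dependent
        # features should be disabled rather than causing a crash.
        processor = get_processor(TaskType.text_classification)
        analysis = processor.process(metadata={}, sys_output=data)
        self.assertIsNotNone(analysis)

        # Further assertions from the checklist would go here, e.g. checking
        # that the number of buckets for a bucket feature matches the
        # configured value once the report structure is settled.


if __name__ == "__main__":
    unittest.main()
```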
Named Entity Recognition
- [x] others
- [ ] test whether training-set-dependent features work on other NER datasets
- [ ] test whether user_defined_feature works
- [x] test whether user_defined_metadata works (it seems the current data loader doesn't support this?)
Bug (Issue) Identification
For example:
- the training-set-dependent features of the NER task cannot work
- the number of buckets is misused (the number of predicted entities is printed instead)
- samples are stored in an inappropriate way (all samples are stored, which makes the report file too large)
- there is a typo in the unittest file (line 17)?
- the user_defined_metadata features cannot work
Word Segmentation
- [x] add the msr dataset into DataLab and introduce a DataLab loader into ExplainaBoard (see the loading sketch below)
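
A quick sanity check for this item could pull the dataset through DataLab directly; DataLab mirrors the Hugging Face `datasets` loading API, but whether the dataset is registered under exactly the name `msr`, and what its split/field names are, is assumed here.

```python
# A sketch, assuming DataLab registers the word segmentation dataset as "msr";
# the split and field names are not verified here.
from datalabs import load_dataset

dataset = load_dataset("msr")
print(dataset["train"][0])  # inspect one example before wiring up the ExplainaBoard loader
```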
KG
- [x] refactor the dataloader (support preprocessing; a rough sketch of the preprocessing hook is below)
- [x] support DataLab
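
One possible shape for the "support preprocessing" part of the loader refactor, sketched with hypothetical class and field names rather than ExplainaBoard's actual dataloader interface:

```python
from typing import Callable, Iterable, Optional

# Hypothetical sketch of a KG loader that accepts an optional preprocessing
# hook; the class and field names are illustrative, not ExplainaBoard's API.
class KGLinkPredictionLoader:
    def __init__(self, path: str, preprocess: Optional[Callable[[dict], dict]] = None):
        self.path = path
        self.preprocess = preprocess or (lambda example: example)

    def load(self) -> Iterable[dict]:
        examples = []
        with open(self.path, encoding="utf8") as fin:
            for line in fin:
                head, relation, tail = line.rstrip("\n").split("\t")
                example = {"head": head, "relation": relation, "tail": tail}
                # Apply the preprocessing hook (e.g., entity-ID canonicalization).
                examples.append(self.preprocess(example))
        return examples


# Usage: normalize entity IDs while loading (the file path is a placeholder).
loader = KGLinkPredictionLoader(
    "triples.tsv", preprocess=lambda ex: {**ex, "head": ex["head"].lower()}
)
```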
I think this is great! If we have all these tests, we can be more confident when we merge PRs. You might have already thought of this but just one general comment is that I think it may take quite some time to cover all of these (e.g. all the metrics for all the tasks) so maybe we can write more unit tests for the sub-modules of the system? For example, we can test bucketing by itself without actually going through the loading -> processing process. But I think end-to-end tests can still be very useful to ensure the more important test cases work.
Aha, thanks! This ("For example, we can test bucketing by itself without actually going through the loading -> processing process") makes sense! I think I will do both. I find task-by-task testing pretty rewarding; it also pushes me to examine things deeply, just like a user would.
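
To make the "test bucketing by itself" idea concrete, here is a self-contained sketch. The `bucket_equal_width` function is only a stand-in so the test can run on its own; the real unit test would import ExplainaBoard's bucketing module instead.

```python
import unittest


def bucket_equal_width(values, n_buckets):
    """Stand-in equal-width bucketing: split [min, max] into n_buckets
    intervals and return the values falling into each interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0  # avoid zero width when all values are equal
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        idx = min(int((v - lo) / width), n_buckets - 1)
        buckets[idx].append(v)
    return buckets


class TestBucketing(unittest.TestCase):
    def test_bucket_number_and_coverage(self):
        values = [0.1, 0.4, 0.5, 0.9, 1.3, 2.0]
        buckets = bucket_equal_width(values, n_buckets=4)
        # The number of buckets should match the requested bucket number ...
        self.assertEqual(len(buckets), 4)
        # ... and no sample should be lost or duplicated by bucketing.
        self.assertEqual(sum(len(b) for b in buckets), len(values))


if __name__ == "__main__":
    unittest.main()
```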