ExplainaBoard
More Comprehensive Test Suites
This is a draft doc that maintains the design of deeper tests for ExplainaBoard, so that potential hidden bugs can be caught when we do major refactoring.
Generally, from the task perspective, the following tests would be added gradually for each task (a rough sketch of such a per-task test follows the list):
- test that training-set features are automatically disabled when there is no training set
- test that training-set-dependent features are added automatically and that the bucketing values are correct
- test that the number of buckets is correct for a given bucket feature
- test that the CLI of each task works correctly
- test that user_defined_feature works
- test that user_defined_metadata works
- test that the specified metric works
- test that evaluation speed is normal (this is useful when we make significant architectural modifications)
- test supported metrics
- test error cases
  - for example, sometimes we don't need to print all cases (e.g., sequence labeling); otherwise, the analysis report would be too large.
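
As a rough illustration of what one of these per-task tests could look like, here is a minimal sketch in the style of ExplainaBoard's public API (`get_loader`, `get_processor`, `TaskType`, as in the README). The exact signatures, the data path, and the structure of the returned analysis are assumptions and may differ between versions; treat this as a sketch, not the actual test.

```python
import unittest

# A minimal sketch, not a definitive implementation: the import names follow
# ExplainaBoard's README, but exact signatures may differ between versions,
# and the system-output path below is a placeholder.
from explainaboard import TaskType, get_loader, get_processor


class TestTextClassification(unittest.TestCase):
    def test_end_to_end(self):
        # Load a small system-output file shipped with the test suite.
        loader = get_loader(
            TaskType.text_classification,
            data="./data/system_outputs/sst2/sst2-lstm-output.tsv",  # placeholder path
        )
        data = loader.load()
        self.assertGreater(len(data), 0)

        # Run the processor without a training set: training-set-dependent
        # features should be disabled rather than causing a crash.
        processor = get_processor(TaskType.text_classification)
        analysis = processor.process(metadata={}, sys_output=data)
        self.assertIsNotNone(analysis)

        # Further assertions from the checklist would go here, e.g. checking
        # that the number of buckets for a bucket feature matches the
        # configured value once the report structure is settled.


if __name__ == "__main__":
    unittest.main()
```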
Named Entity Recognition
- [x] others
- [ ] test whether training-set-dependent features work on other NER datasets
- [ ] test whether user_defined_feature works
- [x] test whether user_defined_metadata works (it seems the current data loader doesn't support this?)
Bug (Issue) Identification
For example:
- the training-set-dependent features of the NER task cannot work
- the number of buckets is misused (the number of predicted entities is printed instead)
- samples are stored in an inappropriate way (all samples are stored, which makes the report file too large)
- there is a typo in the unittest file (line 17)?
- the user_defined_metadata features cannot work
Word Segmentation
- [x] add the msr dataset into DataLab and introduce a DataLab loader into ExplainaBoard (see the loading sketch below)
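
A quick sanity check for this item could pull the dataset through DataLab directly; DataLab mirrors the Hugging Face `datasets` loading API, but whether the dataset is registered under exactly the name `msr`, and what its split/field names are, is assumed here.

```python
# A sketch, assuming DataLab registers the word segmentation dataset as "msr";
# the split and field names are not verified here.
from datalabs import load_dataset

dataset = load_dataset("msr")
print(dataset["train"][0])  # inspect one example before wiring up the ExplainaBoard loader
```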
KG
- [x] refactor the dataloader (support preprocessing; a rough sketch of the preprocessing hook is below)
- [x] support DataLab
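
One possible shape for the "support preprocessing" part of the loader refactor, sketched with hypothetical class and field names rather than ExplainaBoard's actual dataloader interface:

```python
from typing import Callable, Iterable, Optional

# Hypothetical sketch of a KG loader that accepts an optional preprocessing
# hook; the class and field names are illustrative, not ExplainaBoard's API.
class KGLinkPredictionLoader:
    def __init__(self, path: str, preprocess: Optional[Callable[[dict], dict]] = None):
        self.path = path
        self.preprocess = preprocess or (lambda example: example)

    def load(self) -> Iterable[dict]:
        examples = []
        with open(self.path, encoding="utf8") as fin:
            for line in fin:
                head, relation, tail = line.rstrip("\n").split("\t")
                example = {"head": head, "relation": relation, "tail": tail}
                # Apply the preprocessing hook (e.g., entity-ID canonicalization).
                examples.append(self.preprocess(example))
        return examples


# Usage: normalize entity IDs while loading (the file path is a placeholder).
loader = KGLinkPredictionLoader(
    "triples.tsv", preprocess=lambda ex: {**ex, "head": ex["head"].lower()}
)
```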
I think this is great! If we have all these tests, we can be more confident when we merge PRs. You might have already thought of this but just one general comment is that I think it may take quite some time to cover all of these (e.g. all the metrics for all the tasks) so maybe we can write more unit tests for the sub-modules of the system? For example, we can test bucketing by itself without actually going through the loading -> processing process. But I think end-to-end tests can still be very useful to ensure the more important test cases work.
Aha, thanks! This ("For example, we can test bucketing by itself without actually going through the loading -> processing process") makes sense! I think I will do both. I find task-by-task testing pretty rewarding; it also pushes me to examine things deeply, just like a user would.
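
To make the "test bucketing by itself" idea concrete, here is a self-contained sketch. The `bucket_equal_width` function is only a stand-in so the test can run on its own; the real unit test would import ExplainaBoard's bucketing module instead.

```python
import unittest


def bucket_equal_width(values, n_buckets):
    """Stand-in equal-width bucketing: split [min, max] into n_buckets
    intervals and return the values falling into each interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0  # avoid zero width when all values are equal
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        idx = min(int((v - lo) / width), n_buckets - 1)
        buckets[idx].append(v)
    return buckets


class TestBucketing(unittest.TestCase):
    def test_bucket_number_and_coverage(self):
        values = [0.1, 0.4, 0.5, 0.9, 1.3, 2.0]
        buckets = bucket_equal_width(values, n_buckets=4)
        # The number of buckets should match the requested bucket number ...
        self.assertEqual(len(buckets), 4)
        # ... and no sample should be lost or duplicated by bucketing.
        self.assertEqual(sum(len(b) for b in buckets), len(values))


if __name__ == "__main__":
    unittest.main()
```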