
The number of buckets in analysis files is inconsistent with the predefined number.

Open jinlanfu opened this issue 2 years ago • 3 comments

Some values of a feature account for far more samples than others. For example, with the lexical_richness feature for text classification, samples with lexical_richness=1 make up 1/3 of the dataset. These samples should be assigned to the (last) bucket first, so that the total number of buckets stays equal to the predefined number.
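The idea can be sketched as follows. This is only an illustration under stated assumptions, not ExplainaBoard's actual bucketing code: the function name `bucket_heavy_values_first` and the toy data are hypothetical, and a "heavy" value is assumed to be one whose count exceeds the average bucket size.

```python
from collections import Counter


def bucket_heavy_values_first(values, num_buckets):
    """Hypothetical sketch: give each over-represented value its own bucket
    first, then split the remaining samples evenly over the other buckets."""
    target = len(values) / num_buckets  # average bucket size
    counts = Counter(values)
    # Values that, on their own, exceed the average bucket size.
    heavy = sorted(v for v, c in counts.items() if c > target)
    heavy_set = set(heavy)
    rest = sorted(v for v in values if v not in heavy_set)
    # Fewer than num_buckets values can be heavy, so k is always at least 1.
    k = num_buckets - len(heavy)
    n = len(rest)
    # Cut the remaining samples into k nearly equal slices
    # (identical values may be split here; acceptable for a sketch).
    slices = [rest[i * n // k:(i + 1) * n // k] for i in range(k)]
    buckets = [s for s in slices if s] + [[v] * counts[v] for v in heavy]
    return sorted(buckets, key=lambda b: b[0])


# Toy data: 12 samples, 4 of them (1/3) with the value 1.0.
values = [0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.92, 0.95, 1.0, 1.0, 1.0, 1.0]
buckets = bucket_heavy_values_first(values, 4)
for b in buckets:
    print((b[0], b[-1]), len(b))
```

With this toy data the value 1.0 gets the last bucket to itself, and the other eight samples are spread over the remaining three buckets, so four buckets come back.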

jinlanfu avatar Apr 02 '22 04:04 jinlanfu

Thanks @jinlanfu ! For this issue (and the other one you recently filed), it'd be easier for us to fix it if you provide an example that allows us to reproduce the undesired behavior.

neubig avatar Apr 02 '22 10:04 neubig

Only three buckets are returned when I set the number of buckets to 4 (see the following figure). The likely reason is that there are many samples with lexical_richness=1, so the last bucket, "(0.9111111111111111, 1.0)", is much larger than the average bucket size (num_samples_of_data/num_buckets). ExplainaBoard already supports predefined bucket intervals, which Pengfei shared with me. However, the intervals can currently only be changed by modifying the source code. Should we improve this so that users can easily set the bucket intervals themselves?

Screenshot 2022-04-03 8:47:19 AM
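The effect can be reproduced in isolation with a small script. This is a minimal sketch, assuming a greedy equal-frequency strategy that never splits identical values across buckets; the function `bucket_by_frequency` and the data are made up for illustration and are not taken from the ExplainaBoard source.

```python
from collections import Counter


def bucket_by_frequency(values, num_buckets):
    """Hypothetical greedy bucketing: close a bucket once it holds at least
    the average number of samples, never splitting identical values."""
    counts = sorted(Counter(values).items())  # (value, count), ascending by value
    target = len(values) / num_buckets        # average bucket size
    buckets, current, size = [], [], 0
    for value, count in counts:
        current.append(value)
        size += count
        if size >= target:
            buckets.append((current[0], current[-1], size))
            current, size = [], 0
    if current:  # leftover values that never reached the target
        buckets.append((current[0], current[-1], size))
    return buckets


# 12 samples, 4 of them (1/3) with the value 1.0; 4 buckets requested.
values = [0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.92, 0.95, 1.0, 1.0, 1.0, 1.0]
buckets = bucket_by_frequency(values, 4)
for low, high, size in buckets:
    print((low, high), size)
```

Here the block of identical 1.0 values is absorbed into the third bucket, which ends up with 6 samples against an average of 3, and no samples are left for a fourth bucket.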

jinlanfu avatar Apr 03 '22 01:04 jinlanfu

Thanks! Could you provide a file and command line argument that allows us to reproduce the behavior (a minimal reproducible example)? Based on this we'll fix the immediate problem.

In addition, we definitely need to make the features more configurable. I think this PR will make it much easier: https://github.com/neulab/ExplainaBoard/pull/194

neubig avatar Apr 03 '22 01:04 neubig