orange3
Order by relevance to subgroups in Box Plot should consider the value of the statistics as well
What's wrong?
Box Plot can sort features according to their relevance to subgroups. The score used for ordering is the p-value of the corresponding statistic, which is either the t-test, ANOVA, or chi-square.
The problem emerges if the (rounded?) p-value for a set of features is 0. Then it may happen, for instance, that features with the same p-value but different t-test values are sorted randomly. Here, we would expect the features with a higher value of the statistic to be listed first. For instance, in the attached workflow, where we split the data according to HDI, this particular feature, that is, HDI, should be listed first. In this example, HDI has t=20.3 and estimated gross national income has t=13.4, yet the latter appears first in the list.

boxplot-ranking-problem.ows.zip
What's your environment?
MacOSX, latest dmg version
Note to whomever implements this: just change compute_score and compute_stat to return tuple (p, -F) instead of just p (where F is the value of ANOVA statistic). For tests, mock f_oneway.
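A minimal sketch of why a (p, -F) tuple breaks ties, assuming scores are available as plain numbers. The feature names and values below are hypothetical and this is not the widget's actual code; it only shows that Python compares tuples element by element, so equal p-values fall back to the larger F.

```python
# Hypothetical (name, p, F) triples; names and numbers are illustrative only.
scores = [
    ("feature_a", 0.0, 187.0),
    ("feature_b", 0.0, 249.0),
    ("feature_c", 0.03, 4.8),
]


def sort_key(item):
    """Sort by ascending p-value, then by descending F (hence the minus)."""
    _name, p, f = item
    return (p, -f)


# Tuples compare lexicographically: ties in p are resolved by -F.
ranked = [name for name, _p, _f in sorted(scores, key=sort_key)]
print(ranked)
```

With a bare p as the key, feature_a and feature_b would compare as equal and their relative order would depend only on the input order.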
This problem is subtler
Attributes are sorted by the p-value of ANOVA, disregarding the number of groups. When showing statistics, though, we compute a t-test for binary groups (and ANOVA otherwise). Estimated GNI has an F-value of 249, while HDI has 187. The ordering is thus correct.
The t-test is corrected for unequal variance of groups. Because of this correction, it is not equivalent to ANOVA. Solutions:
- When sorting, compute t-test for binary and ANOVA for non-binary. Bad because I think we should compute the same for all.
- Show ANOVA also for binary. Weird. And also same as 3, essentially.
- Show t-test without correction for unequal variance. Bad because variances are not equal.
- Keep it as it is and explain this in the documentation. Bad because it looks wrong to the user.
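The mismatch between the displayed t-test and the ANOVA used for sorting can be illustrated with a small standard-library sketch. The data and helper names are made up for illustration; in practice one would presumably use scipy.stats.f_oneway and scipy.stats.ttest_ind with equal_var=False. For two groups, F equals the square of the equal-variance t, while the unequal-variance (Welch) t generally does not.

```python
from statistics import mean, variance


def pooled_t(a, b):
    """Student's t statistic, assuming equal variances in both groups."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5


def welch_t(a, b):
    """Welch's t statistic, which does not assume equal variances."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / se2 ** 0.5


def anova_f(*groups):
    """One-way ANOVA F statistic."""
    values = [x for g in groups for x in g]
    grand, k, n = mean(values), len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Illustrative groups with unequal sizes and very different variances.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [10.0, 30.0, 50.0]

f_value = anova_f(a, b)
print(f_value, pooled_t(a, b) ** 2, welch_t(a, b) ** 2)
```

Note that with equal group sizes the two t statistics coincide (only the degrees of freedom differ), so the discrepancy is visible only when group sizes and variances both differ.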
I think I know a partial solution, hence reopening the issue:
- if all variables are binary, they can actually be ranked by the same t-test that is displayed (instead of by ANOVA, which is equivalent to t-test with equal variance);
- if all variables are numeric, there should be no problem as it is (check!)
- otherwise, we keep it as it is.