orange3 Order by relevance to subgroups in Box Plot should consider the value of the statistics as well

Open BlazZupan opened this issue 2 years ago • 3 comments

What's wrong?

Box Plot can sort features according to their relevance to subgroups. The score used for ordering is the p-value of the corresponding statistic, which is either a t-test, ANOVA, or chi-square.

The problem emerges if the (rounded?) p-value for a set of features is 0. Then it may happen, for instance, that features with the same p-value but different t-test values are sorted randomly. Here, we would expect features with a higher value of the statistic to be listed first. For instance, in the attached workflow, where we split the data according to HDI, this particular feature, that is, HDI, should be listed first. In this example, HDI has t=20.3 and estimated gross national income has t=13.4, yet the latter appears first in the list.

boxplot-ranking-problem.ows.zip

What's your environment? MacOSX, latest dmg version

Jul 21 '22 08:07 BlazZupan

Note to whoever implements this: just change compute_score and compute_stat to return a tuple (p, -F) instead of just p (where F is the value of the ANOVA statistic). For tests, mock f_oneway.
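A minimal sketch of why a (p, -F) tuple breaks ties, not the actual Orange code: `FeatureScore` and `sort_key` are hypothetical names, the two statistics for HDI and estimated GNI are the t-values from the report above, and the third feature is made up for illustration.

```python
from typing import NamedTuple


class FeatureScore(NamedTuple):
    """Hypothetical container for one feature's test result."""
    name: str
    p_value: float
    statistic: float  # value of the test statistic (F, t, ...)


def sort_key(score: FeatureScore) -> tuple:
    # Smaller p-value first; among equal p-values, larger |statistic| first.
    return (score.p_value, -abs(score.statistic))


scores = [
    FeatureScore("Estimated gross national income", 0.0, 13.4),
    FeatureScore("HDI", 0.0, 20.3),
    FeatureScore("Made-up feature", 0.03, 2.2),  # illustrative only
]

ranked = [s.name for s in sorted(scores, key=sort_key)]
print(ranked)
```

Python compares tuples element by element, so the statistic only matters when the p-values are exactly equal, which is the case described in the report.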

Jul 22 '22 20:07 janezd

This problem is subtler

Attributes are sorted by the p-value of ANOVA, disregarding the number of groups. When showing statistics, though, we compute a t-test for binary groups (and ANOVA otherwise). Estimated GNI has an F-value of 249, while HDI has 187. The ordering is thus correct.

T-test is corrected for unequal variance of groups. Because of this correction, it is not equivalent to ANOVA. Solutions:

1. When sorting, compute t-test for binary and ANOVA for non-binary. Bad because I think we should compute the same for all.
2. Show ANOVA also for binary. Weird. And also same as 3, essentially.
3. Show t-test without correction for unequal variance. Bad because variances are not equal.
4. Keep it as it is and explain this in the documentation. Bad because it looks wrong to the user.
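The non-equivalence is easy to check numerically. Below is a rough sketch in plain Python with made-up data (not the data from the workflow); `pooled_t`, `welch_t` and `anova_f` are hypothetical helpers written from the textbook formulas, not Orange or SciPy functions. For two groups, the square of the equal-variance t statistic equals the ANOVA F, while the Welch-corrected t does not.

```python
from math import isclose, sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Student's t statistic, assuming equal variances in both groups."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))


def welch_t(a, b):
    """Welch's t statistic, which does not assume equal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def anova_f(*groups):
    """One-way ANOVA F statistic."""
    values = [x for g in groups for x in g]
    grand = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up groups with different sizes and very different variances.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [10.0, 20.0, 30.0]

t_pooled, t_welch, f = pooled_t(a, b), welch_t(a, b), anova_f(a, b)
print("pooled t^2 equals F:", isclose(t_pooled ** 2, f))
print("Welch t^2 equals F: ", isclose(t_welch ** 2, f))
```

With unequal group sizes and variances the two t statistics diverge, so a ranking based on ANOVA p-values can disagree with the displayed Welch t-values, as observed for HDI and GNI.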

Mar 15 '23 19:03 janezd

I think I know a partial solution, hence reopening the issue:

if all variables are binary, they can actually be ranked by the same t-test that is displayed (instead of by ANOVA, which is equivalent to t-test with equal variance);
if all variables are numeric, there should be no problem as it is (check!)
otherwise, we keep it as it is.

Jan 08 '24 13:01 janezd
