
Support multilevel groupby

Open jtleider opened this issue 5 years ago • 2 comments

Hi,

This code closes #18, adding support for multilevel groupby. It also fixes a bug where in some cases descriptives for categorical and continuous variables were being shown in separate columns if a dtype category groupby variable was used.

Best, Julien

jtleider avatar Aug 05 '18 17:08 jtleider

Excellent, thanks again Julien. This is something that I've been putting off for a while! @jraffa, if possible, please could you take a look at this change from a user perspective?

Two things in particular that we need to think about are (1) if/how p-values should be reported for multilevel grouping and (2) how n (%) should be reported for categorical variables.

tompollard avatar Aug 06 '18 17:08 tompollard

Couple of comments:

  1. Percentages: It seems that, within a (row) variable, the column percentages add up to 100%. This is fine, but may not be the desired result. I wonder whether an option to compute percentages by row, or by row within the first tier of the column variable, is a good idea or complicates things too much. I usually think about what the denominator is. When setting groupby = ['death','MechVent']:

a. Columnwise: denominator for the first column of the ICU variable is 110+50+205+103 = 468 (as in the table header).
b. Rowwise: for CCU, 110+27+11+14 = 162.
c. Rowwise within death=0: 110+27 = 137.

Columnwise is probably a good default. Should probably be explained somewhere in the docs.
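To make the three denominators concrete, here is a minimal sketch in plain Python that uses only the counts quoted above. It assumes that the first column is (death=0, MechVent=0) and that the CCU counts are listed in column order; the variable names are illustrative and not part of tableone.

```python
# Counts quoted in the comment above for groupby = ['death', 'MechVent'].
# Assumption: the first column is (death=0, MechVent=0) and CCU is its first level.
first_column_counts = [110, 50, 205, 103]

# Assumption: the CCU row is listed in column order, keyed here by (death, MechVent).
ccu_row_counts = {(0, 0): 110, (0, 1): 27, (1, 0): 11, (1, 1): 14}

# The cell of interest: CCU patients with death=0 and MechVent=0.
cell = ccu_row_counts[(0, 0)]

# a. Columnwise: everyone in the first column.
columnwise = sum(first_column_counts)

# b. Rowwise: every CCU patient, across all columns.
rowwise = sum(ccu_row_counts.values())

# c. Rowwise within death=0: CCU patients in the death=0 tier only.
rowwise_within_death0 = sum(
    count for (death, _vent), count in ccu_row_counts.items() if death == 0
)

denominators = {
    "columnwise": columnwise,
    "rowwise": rowwise,
    "rowwise within death=0": rowwise_within_death0,
}

for label, denominator in denominators.items():
    print(f"{label}: {cell}/{denominator} = {100 * cell / denominator:.1f}%")
```

The same cell count of 110 gives a very different percentage under each choice, which is why the default probably deserves a sentence in the docs.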

  2. Hypothesis testing: The present way of doing the testing seems to take the column levels (n and m levels) and make n*m groups. So setting groupby = ['death','MechVent'] results in a comparison via, e.g., one-way ANOVA with 4 levels (0.0, 0.1, 1.0, 1.1). This seems to be an OK behaviour. In theory two-way or multi-way ANOVA is possible, but it results in two or more p-values (with no interaction). Instead of multi-way ANOVA, I think it's more likely that someone would want to compare the values within a level of the first tier of a column, e.g., compare the mean SysABP among those who died: 122.51 (35.68) vs. 110.24 (39.40) for those with vent and no vent, resulting in separate p-values for death=0 and death=1. So I would have these two potential methods:

a. If factor one has n levels and factor two has m levels: by default, treat the n*m crossed levels as groups and do a single test with n*m-1 degrees of freedom (as currently done).
b. The other is to stratify into n groups, and do the testing within each group on the m levels of factor two.

I think type b. is probably more intuitive to someone who hasn't read the docs. But I could see the argument for the other side as well.
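The difference between methods a. and b. can be sketched as follows. This is not tableone's implementation: the rows are made-up values for illustration only, and `one_way_f` is a hypothetical helper that returns just the one-way ANOVA F statistic and its degrees of freedom. In practice one would likely call something like `scipy.stats.f_oneway` to get p-values.

```python
from collections import defaultdict


def one_way_f(samples):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total
    means = [sum(s) / len(s) for s in samples]
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    df_between, df_within = k - 1, n_total - k
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within


# Made-up (death, MechVent, SysABP) rows, for illustration only.
rows = [
    (0, 0, 118.0), (0, 0, 125.0), (0, 0, 131.0),
    (0, 1, 109.0), (0, 1, 115.0), (0, 1, 121.0),
    (1, 0, 102.0), (1, 0, 111.0), (1, 0, 117.0),
    (1, 1, 95.0), (1, 1, 104.0), (1, 1, 113.0),
]

# Method a: cross the two factors into n*m groups and run one test.
crossed = defaultdict(list)
for death, vent, sysabp in rows:
    crossed[(death, vent)].append(sysabp)
f_crossed, df_between_crossed, df_within_crossed = one_way_f(list(crossed.values()))

# Method b: stratify by the first factor, then test the levels of the
# second factor separately within each stratum.
stratified = defaultdict(lambda: defaultdict(list))
for death, vent, sysabp in rows:
    stratified[death][vent].append(sysabp)
within_stratum = {
    death: one_way_f(list(by_vent.values()))
    for death, by_vent in stratified.items()
}

print("a. crossed groups:", len(crossed), "df_between:", df_between_crossed)
for death, (f_stat, df_b, df_w) in within_stratum.items():
    print(f"b. death={death}: df_between={df_b}, df_within={df_w}")
```

With n = m = 2, method a. gives one test with n*m-1 = 3 between-group degrees of freedom, while method b. gives n = 2 separate tests, each with m-1 = 1 degree of freedom, so each first-tier level would get its own p-value.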

Let me know if I have confused you.

jraffa avatar Aug 10 '18 17:08 jraffa