tableone
Wrong percentage calculated when categorical column includes missing values.
When your categorical variables include missing values, a wrong percentage gets calculated.
See Parameter1:
Currently, percentages for categorical variables are calculated with the total being the number of non-missing values for that variable.
In my opinion, the percentage in cell E3, for example, should be 17.1, because what you want to know is how often Parameter1 Category 1.0 occurred in the validation_set: 104/609.
A quick workaround is to replace all NaN values in the categorical variables with a sentinel number and then drop the rows for that number from the table:
df[categorical] = df[categorical].replace(np.nan, 199848)
mytable = mytable.tableone[mytable.tableone.index.get_level_values(1) != "199848.0"]
Great library though!
@rherman9 thanks for picking this up. We'll take a look!
To reproduce this issue:
import pandas as pd
from tableone import TableOne
df = pd.DataFrame(
{'cats': ["1", "2", "3", "4", None, None],
'set': ["train","train", "val", "val", "val", "val"]}
)
t = TableOne(df, groupby="set")
print(t.tabulate(headers=None, tablefmt="github"))
Output:
| | | Missing | Overall | train | val |
|---|---|---|---|---|---|
| n | | | 6 | 2 | 4 |
| cats, n (%) | 1 | 2 | 1 (25.0) | 1 (50.0) | |
| | 2 | | 1 (25.0) | 1 (50.0) | |
| | 3 | | 1 (25.0) | | 1 (50.0) |
| | 4 | | 1 (25.0) | | 1 (50.0) |
Expected output:
| | | Missing | Overall | train | val |
|---|---|---|---|---|---|
| n | | | 6 | 2 | 4 |
| cats, n (%) | 1 | 2 | 1 (25.0) | 1 (50.0) | |
| | 2 | | 1 (25.0) | 1 (50.0) | |
| | 3 | | 1 (25.0) | | 1 (25.0) |
| | 4 | | 1 (25.0) | | 1 (25.0) |
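The difference between the two tables comes down to which denominator is used within each group. As a rough illustration only (this is plain Python, not tableone's internals, and `category_percentages` is a hypothetical helper), the two behaviours can be sketched on the reproduction data like this:

```python
from collections import Counter

# Same data as the reproduction above; None marks a missing value.
cats = ["1", "2", "3", "4", None, None]
sets = ["train", "train", "val", "val", "val", "val"]


def category_percentages(group, include_missing_in_total):
    """Return {category: percentage} for one group, rounded to one decimal."""
    values = [c for c, s in zip(cats, sets) if s == group]
    observed = [v for v in values if v is not None]
    # The denominator is the only thing that differs between the two behaviours.
    total = len(values) if include_missing_in_total else len(observed)
    counts = Counter(observed)
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}


# Denominator = non-missing values only (the reported behaviour).
print(category_percentages("val", include_missing_in_total=False))
# Denominator = all rows in the group (the expected behaviour).
print(category_percentages("val", include_missing_in_total=True))
```

The first call gives 50.0 for categories 3 and 4 in `val`, the second gives 25.0, matching the "Output" and "Expected output" tables respectively. The `train` column is unaffected because it has no missing values.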
The best fix for this might be to treat NaN/None etc as a category? @lbulgarelli any thoughts?
It is a good idea to add missing as a category in its own right, especially because it would make it easy to compare missing values between groups.
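To make the idea concrete, here is a minimal pandas sketch of what "missing as a category" could look like on the reproduction data. It only illustrates the calculation and does not show how tableone would implement it; the `"Missing"` label is an arbitrary choice for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "cats": ["1", "2", "3", "4", None, None],
    "set": ["train", "train", "val", "val", "val", "val"],
})

# Give missing values an explicit label so they are counted like any other level.
# "Missing" is an assumed label, not something tableone defines.
labelled = df["cats"].fillna("Missing")

# Count each level within each group, then divide by the group size.
counts = pd.crosstab(labelled, df["set"])
percent = (100 * counts / counts.sum(axis=0)).round(1)
print(percent)
```

With the missing level included, every column sums to 100, and the `val` column shows directly that half of its rows have no value for `cats`.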
That said, the number of missing alone is not very informative for non-categorical variables, so I'd also probably hide that information by default, with the option to display it.
> That said, the number of missing alone is not very informative for non-categorical variables, so I'd also probably hide that information by default, with the option to display it.
I feel like it's pretty important to know how many data points are missing, even for continuous variables. If you're reporting a summary statistic and it is based on a small proportion of your overall data, it feels like it would be good to know.
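As a sketch of why the missing count matters for a continuous variable, the snippet below reports the group size and the number of missing values next to the mean. The `age` column and its values are made up for illustration, and this is plain pandas rather than anything tableone provides.

```python
import pandas as pd

df = pd.DataFrame({
    # Hypothetical continuous variable with some missing values.
    "age": [34.0, None, 51.0, None, None, 47.0],
    "set": ["train", "train", "val", "val", "val", "val"],
})

grouped = df.groupby("set")["age"]
summary = pd.DataFrame({
    "n": grouped.size(),                           # rows in the group, missing included
    "n_missing": grouped.size() - grouped.count(), # rows with no value
    "mean": grouped.mean(),                        # NaN values are skipped by default
})
print(summary)
```

Here the mean for `val` is based on only two of its four rows, which is not visible unless the missing count is shown alongside it.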
@jraffa any thoughts on this conversation? (how to handle missing values for categorical and continuous variables).