ietoolkit
[iebaltab] : use chi-square for categorical variables
Feedback received through DIME network:
I have been using the iebaltab command for my dissertation and it seems to be using t-tests to compare groups of categorical variables. I believe chi-square tests are more appropriate than t-tests for categorical variables.
The commands table1 in Stata and package tableone in R use chi-square tests for categorical variables.
What do we want to do with this? I think the key is to understand the econometric significance of "more appropriate".
I looked into the table1 command. There you have to specify the type (binary, categorical or continuous) of each variable. Supporting that would require a complete rework of the command's workflow (more than what we are already planning).
I think we have to do that if t-tests on categorical variables are incorrect. Then we have no choice. If chi-square tests are simply better, then I think it comes down to how much better. I see these three options.
1. Implement and update the command and how it is specified. Backward compatibility will be very difficult.
2. Keep as is and make sure we explain what we do, and even point to the table1 command for someone who wants to do different tests for different types of variables.
If chi-square tests are only a little better but t-tests are still valid, then I think we should go with 2, and say that with iebaltab we have gone for ease of use, but make it clear that there are other commands that handle this in a more tailored way.
hmm, I think this requires some digging into the literature. I have been asked about chi-square tests before, but don't remember in which context. John offered to help once with the metrics part of the command, so I can get in touch with him for help on this.
A chi-square test is appropriate when the Stata variable is coded as a true categorical, since Stata stores categoricals as numbers and is more than happy to give you nonsense results by regressing on the underlying numerical values. However, most commands do not try to detect or correct this, as it is almost impossible to do (since, for example, integer-continuous variables like age always look categorical but are not). So it is considered the user's responsibility to check these, and for balance it is common practice to check the set of binaries given by tab, gen() using t-tests. So I would say the existing functionality is standard, not incorrect.
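The common practice described above can be sketched with built-in Stata commands. This is a minimal illustration, not what iebaltab does internally: the auto.dta example data, the use of foreign as a stand-in for a treatment variable, and the rep78_ prefix are all assumptions made for the example.

```stata
* Sketch only: auto.dta stands in for a real data set,
* and foreign plays the role of the treatment variable
sysuse auto, clear

* rep78 is a true categorical (repair record, coded 1-5).
* tabulate, generate() creates one dummy per observed category:
* rep78_1, rep78_2, ..., rep78_5
tabulate rep78, generate(rep78_)

* Run a two-sample t-test on each dummy across the two groups
foreach dummy of varlist rep78_* {
    ttest `dummy', by(foreign)
}
```

Each t-test compares the share of observations in one category across the two groups, so the result is one test per category rather than a single test for the variable as a whole.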
However! It would be a great contribution to have an option or command that correctly handles true categoricals. For example, a categoricals() option that takes an additional varlist, or an iebalcat command that only takes these, could then split them up and use the chi-square appropriately on the variable as a whole. It would basically be a wrapper and output for the process described here: https://www.cdc.gov/nchs/tutorials/NHANES/NHANESAnalyses/HypothesisTesting/Task3c.htm so it may not be too much work since the output format is already figured.
(Note, however, that controls are harder to implement there...)
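For comparison, a chi-square test on a categorical variable as a whole can be run with Stata's built-in tabulate. This is only a sketch of the kind of test such an option or command might wrap: categoricals() and iebalcat are proposals in this thread, not existing features, and the auto.dta variables are stand-ins. It also ignores survey design and controls.

```stata
* Sketch only: auto.dta stands in for a real data set,
* and foreign plays the role of the treatment variable
sysuse auto, clear

* Pearson chi-square test of independence between the
* categorical variable (rep78) and the group variable (foreign)
tabulate rep78 foreign, chi2

* tabulate stores the test statistic and its p-value in r()
display "chi2 = " r(chi2) ", p = " r(p)
```

This gives a single test for the whole variable instead of one t-test per category. When expected cell counts are small, the exact option of tabulate (Fisher's exact test) may be the safer choice.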
This is how we will move forward with this issue:
1. Based on this feedback, update the help file with a section on categorical variables. I will do that myself. We will say that at this point we do not support chi-square tests of categorical variables, and tell people that the best iebaltab can currently do for them is to have them turn the categorical variable into multiple dummies, so that iebaltab can run t-tests on all those dummies.
2. Keep doing the rewrite of iebaltab, and while doing so keep in the back of our heads how we can integrate categorical variables and run chi-square tests on them.
3. If we do not think of a way to do so when rewriting, then create a new command called iebalcat that is a balance table command specifically for categorical variables.
Modified help file in 51135d91a8e84d524aeba8e3d18de3097444c336 to do task 1 in list above.