performance
Assumption checks for `htest` objects
Would be nice to be able to check the assumptions:

- `t.test`:
  - equal var (for un-paired)
  - normality of residuals
- `chisq.test`/`prop.test`:
  - N per cell is > 5 or 10
- `cor.test`:
  - multivariate normality?
- `oneway.test`:
  - same as for `aov()`
- Etc...
I would love to see this feature implemented in the future and would certainly use it to stick as much as possible to the easyverse in my workflow. 👌
This should be it.
Things I'm not sure about:

- When equal variance is not assumed, is it correct to assess normality for each group separately?
- For `cor.test` - only multivariate normality?
- How to test symmetry for the Wilcoxon test?
# t.test ------------------------------------------------------------------
## Two sample t-test ------
tt <- t.test(mtcars$mpg, mtcars$hp, var.equal = TRUE)
data <- insight::get_data(tt)
data <- datawizard::data_to_long(data)
m <- lm(Value ~ factor(Name), data = data)
performance::check_normality(m)
performance::check_homogeneity(m)
## Welch Two Sample t-test ------
tt <- t.test(mtcars$mpg, mtcars$hp, var.equal = FALSE)
data <- insight::get_data(tt)
m1 <- lm(data[[1]] ~ 1)
m2 <- lm(data[[2]] ~ 1)
performance::check_normality(m1)
performance::check_normality(m2)
## One sample t-test ------
tt <- t.test(mtcars$mpg)
data <- insight::get_data(tt)
m <- lm(data[[1]] ~ 1)
performance::check_normality(m)
## Paired sample t-test ------
tt <- t.test(mtcars$mpg, mtcars$hp, paired = TRUE)
data <- insight::get_data(tt)
d <- data[[1]] - data[[2]]
m <- lm(d ~ 1)
performance::check_normality(m)
# chisq.test --------------------------------------------------------------
## xtab --------
M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477))) / 10
Xsq <- chisq.test(M)
# One of these?
all(Xsq$expected >= 10)
all(Xsq$expected >= 5)
mean(Xsq$expected >= 5) >= .8
## goodness of fit ---------
x <- c(89,37,30,28,2)
p <- c(40,20,20,15,5)
Xsq <- chisq.test(x, p = p, rescale.p = TRUE)
# One of these?
all(Xsq$expected >= 10)
all(Xsq$expected >= 5)
mean(Xsq$expected >= 5) >= .8
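The candidate thresholds above could be wrapped into a single helper; here is a minimal sketch (`check_cellsize()` is a hypothetical name invented here, not an existing `performance` function):

```r
# Hypothetical helper: flag chi-squared tests whose expected cell counts
# fall below a threshold (5 and 10 are the conventional cut-offs above).
check_cellsize <- function(x, threshold = 5) {
  expected <- x$expected
  ok <- all(expected >= threshold)
  message(sprintf(
    "%d of %d cells have expected counts >= %s.",
    sum(expected >= threshold), length(expected), threshold
  ))
  invisible(ok)
}
```

Usage would be e.g. `check_cellsize(Xsq, threshold = 10)`, with the `mean(...) >= .8` variant as an alternative rule inside the same function.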
# oneway.test -------------------------------------------------------------
## assuming equal variances ----------
Ow <- oneway.test(sleep$extra ~ sleep$group, var.equal = TRUE)
data <- insight::get_data(Ow)
m <- aov(reformulate(names(data)[2], response = names(data)[1]), data = data)
performance::check_normality(m)
performance::check_homogeneity(m)
## not assuming equal variances ----------
Ow <- oneway.test(sleep$extra ~ sleep$group)
data <- insight::get_data(Ow)
by(data[[1]], INDICES = list(data[[2]]),
FUN = function(y) {
performance::check_normality(lm(y ~ 1))
})
# cor.test ----------------------------------------------------------------
## pearson -------
x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y <- c( 2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)
r <- cor.test(x, y)
data <- insight::get_data(r)
MVN::mvn(data, mvnTest = "hz")$multivariateNormality
distances <- mahalanobis(data,
center = colMeans(data),
cov = cov(data))
car::qqPlot(distances, ylab = "Mahalanobis distances (Squared)",
distribution = "chisq", df = mean(distances))
# var.test ----------------------------------------------------------------
x <- rnorm(50, mean = 0, sd = 2)
y <- rnorm(30, mean = 1, sd = 1)
V <- var.test(x, y)
data <- insight::get_data(V)
m1 <- lm(data[[1]] ~ 1)
m2 <- lm(data[[2]] ~ 1)
performance::check_normality(m1)
performance::check_normality(m2)
When equal variance is not assumed, is it correct to assess normality for each group separately?
Sure. Holm-correct any p-values.
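Concretely, for the per-group checks sketched above, that could look like this (a base-R sketch using `shapiro.test()` for brevity; `performance::check_normality()` per group would be analogous):

```r
# Group-wise Shapiro-Wilk p-values, Holm-corrected across groups
# (sleep is the built-in dataset used in the oneway.test example).
p_values <- with(sleep, tapply(extra, group, function(y) shapiro.test(y)$p.value))
p.adjust(p_values, method = "holm")
```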
For `cor.test` - only multivariate normality?
Yes, this is analogous to normality of residuals.
How to test symmetry for the Wilcoxon test?
A statistic is presented on page 33 here: https://etheses.bham.ac.uk/id/eprint/4607/1/Voraprateep13MRes.pdf
When equal variance is not assumed, is it correct to assess normality for each group separately?
(For what it's worth) I asked a question related to this on Cross Validated once; the answer suggested that if you have categorical predictors (e.g., groups), you should check assumptions directly on the raw data rather than on the residuals. I was trying to use check_model() for a planned-contrasts lm model, but the resulting figures didn't seem to make much sense to me.
The distributional assumptions are the same with categorical predictors as with numerical predictors; namely, the conditional distributions are all assumed to be normal with common variance. However, with categorical predictors you do not have to resort to looking at residuals, because you can investigate the conditional distributions directly by subsetting the data for the different levels of the categorical predictor.
https://stats.stackexchange.com/questions/502175/how-to-test-assumptions-with-a-categorical-predictor-planned-contrasts-in-line/502177
Given that I was advised to rely on diagnostic plots rather than objective tests if possible, the way I've been trying to handle this so far is e.g.,
rempsyc::nice_normality(data = iris,
                        variable = "Sepal.Length",
                        group = "Species",
                        shapiro = TRUE)
Residuals and raw data if you only have categorical predictors in a linear model have the same distributional shape, they are only mean-shifted (because the categorical predictors reflect group means).
It is generally preferred to still use residuals so that the mean residual is zero and that things like standardized residuals follow expected distributions (eg, as part of influential points checks or homoskedasticity checks).
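The mean-shift point is easy to verify with the sleep data used above:

```r
# With only a categorical predictor, residuals equal the raw response
# minus its group means (i.e., the raw data, mean-shifted per group).
res <- residuals(lm(extra ~ group, data = sleep))
shifted <- sleep$extra - ave(sleep$extra, sleep$group)
all.equal(unname(res), shifted)
#> [1] TRUE
```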
- [ ] I'll start by implementing `.htest` methods where needed.
- [ ] Write a short vignette?
- [ ] Try and make `check_model.htest`? Might be easier to just do `check_htest()` instead?
How to test symmetry for the Wilcoxon test?
A statistic is presented on page 33 here: https://etheses.bham.ac.uk/id/eprint/4607/1/Voraprateep13MRes.pdf
check_symmetry <- function(x, ...) {
  f_ <- bayestestR::density_at(x, x)
  if (var(f_) == 0) {
    Sym <- 0
  } else {
    F_ <- ecdf(x)(x)
    Sym <- -cor(f_, F_)
  }
  Sym
}
check_symmetry(z <- rexp(1000))
#> [1] 0.9375129
check_symmetry(-z)
#> [1] -0.9375129
check_symmetry(x <- rchisq(1000, 3))
#> [1] 0.8415241
check_symmetry(x * 14 + 200)
#> [1] 0.841458
check_symmetry(rchisq(1000, 300))
#> [1] 0.05684856
check_symmetry(rnorm(1000))
#> [1] -0.01754306
check_symmetry(rt(1000, 3))
#> [1] -0.02556873
check_symmetry(runif(1000))
#> [1] 0.3435105
Created on 2022-07-05 by the reprex package (v2.0.1)
Not sure if there is an easy way to actually test that this value is not 0.
Not sure if there is an easy way to actually test that this value is not 0.
value != 0
😬
lmao
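More seriously, one generic option (not implemented anywhere in `performance`; `symmetry_pvalue()` is an invented name) is a sign-flip permutation test: under the null of symmetry about the median, randomly flipping the signs of the median-centered values leaves the distribution unchanged, so the observed statistic can be compared against sign-flipped replicates:

```r
# Permutation p-value for any symmetry statistic: sign-flip the
# median-centered data to simulate draws from a symmetric null.
symmetry_pvalue <- function(x, stat, nsim = 500) {
  m <- median(x)
  xc <- x - m
  observed <- abs(stat(x))
  null <- replicate(nsim, {
    flipped <- m + xc * sample(c(-1, 1), length(x), replace = TRUE)
    abs(stat(flipped))
  })
  mean(null >= observed)
}
```

E.g. `symmetry_pvalue(x, check_symmetry)` would work with either index above, at the cost of `nsim` evaluations of the statistic.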
Might be easier to just do check_htest() instead?
also, I think it's best to stick with check_model, most novice users using h-tests probably don't actually know they are h-tests (arguably, they also don't realize that they are "models" but still, I think it's fair to assume that our beast function check_model would work for h-tests instead of a bespoke function)
Here is another index, based on https://arxiv.org/pdf/2005.09960.pdf (which has a p-to-statistic table for normal and uniform nulls...)
check_symmetry2 <- function(x) {
(cor(sort(x), sort(x, TRUE)) + 1)/2
}
check_symmetry2(z <- rexp(1000))
#> [1] 0.1714946
check_symmetry2(-z)
#> [1] 0.1714946
check_symmetry2(x <- rchisq(1000, 3))
#> [1] 0.1193973
check_symmetry2(x * 14 + 200)
#> [1] 0.1193973
check_symmetry2(rchisq(1000, 300))
#> [1] 0.0009236334
check_symmetry2(rnorm(1000))
#> [1] 0.001111619
check_symmetry2(rt(1000, 3))
#> [1] 0.004868547
check_symmetry2(runif(1000))
#> [1] 0.0005820885
Created on 2022-07-05 by the reprex package (v2.0.1)
Might be easier to just do check_htest() instead?
I think it's best to stick with check_model, most novice users using h-tests probably don't actually know they are h-tests (arguably, they also don't realize that they are "models" but still, I think it's fair to assume that our beast function check_model would work for h-tests instead of a bespoke function)
Yes, please stick with check_model if possible! The way I'm trying to advertise the easyverse to my colleagues is by explaining that everything is as simple as it can be, and that there are often good one-size-fits-all solutions like check_model. Using the same function to check all models makes a lot more sense in this context than having to change functions based on the model type... (hell, I had never heard of h-tests before reading this issue!). Thanks! 🙏
What should be the name of the function that checks if there are sufficient Ns for the Chi-sq tests? check_n, or perhaps this too should be check_normality (as this assumption is needed for the multinomial sampling distribution to be approximately normal)?
It makes sense in check_normality I think
I like the parsimonious approach of covering this under check_normality().
Agreed
I went with the Hotelling and Solomons test of symmetry, testing the standardized nonparametric skew, which has a closed-form significance test.
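For reference, the statistic itself is simple; only the skew measure is sketched below (the closed-form test supplies its asymptotic null variance, and this is not the merged implementation):

```r
# Standardized nonparametric skew: (mean - median) / sd.
# Bounded by 1 in absolute value (Hotelling-Solomons inequality),
# and 0 for symmetric data.
nonparametric_skew <- function(x) {
  (mean(x) - median(x)) / sd(x)
}
```

E.g. `nonparametric_skew(rexp(1000))` is clearly positive, while `nonparametric_skew(rnorm(1000))` hovers near 0.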