performance
performance copied to clipboard
Alternative R2?
Problem
R2 is affected by the absence of an intercept:
data <- mtcars |>
within({
M = model.matrix(~1 + hp)
})
m1 <- lm(mpg ~ 1 + hp, data = data)
m2 <- lm(mpg ~ 0 + M, data = data) # same model?
Same parameters…
parameters::compare_parameters(m1, m2)
#> Warning in mapply(function(.d, .l) {: longer argument not a multiple of length
#> of shorter
#> Warning: Some model terms could not be found in model data.
#> You probably need to load the data into the environment.
#> Parameter | m1 | m2
#> ----------------------------------------------------------
#> (Intercept) | 30.10 (26.76, 33.44) |
#> hp | -0.07 (-0.09, -0.05) |
#> M(Intercept) | | 30.10 (26.76, 33.44)
#> Mhp | | -0.07 (-0.09, -0.05)
#> ----------------------------------------------------------
#> Observations | 32 | 32
Same dfs…
df.residual(m1)
#> [1] 30
df.residual(m2)
#> [1] 30
Same predictions…
all.equal(predict(m1), predict(m2))
#> [1] TRUE
But…
performance::r2(m1)
#> # R2 for Linear Regression
#> R2: 0.602
#> adj. R2: 0.589
performance::r2(m2)
#> # R2 for Linear Regression
#> R2: 0.968
#> adj. R2: 0.966
This isn’t an issue in performace, but R seems to give different values based on whether or not there is an intercept in the model
summary(m1)$r.squared
#> [1] 0.6024373
summary(m2)$r.squared
#> [1] 0.9681196
Created on 2022-10-14 by the reprex package (v2.0.1)
Solution
Perhaps we can add an r2_prediction
which computes the R2 values based on the prediction?
Either as: $R^2 = r_{\hat{y},y}^2$
Or (my preferred method) as $R^2 = 1 - \frac{Var(y - \hat{y})}{Var(y)}$
This will allow for correct R2 for transformed outcomes (using smearing), for non-linear models, etc...
Can you check what r2.default()
returns?
R is (correctly) adjusting its computation of the predicted sum of squares based on whether there's an intercept or not.
We shouldn't expect these two results to be the same because your model has an intercept in effect even though you told R it didn't.
r <- z$residuals
f <- z$fitted.values
mss <- if (attr(z$terms, "intercept"))
sum((f - mean(f))^2) else sum(f^2)
rss <- sum(r^2)
if (p != attr(z$terms, "intercept")) {
df.int <- if (attr(z$terms, "intercept")) 1L else 0L
ans$r.squared <- mss/(mss + rss)
ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int)/rdf)
ans$fstatistic <- c(value = (mss/(p - df.int))/resvar,
numdf = p - df.int, dendf = rdf)
} else ans$r.squared <- ans$adj.r.squared <- 0
What is the justification for this formula @bwiernik ? I haven't come across it before...
See this great StackExchange answer https://stats.stackexchange.com/a/26205/364001