performance icon indicating copy to clipboard operation
performance copied to clipboard

Alternative R2?

Open mattansb opened this issue 2 years ago • 4 comments

Problem

R2 is affected by the absence of an intercept:

data <- mtcars |> 
  within({
    M = model.matrix(~1 + hp)
  })

m1 <- lm(mpg ~ 1 + hp, data = data)
m2 <- lm(mpg ~ 0 + M, data = data) # same model?

Same parameters…

parameters::compare_parameters(m1, m2)
#> Warning in mapply(function(.d, .l) {: longer argument not a multiple of length
#> of shorter
#> Warning: Some model terms could not be found in model data.
#>   You probably need to load the data into the environment.
#> Parameter    |                   m1 |                   m2
#> ----------------------------------------------------------
#> (Intercept)  | 30.10 (26.76, 33.44) |                     
#> hp           | -0.07 (-0.09, -0.05) |                     
#> M(Intercept) |                      | 30.10 (26.76, 33.44)
#> Mhp          |                      | -0.07 (-0.09, -0.05)
#> ----------------------------------------------------------
#> Observations |                   32 |                   32

Same dfs…

df.residual(m1)
#> [1] 30
df.residual(m2)
#> [1] 30

Same predictions…

all.equal(predict(m1), predict(m2))
#> [1] TRUE

But…

performance::r2(m1)
#> # R2 for Linear Regression
#>        R2: 0.602
#>   adj. R2: 0.589
performance::r2(m2)
#> # R2 for Linear Regression
#>        R2: 0.968
#>   adj. R2: 0.966

This isn’t an issue in performace, but R seems to give different values based on whether or not there is an intercept in the model

summary(m1)$r.squared
#> [1] 0.6024373
summary(m2)$r.squared
#> [1] 0.9681196

Created on 2022-10-14 by the reprex package (v2.0.1)

Solution

Perhaps we can add an r2_prediction which computes the R2 values based on the prediction?

Either as: $R^2 = r_{\hat{y},y}^2$

Or (my preferred method) as $R^2 = 1 - \frac{Var(y - \hat{y})}{Var(y)}$

This will allow for correct R2 for transformed outcomes (using smearing), for non-linear models, etc...

mattansb avatar Oct 14 '22 17:10 mattansb

Can you check what r2.default() returns?

strengejacke avatar Oct 14 '22 18:10 strengejacke

R is (correctly) adjusting its computation of the predicted sum of squares based on whether there's an intercept or not.

We shouldn't expect these two results to be the same because your model has an intercept in effect even though you told R it didn't.


r <- z$residuals
f <- z$fitted.values

mss <- if (attr(z$terms, "intercept"))
            sum((f - mean(f))^2) else sum(f^2)
        rss <- sum(r^2)

if (p != attr(z$terms, "intercept")) {
	df.int <- if (attr(z$terms, "intercept")) 1L else 0L
	ans$r.squared <- mss/(mss + rss)
	ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int)/rdf)
	ans$fstatistic <- c(value = (mss/(p - df.int))/resvar,
			    numdf = p - df.int, dendf = rdf)
    } else ans$r.squared <- ans$adj.r.squared <- 0

bwiernik avatar Oct 14 '22 22:10 bwiernik

What is the justification for this formula @bwiernik ? I haven't come across it before...

mattansb avatar Oct 16 '22 11:10 mattansb

See this great StackExchange answer https://stats.stackexchange.com/a/26205/364001

bwiernik avatar Oct 16 '22 11:10 bwiernik