Casewise Residuals Sometimes all 0
This happens in some models but not others, but here is a reproducible example with lavaan 0.5-20:
library(semTools)
library(lavaan)
borModel <- '
NDVI ~ nTot + T61 + Wet
nTot ~ T61
'
borFit <- sem(borModel, data=boreal, meanstructure=T)
> head(residuals(borFit, "casewise"))
NDVI nTot T61 Wet
[1,] 0 0 0 0
[2,] 0 0 0 0
[3,] 0 0 0 0
[4,] 0 0 0 0
[5,] 0 0 0 0
[6,] 0 0 0 0
This is incorrect: there should be non-zero residuals for NDVI and nTot. It doesn't always happen, though, so I'm not sure what's going on here. FYI, this example worked just fine in 0.5-17.
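For comparison, a quick check with lm() on the same data: the per-equation OLS residuals (which I expected "casewise" to resemble) are non-zero.
# Per-equation OLS residuals for the two regressions in borModel
head(resid(lm(NDVI ~ nTot + T61 + Wet, data = boreal)))
head(resid(lm(nTot ~ T61, data = boreal)))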
I am afraid this is not a bug, but a (huge) misunderstanding. The residuals are based on the lavPredict() function, whose main purpose is (and always was) to compute factor scores for the latent variables. Once we have these factor scores, type = "ov" computes model-predicted values for the observed indicators of those latent variables; we can then compare these with the observed values, and that comparison is exactly what residuals(fit, "casewise") reports.
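For example, a minimal sketch of the intended latent-variable workflow, using the HolzingerSwineford1939 data that ships with lavaan:
library(lavaan)
hsFit <- cfa('visual =~ x1 + x2 + x3', data = HolzingerSwineford1939,
             meanstructure = TRUE)
head(lavPredict(hsFit))                # factor scores for 'visual'
head(lavPredict(hsFit, type = "ov"))   # model-implied values for x1-x3
head(residuals(hsFit, "casewise"))     # observed minus model-implied: non-zero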
When there are no latent variables, the 'latent' variables coincide with the observed values, and the residuals are all zero. The structural (i.e. regression) part of the model is not used, so lavPredict() cannot be used to 'predict' outcome variables the way predict() does for an lm() object. Unfortunately, there was a bug in 0.5-17 (and perhaps 0.5-18) that showed up only when there were no latent variables and exogenous 'x' covariates were used in the model, and it gave some people the impression that lavPredict() did predict future values. The bug was there simply because, at the time, I had not even considered models without latent variables.
See also: https://groups.google.com/forum/#!topic/lavaan/wXU9gLPopjs
So, lavPredict() and resid(fit, "casewise") in 0.5-20 at least behave as intended, and ignore the structural part of the model. Having said that, there is clearly a need for 'prediction-as-in-regression' capabilities. I will investigate the possibility of doing this, and perhaps add a 'structural = TRUE' argument to the lavPredict() function.
I will leave this issue open, until this is done.
Huh. I did not realize this. Indeed, I think a lot of folks are interested in prediction in the regression sense. I've figured out some hacky ways to do it (more anon), but predictions are becoming increasingly important, if only for looking at residuals to diagnose violations of assumptions. I've been using them to look at spatiotemporal autocorrelation, so I'm hoping this gets back on track!
Here's some code that should fix the problem. I'll need to figure out how to put it into the package, unless you can do it more quickly. I'll also post it on the Google group so others can use it.
library(lavaan)
predict_lavaan <- function(fit, newdata = NULL){
  stopifnot(inherits(fit, "lavaan"))

  # Make sure we can use this fit
  if(!inspect(fit, "meanstructure")) stop("Need to supply meanstructure = TRUE in fit\n")

  # Default to the data the model was fit to
  if(is.null(newdata)){
    newdata <- data.frame(inspect(fit, "data"))
    names(newdata) <- lavNames(fit)
  }

  if(length(lavNames(fit, type = "lv")) != 0) stop("Does not currently work with latent variables\n")

  # Check that all exogenous variables are in the new data
  if(sum(!(lavNames(fit, type = "ov.x") %in% names(newdata))) > 0)
    stop("Not all exogenous variables supplied!")

  # Add an intercept column, and placeholder columns for the endogenous variables
  newdata$Intercept <- 1
  newdata[lavNames(fit, "ov.nox")] <- 0

  # Pull the parameter table into a data frame
  mod_df <- data.frame(lhs = fit@ParTable$lhs,
                       op  = fit@ParTable$op,
                       rhs = fit@ParTable$rhs,
                       exo = fit@ParTable$exo,
                       est = fit@ParTable$est,
                       se  = fit@ParTable$se,
                       stringsAsFactors = FALSE)

  # Drop (co)variances, and treat intercepts as coefficients on the Intercept column
  mod_df <- mod_df[-which(mod_df$op == "~~"), ]
  mod_df[which(mod_df$op == "~1"), ]$rhs <- "Intercept"

  # Get rid of exogenous variables on the lhs
  mod_df <- mod_df[-which(mod_df$exo == 1), ]

  # Order by lhs
  mod_df <- mod_df[sort(mod_df$lhs, index.return = TRUE)$ix, ]

  # Flag rhs variables that are exogenous (they never appear on the lhs)
  mod_df$ord <- 0
  mod_df[which(!(mod_df$rhs %in% mod_df$lhs)), ]$ord <- 1

  # Make an "order": an endogenous variable can only be evaluated once
  # everything on its rhs has been evaluated
  ord_current <- 1
  while(sum(mod_df$ord == 0) > 0){
    for(r in unique(mod_df$lhs)){
      val <- sum(mod_df[which(mod_df$lhs == r), ]$ord == 0)
      if(val == 0){
        mod_df[which(mod_df$lhs == r), ]$ord <- ord_current
        if(sum(mod_df$rhs == r) > 0)
          mod_df[which(mod_df$rhs == r), ]$ord <- ord_current + 1
      }
    }
    ord_current <- ord_current + 1
  }

  # Correct for ragged ordering
  for(r in unique(mod_df$lhs)){
    mod_df[which(mod_df$lhs == r), ]$ord <- max(mod_df[which(mod_df$lhs == r), ]$ord)
  }

  # Sort by evaluation order
  mod_df <- mod_df[sort(mod_df$ord, index.return = TRUE)$ix, ]

  # Now do the fitting, one endogenous variable at a time, in order
  fit_df <- data.frame(base = rep(1, nrow(newdata)))
  for(r in unique(mod_df$lhs)){
    subdf <- subset(mod_df, mod_df$lhs == r)

    # Build a formula from the rhs terms
    rhs <- paste0(subdf$rhs, collapse = " + ")
    form <- as.formula(paste0(r, " ~ ", rhs))

    # Use the formula to get the right part of the data in the right format,
    # drop model.matrix's own intercept, and multiply by the estimates
    mod_mat <- model.matrix(form, newdata)[, -1]
    new_val <- mod_mat %*% subdf$est
    fit_df[[r]] <- new_val
    newdata[[r]] <- new_val   # so downstream equations can use this prediction
  }

  return(fit_df[, -1])
}
fitted_lavaan <- function(fit){
  predict_lavaan(fit)
}

residuals_lavaan <- function(fit){
  fitted_vals <- fitted_lavaan(fit)

  # Reconstruct the raw data the model was fit to
  rawdata <- data.frame(inspect(fit, "data"))
  names(rawdata) <- lavNames(fit)

  # Observed minus fitted, for each endogenous variable
  res <- data.frame(base = rep(1, nrow(rawdata)))
  for(vals in names(fitted_vals)){
    res[[vals]] <- rawdata[[vals]] - fitted_vals[[vals]]
  }

  return(res[, -1])
}
############################################
########### EXAMPLE CODE
############################################
#
# iris_mod <- "
# Sepal.Width ~ Sepal.Length
# Petal.Width ~ Sepal.Width + Petal.Length
#
# "
#
# iris_fit <- sem(iris_mod, data=iris, meanstructure = TRUE)
#
# fitted_lavaan(iris_fit)
#
# res <- residuals_lavaan(iris_fit)
#
# par(mfrow=c(1,2))
# for(i in 1:2) {
# qqnorm(res[,i], main=names(res)[i])
# qqline(res[,i])
# }
# par(mfrow=c(1,1))
#
# newdata <- data.frame(Sepal.Length=1:10, Petal.Length=11:20)
# pred <- predict_lavaan(iris_fit, newdata=newdata)
#
# plot(Petal.Width ~ Sepal.Length, data=cbind(newdata, pred))
#
# library(mvnormtest)
# mshapiro.test(t(res))
Dear @jebyrnes,
I tried this function and got an error in the # Make an "order" section:
# Make an "order": an endogenous variable can only be evaluated once
# everything on its rhs has been evaluated
ord_current <- 1
while(sum(mod_df$ord == 0) > 0){
  for(r in unique(mod_df$lhs)){
    val <- sum(mod_df[which(mod_df$lhs == r), ]$ord == 0)
    if(val == 0){
      mod_df[which(mod_df$lhs == r), ]$ord <- ord_current
      if(sum(mod_df$rhs == r) > 0)
        mod_df[which(mod_df$rhs == r), ]$ord <- ord_current + 1
    }
  }
  ord_current <- ord_current + 1
}
Error in `$<-.data.frame`(`*tmp*`, "ord", value = 3) :
replacement has 1 row, data has 0
This is because val does not consider that you can have more than one zero for each lhs, I think.
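The error itself comes from assigning into a zero-row subset of a data frame; a toy example (hypothetical data, just to show the mechanism):
df <- data.frame(x = 1:3, ord = 0)
df[which(df$x == 99), ]$ord <- 3   # which() is empty, so the subset has 0 rows
## Error in `$<-.data.frame`(`*tmp*`, "ord", value = 3) :
##   replacement has 1 row, data has 0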
I would try to write alternative code, but I don't completely understand your idea behind the ordering, which I take to be the order in which the regression formulas are evaluated.
If it helps, this is what mod_df looks like for me up to the previous step:
lhs op rhs exo est se ord
24 CEC.A ~ OC.A 0 1.375677e-01 0.02712223 0
25 CEC.A ~ clay.A 0 8.451412e-01 0.02871379 0
85 CEC.A ~1 Intercept 0 5.022858e-07 0.03176033 1
26 CEC.B ~ clay.B 0 8.192808e-01 0.03126524 0
86 CEC.B ~1 Intercept 0 2.128903e-07 0.04375848 1
27 CEC.C ~ clay.C 0 7.253906e-01 0.04625144 0
87 CEC.C ~1 Intercept 0 -2.574665e-07 0.05893352 1
4 clay.A ~ clay.C 0 6.071796e-01 0.06369167 0
5 clay.A ~ evisd 0 1.497936e-01 0.08548429 1
6 clay.A ~ lstm 0 -9.024666e-02 0.08517614 1
7 clay.A ~ ndwi.b 0 1.599675e-01 0.06871731 1
80 clay.A ~1 Intercept 0 4.418204e-16 0.06025266 1
8 clay.B ~ clay.A 0 3.948407e-01 0.06084509 0
9 clay.B ~ clay.C 0 4.948342e-01 0.06206997 0
10 clay.B ~ vdchn 0 -1.150797e-01 0.05308073 1
11 clay.B ~ twi 0 -2.696901e-02 0.05289084 1
12 clay.B ~ ndwi.b 0 -3.317706e-02 0.04815650 1
81 clay.B ~1 Intercept 0 1.955362e-16 0.04710777 1
1 clay.C ~ dem 0 2.045177e-01 0.30687081 1
2 clay.C ~ vdchn 0 -2.711864e-02 0.07350381 1
3 clay.C ~ X 0 6.265283e-01 0.30785025 1
79 clay.C ~1 Intercept 0 4.015633e-18 0.07211773 1
13 OC.A ~ clay.A 0 3.329283e-01 0.07697578 0
14 OC.A ~ evisd 0 -2.335329e-01 0.10396378 1
15 OC.A ~ lstm 0 -2.016699e-01 0.10219150 1
16 OC.A ~ ndwi.b 0 -1.860915e-01 0.08437564 1
82 OC.A ~1 Intercept 0 5.191650e-16 0.07287413 1
17 OC.B ~ OC.A 0 3.918362e-01 0.06900830 0
18 OC.B ~ clay.B 0 2.442477e-01 0.06948612 0
19 OC.B ~ evisd 0 -2.452562e-01 0.14668759 1
20 OC.B ~ lstm 0 2.803228e-01 0.10015181 1
21 OC.B ~ ndwi.a 0 1.509365e-01 0.15520116 1
22 OC.B ~ vdchn 0 -8.339311e-02 0.06815678 1
83 OC.B ~1 Intercept 0 -3.091703e-15 0.06520413 1
23 OC.C ~ OC.B 0 7.822581e-01 0.04819457 0
84 OC.C ~1 Intercept 0 1.426334e-07 0.04803268 1
For out-of-sample predictions (in the regression sense), we now (0.6-13) have the lavPredictY() function.
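For example, applied to the iris model from earlier in this thread (a minimal sketch; ynames/xnames select which observed variables to predict and which to condition on, and depending on the lavaan version newdata may need columns for all model variables):
library(lavaan)
iris_mod <- "
  Sepal.Width ~ Sepal.Length
  Petal.Width ~ Sepal.Width + Petal.Length
"
iris_fit <- sem(iris_mod, data = iris, meanstructure = TRUE)

newdata <- data.frame(Sepal.Length = 1:10, Petal.Length = 11:20,
                      Sepal.Width = 0, Petal.Width = 0)  # y columns as placeholders
pred <- lavPredictY(iris_fit, newdata = newdata,
                    ynames = c("Sepal.Width", "Petal.Width"),
                    xnames = c("Sepal.Length", "Petal.Length"))
head(pred)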
Hi,
does lavPredictY() work with binary outcomes (i.e., ordered = c("y-variable"))? I am not able to use the function to calculate predicted probabilities the way I would with probit or logit models using base R's predict() function.