kclass() issues with NAs in the data frame
I was trying to use the kclass() function with a data frame in which one of my variables has a couple of NAs, and I got some odd errors. I'm quite new to R, so I did my best to create an example that reproduces the error; it is posted below. The function works fine when there are no NAs in the data, but as soon as I put some into the df it fails, which at least supports my assumption that the missing values are the cause:
library(magrittr)
library(RcompAngrist)
a <- runif(10, 5, 90)
b <- runif(10, 4, 10)
c <- runif(10, 0, 1)
d <- runif(10, 5, 65)
e <- runif(10, 1, 2)
f <- runif(10, 1, 100)
g <- runif(10, 80, 90)
h <- c(1,12,3,5,NA,16,17,NA,9,10)
df <- data.frame(a,b,c,d,e,f,g,h)
dummy <- kclass(a ~ b + c + d | d + e + f + g + h,
                model = TRUE,
                data = df)
So now I get this error message:
"Error in cbind(x_exo, z, x_endo, y) : number of rows of matrices must match (see arg 2)"
All right, fair enough. I thought that if I just omitted the NAs it should work again, although I did wonder why na.action has no effect: it is set to na.omit, and I also tried passing it to the function directly, which didn't change the behavior. So put this line of code in right before you re-estimate the model: df <- data.frame(a,b,c,d,e,f,g,h) %>% na.omit()
which results in this error message:
"Error in R_Z[c(n_G, n_y), c(n_G, n_y)] : subscript out of bounds"
Now I'm completely lost, and it gets even weirder: if you omit "data=df" from the call and then rerun the model, the error message switches back to "Error in cbind [...]". Does anyone have any ideas on how to fix this? In my opinion it should just fit the model and omit the rows with missing values.
Hey @MatthieuStigler, the error is reproducible with this code:
df <- data.frame(
  a = runif(10, 5, 90),
  b = runif(10, 4, 10),
  c = runif(10, 0, 1),
  d = runif(10, 5, 65),
  e = runif(10, 1, 2),
  f = runif(10, 1, 100),
  g = runif(10, 80, 90),
  h = c(1, 12, 3, 5, NA, 16, 17, NA, 9, 10)
)
RcompAngrist::kclass(a ~ b + c + d | d + e + f + g + h,
                     data = na.omit(df))
#> Error in R_Z[c(n_G, n_y), c(n_G, n_y)]: subscript out of bounds
The error seems to happen at kclass_fit.R:164, since R_34 was created with rows n_G and column n_y. It seems line 164 should be:
R_34_34 <- R_Z[n_G, c(n_G, n_y)]
but then I get errors a few lines down with backsolve.
Hi
There are actually two types of error occurring here. The first is due to the NAs, which kclass() does not handle. I should add a check/warning for this case, asking the user to remove them manually.
The second issue is that once you remove the NAs, your dataset is so small (you have exactly as many observations as exo/endo variables) that the algorithm indeed breaks. Try adding one more observation and it should work. I hope that in a real case you do not have so few observations? Thanks!
Matthieu
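The size problem described above can be checked before fitting, using base R only. The following is just a sketch: it rebuilds the example data (with a seed, which the original post did not set) and compares the number of complete rows with the number of distinct variable names in the formula. It does not call kclass(), and how kclass() counts variables internally is not shown here.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible

df <- data.frame(
  a = runif(10, 5, 90),
  b = runif(10, 4, 10),
  c = runif(10, 0, 1),
  d = runif(10, 5, 65),
  e = runif(10, 1, 2),
  f = runif(10, 1, 100),
  g = runif(10, 80, 90),
  h = c(1, 12, 3, 5, NA, 16, 17, NA, 9, 10)
)

fml <- a ~ b + c + d | d + e + f + g + h

# all.vars() returns each variable name in the formula once
vars   <- all.vars(fml)
n_vars <- length(vars)

# rows with no NA in any of the variables the formula uses
n_complete <- sum(complete.cases(df[, vars, drop = FALSE]))

cat("complete rows:", n_complete, "| variables in formula:", n_vars, "\n")
```

In this example both counts come out equal, which is consistent with the explanation that no spare observations are left after dropping the NAs.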
Hi Matthieu,
Thanks for the reply. I tried adding na.action = na.omit as an argument, but do I actually have to create a subset of my df?
OK, so the second issue is not really an issue, because I created it with my example. I have many more observations, so that's fine.
Thanks for the clarifications.
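Since kclass() does not handle NAs itself (as stated in the reply above), one way to create the subset by hand is to keep only the rows that are complete for the variables the formula actually uses, so that NAs in unrelated columns do not cost observations. A minimal sketch in base R follows; keep_complete() is a hypothetical helper name, not part of RcompAngrist, the toy data are made up for illustration, and the final kclass() call is left commented out because it is untested here.

```r
# Hypothetical helper (not part of RcompAngrist): keep rows that have no NA
# in any variable named in the formula.
keep_complete <- function(formula, data) {
  vars <- all.vars(formula)
  data[complete.cases(data[, vars, drop = FALSE]), , drop = FALSE]
}

# Toy data, made up for illustration: `h` has an NA in row 2,
# and `unused` (not in the formula) has an NA in row 1.
df_toy <- data.frame(
  a      = c(1.2, 3.4, 5.6, 7.8, 9.0),
  b      = c(2, 4, 6, 8, 10),
  h      = c(1, NA, 3, 4, 5),
  unused = c(NA, 1, 1, 1, 1)
)

fml_toy <- a ~ b | h
df_cc   <- keep_complete(fml_toy, df_toy)

nrow(df_cc)            # only row 2 is dropped
nrow(na.omit(df_toy))  # na.omit() on the whole frame also drops row 1

# Assumed usage with the cleaned data (not run here):
# fit <- RcompAngrist::kclass(fml_toy, data = df_cc)
```

Calling na.omit() on the whole data frame, as in the posts above, is equivalent when every column appears in the formula; the helper only matters when the data frame carries extra columns.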