margins icon indicating copy to clipboard operation
margins copied to clipboard

Problem with factors that have unused levels

Open tzoltak opened this issue 5 years ago • 2 comments

margins() fails in a situation when in a model there is a factor variable with some unused level(s):

## load package
library("margins")

## code goes here
mtcars$gear = factor(mtcars$gear,
                     c(sort(unique(mtcars$gear)),
                       "level not in the data"))
table(mtcars$gear)
m <- lm(hp ~ gear, mtcars)
margins(m)

## session info for your system
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250   
[3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C                  
[5] LC_TIME=Polish_Poland.1250    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] margins_0.3.23

loaded via a namespace (and not attached):
[1] MASS_7.3-51.4     compiler_3.6.0    tools_3.6.0       data.table_1.12.2
[5] prediction_0.3.14

That's is because dydx() while calling for predictions goes through all the levels of a factor no matter they occur in a data on which model was estimated or not. But in this latter situation model obviously gives no information needed to compute a prediction.

I think that simply adding droplevels() in line 159 of a file "dydx.R" should work:

levs <- levels(droplevels(as.factor(data[[variable]])))

tzoltak avatar Jun 17 '19 20:06 tzoltak

This should be an easy fix. Thanks for reporting!

leeper avatar Dec 22 '19 09:12 leeper

I thought about this some more afterwards and I think, that the proper solution will be different to this proposed above: dxdy() should get set of levels to go through from a model (ie. data stored in a model object), not from a data (even if there are some other levels in a data, model won't tell anything about what to predict for them; of course this only makes a difference if argument data is provided to margins()), however it should also determine reference level using information provided by the at argument of margins() (if a variable under consideration in a given call to dydx() appears in at). So instead of lines 159-161 in dydx.R should be something like this:

levs <- sort(unique(find_data(model)[[variable]]))
if (variable %in% names(attributes(data)$at)) {
  base <- factor(attributes(data)$at[[variable]], levs)
} else {
  base <- levs[1L]
}
levs <- levs[levs != base]

tzoltak avatar Dec 22 '19 18:12 tzoltak