
Original data set cannot be safely accessed when the estimation is performed in a loop

Open lrberge opened this issue 3 years ago • 7 comments

Example

library(fixest)

base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
all_species = unique(base$species)
res_all = list()
for(i in 1:3){
    s = all_species[i]
    res_all[[i]] = feols(y ~ x1 + x2, base[base$species == s, ])
}

# Have a look at the mean of the dependent variable:
etable(res_all, fitstat = "my")
#>                            model 1            model 2            model 3
#> Dependent Var.:                  y                  y                  y
#>                                                                         
#> (Intercept)      2.304*** (0.3853)  2.116*** (0.4943)    0.6248 (0.5249)
#> x1              0.6674*** (0.0904)    0.2476 (0.1868)   0.2600. (0.1533)
#> x2                 0.2834 (0.1972) 0.7356*** (0.1248) 0.9348*** (0.0896)
#> _______________ __________________ __________________ __________________
#> S.E. type                      IID                IID                IID
#> Dep. Var. mean              6.5880             6.5880             6.5880

What happens

In fixest, many computations are delayed: they take place after the estimation, and only at the user's request. This is the case for standard errors (when clustering with respect to a variable not used in the estimation) and for several fit statistics.

For this delayed computation to work properly, the original data, i.e. the data used for the estimation, needs to be accessed. One solution would be to store the original data in the estimation object: this would be safe, but would come at an exorbitant memory cost. Instead, the data is currently accessed by re-evaluating the same data call as in the estimation, which is simply a look-up of a data set already in memory. In the example above, base[base$species == s, ] is evaluated to retrieve the data.

In general, this is fine. The problem arises in loops. When information has to be computed after the loop, like the mean of the dependent variable in the example, the data for all three models is retrieved by evaluating base[base$species == s, ]. Since s now holds its last value, the same data set is erroneously used for every model, hence the wrong results.
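
To make this concrete, here is a minimal sketch continuing the example above (it relies on the estimation call being stored in the object's call element, which later comments in this thread also use): after the loop, s holds its last value, so the stored data call re-evaluates to the same, last, subset for every model.

# `s` now refers to the last species, so the data call stored in each model
# re-evaluates to the same subset, whichever model we pick:
mean(eval(res_all[[1]]$call$data)$y)
#> [1] 6.588
mean(eval(res_all[[3]]$call$data)$y)
#> [1] 6.588

This is why etable reports the same dependent variable mean for all three models.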

Solutions

VCOV

This problem affects the VCOV if clustering (or any other VCOV relying on an extra variable) has to be performed ex post. The solution is then to use the argument vcov at estimation time.

Note that you will not be able to navigate through different VCOVs (other than standard and heteroskedasticity-robust) ex post. If needed, store the object with the different VCOVs within the loop, while the right data is still accessible, as sketched below.
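
For instance, here is a minimal sketch of both workarounds, with x3 only standing in for whatever cluster variable you actually need:

res_all = list()
for(i in 1:3){
    s = all_species[i]

    # 1) set the clustered VCOV at estimation time...
    res_all[[i]] = feols(y ~ x1 + x2, base[base$species == s, ], vcov = ~x3)

    # 2) ...or, if several VCOVs are needed, compute them now,
    #    while the right data is still in memory:
    # res_all[[i]] = summary(feols(y ~ x1 + x2, base[base$species == s, ]),
    #                        vcov = ~x3)
}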

fit statistics

Currently there is no way to store the fit statistics at estimation time. There is a major overhaul of the fit statistics mechanism under way. Once the new fit-stats are implemented, that will be possible and easy (basically just using fitstat = TRUE will do).

In the short run, if you want to use data-dependent fit-stats ex post, you need to do it manually. FWIW, here's an example:

# A) create a custom fit stat
my_meandep = function(x){
    # x: fixest object
    if("mean_dep" %in% names(x)) x$mean_dep else NA
}

# B) register it
fitstat_register("meandep", my_meandep, "Mean of the Dep. Var.")

# C) within the loop, save the requested information 

base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
all_species = unique(base$species)
res_all = list()
for(i in 1:3){
    s = all_species[i]
    
    # C') => we create the slot mean_dep
    mod = feols(y ~ x1 + x2, base[base$species == s, ])
    mod$mean_dep = fitstat(mod, "my", simplify = TRUE)
    res_all[[i]] = mod
}

# D) use it
etable(res_all, fitstat = ~my + meandep)
#>                                  model 1            model 2            model 3
#> Dependent Var.:                        y                  y                  y
#>                                                                               
#> (Intercept)            2.304*** (0.3853)  2.116*** (0.4943)    0.6248 (0.5249)
#> x1                    0.6674*** (0.0904)    0.2476 (0.1868)   0.2600. (0.1533)
#> x2                       0.2834 (0.1972) 0.7356*** (0.1248) 0.9348*** (0.0896)
#> _____________________ __________________ __________________ __________________
#> S.E. type                            IID                IID                IID
#> Dep. Var. mean                    6.5880             6.5880             6.5880
#> Mean of the Dep. Var.             5.0060             5.9360             6.5880

lrberge avatar Jul 21 '22 08:07 lrberge

Just wanted to +1 this issue. marginaleffects must rely on some ugly tricks to get the dataset, by evaluating its name in the global environment. This is unsafe and will lead to bad results if the user re-assigns the data object.

It would be awesome if we could store the original dataset in fixest objects for easy retrieval.

As always, thanks for your great work on this!

vincentarelbundock avatar Jul 23 '22 23:07 vincentarelbundock

I see now that I misread your post and that you do not plan to attach the original data.

FWIW, here’s an example of the problematic situation I had in mind. The first model should produce a marginal effects data frame with 32 rows because there are 32 observations. Yet, because we reassign an object in the global environment, trying to access the feols data produces incorrect results.

Of course, this is dangerous behavior by the user, but reassigning data frames is super common practice…

library(fixest)
library(marginaleffects)

dat <- mtcars
mod1 <- feols(hp ~ mpg | cyl, data = dat)

dat <- dat[1:20,]
mod2 <- feols(hp ~ mpg | cyl, data = dat)

marginaleffects(mod1) |> nrow()
#> [1] 20

marginaleffects(mod2) |> nrow()
#> [1] 20

vincentarelbundock avatar Jul 24 '22 02:07 vincentarelbundock

Chiming in with my own examination of the problem. feols() stores a call_env component in the output, which is meant to hold the calling environment. However, neither this environment nor the environment of the formula in the fml component actually refers to the environment where the function call or the formula was evaluated. It is impossible to retrieve the actual environment where feols() was called.

# Evaluating feols() in its own environment
f1 <- function() {
  d1 <- data.frame(y = rnorm(10),
                   x = rnorm(10))
  
  fixest::feols(y ~ x, data = d1)
}
fit1 <- f1()

# d1 is nowhere to be found
list(call_env = ls(fit1$call_env))
#> $call_env
#> character(0)
list(fml = ls(environment(fit1$fml)))
#> $fml
#> [1] "lhs" "res" "rhs"

# Evaluating lm() in its own environment
f2 <- function() {
  d2 <- data.frame(y = rnorm(10),
                   x = rnorm(10))
  
  lm(y ~ x, data = d2)
}
fit2 <- f2()

# d2 found easily in formula environment
list(terms = ls(environment(fit2$terms)))
#> $terms
#> [1] "d2"

Created on 2022-09-01 with reprex v2.0.2

This is problematic when using feols() inside testthat(), where variable scoping gets really weird. If I try to use marginaleffects or insight::get_data(), the data cannot be found because no information about the calling environment of feols() is actually stored in the output object. That is, even if I am willing to use the name of the original dataset and I haven't touched that dataset, I cannot access it if it is not in the global environment (which is the case inside simulations or testthat blocks).

I encourage you to consider at least providing an option for users to store the original dataset in the output object, perhaps with the lean option allowing users with huge datasets to turn this off. A persistent version of the original dataset must exist somewhere; whether it sits in a new environment created by feols() or in the output object itself, it takes up the same memory either way, so you may simply have to accept that cost.
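
For what it's worth, here is a rough sketch of the scoping situation I mean (the test itself is made up for illustration): the data exists only inside the test_that() block, so it is invisible from the global environment, which is where routines that re-evaluate the data call by name end up looking.

library(fixest)
library(testthat)

test_that("the estimation data is out of reach ex post", {
  d <- data.frame(y = rnorm(10), x = rnorm(10))
  mod <- feols(y ~ x, data = d)

  # `d` is visible here, inside the test block...
  expect_true(exists("d"))
  # ...but not from the global environment
  expect_false(exists("d", envir = globalenv(), inherits = FALSE))
})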

ngreifer avatar Sep 01 '22 19:09 ngreifer

Don't have the time to reply sorry, but just to say that I didn't write insight::get_data().

The value call_env is internal cuisine, and it works fine. It is used internally in the following unexported function:

fixest:::fetch_data(fit1)
#>              y          x
#> 1  -0.86791309 -0.6534246
#> 2   0.09164135 -1.0960332
#> 3  -0.45866784 -0.9918161
#> 4  -1.12278344 -0.7937952
#> 5   0.11677956  0.4422500
#> 6  -0.45137917  1.1813671
#> 7   1.07984048  0.5867055
#> 8  -0.13633684 -2.0547528
#> 9   0.20415421 -1.7486942
#> 10  0.31391199 -2.0428270

I didn't save the environment directly but a child of the calling environment, to allow the environment to be garbage collected if needed (though I'm not sure that happens in the end). To clarify:

list(call_env = ls(parent.env(fit1$call_env)))
#> $call_env
#> [1] "d1"

Or stated differently:

eval(fit1$call$data, fit1$call_env)
#>              y          x
#> 1  -0.86791309 -0.6534246
#> 2   0.09164135 -1.0960332
#> 3  -0.45866784 -0.9918161
#> 4  -1.12278344 -0.7937952
#> 5   0.11677956  0.4422500
#> 6  -0.45137917  1.1813671
#> 7   1.07984048  0.5867055
#> 8  -0.13633684 -2.0547528
#> 9   0.20415421 -1.7486942
#> 10  0.31391199 -2.0428270

An option to save the data makes sense in some circumstances and will be added. Please just wait for my teaching semester to end.

lrberge avatar Sep 02 '22 13:09 lrberge

This is very helpful, thank you!

ngreifer avatar Sep 02 '22 15:09 ngreifer

Hi, with a huge time lag (sorry Vincent...), here is the new argument data.save:

base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
est = feols(y ~ x1, base, data.save = TRUE)
rm(base)

#
# the code below would not have worked with data.save = FALSE
#

vcov(est, ~species)
#>             (Intercept)         x1
#> (Intercept)   0.8823064 -0.3670791
#> x1           -0.3670791  0.1654361

update(est, .~.+x2)
#> OLS estimation, Dep. Var.: y
#> Observations: 150
#> Standard-errors: IID
#>             Estimate Std. Error  t value   Pr(>|t|)
#> (Intercept) 2.249140   0.247970  9.07022 7.0385e-16 ***
#> x1          0.595525   0.069328  8.58994 1.1633e-14 ***
#> x2          0.471920   0.017118 27.56916  < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.329937   Adj. R2: 0.838003

BTW, I have now exposed the function fixest_data() to get the data set used for the estimation (following Kyle's suggestion in #465), and it accounts for this new mechanism. Once this version of fixest is on CRAN, I may write a PR to insight::get_data() to account for it.
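
For reference, a sketch of the intended usage (assuming the exported function simply takes the estimation object; not necessarily the final interface):

# retrieve the data used for the estimation directly from the fitted object
head(fixest_data(est))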

lrberge avatar Feb 13 '24 09:02 lrberge

Oh, this looks fantastic!

Feel free to ping me on the insight repo when it's out on CRAN. I still have commit rights there and am happy to help with implementation.

vincentarelbundock avatar Feb 13 '24 11:02 vincentarelbundock