fixest
Issue with Handling Special Characters in e.g. Polynomial Expressions in feglm
Hello and first off, thank you for developing this fantastic package! It's been incredibly useful.
That said, it seems to have a problem when variable names containing non-ASCII characters (e.g. from other languages) are combined with polynomial expressions of covariates. The fixest::feglm function appears to misinterpret the formula, leading to an error.
Small example:
data <- data.frame(
y = rpois(1000, 1),
gender = sample(c(0,1), 1000, replace = TRUE),
Løn = sample(seq(1e5,1e6,1e3), 1000, replace = TRUE), # Danish for salary
salary = sample(seq(1e5,1e6,1e3), 1000, replace = TRUE)
)
fixest::feglm(
y~ gender + Løn,
data
)
###
### this works fine and yields
###
#> GLM estimation, family = gaussian, Dep. Var.: y
#> Observations: 1,000
#> Standard-errors: IID
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.956695917 0.080613179 11.867736 < 2.2e-16 ***
#> gender 0.023749757 0.063454728 0.374279 0.70828
#> Løn 0.000000033 0.000000121 0.273088 0.78484
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Log-Likelihood: -1,419.2 Adj. Pseudo R2: -0.001335
#> BIC: 2,859.2 Squared Cor.: 2.105e-4
###
### Furthermore, this also works
###
fixest::feglm(
y~ gender + salary^2,
data
)
###
### However, combining the two, i.e.
###
fixest::feglm(
y~ gender + Løn^2,
data
)
###
### yields the following error:
###
# Error in fixest::feglm(y ~ gender + Løn^2, data) :
# Evaluation of the right-hand-side of the formula raises an error:
# In LøI(n^2): could not find function "LøI"
###
### The error seems to arise from the internal function fixest_fml_rewriter, which rewrites the formula expression in an unwanted way:
###
fixest:::fixest_fml_rewriter(as.formula(y~ gender + Løn^2))
# $fml
# y ~ gender + LøI(n^2)
# <environment: 0x5621a72d5cd8>
#
# $isPanel
# [1] FALSE
Hi, and glad you find the software useful!
Hmmm, it works on my machine:
fixest:::fixest_fml_rewriter(as.formula(y~ gender + Løn^2))
$fml
y ~ gender + I(Løn^2)
<environment: 0x000001c6437a03e8>
$isPanel
[1] FALSE
The current rewriting of "x^2" into "I(x^2)" uses a lot of regular expressions. In particular, I use "[[:alnum:]]" to catch letters and deduce variables' names.
Can you replicate the following result?
gsub("[[:alnum:]]", "_", "Løn^2")
[1] "___^_"
If not, it seems that the interpretation of characters differs between your machine and mine. Possible solutions:
- update the version of R?
- change the encoding of your file to UTF8?
In any case, explicitly writing "I(Løn^2)" should work (and this is the native R way to do it).
It seems that gsub does produce the same result:
gsub("[[:alnum:]]", "_", "Løn^2")
[1] "___^_"
However, explicitly writing "I(Løn^2)" produces an even weirder result:
fixest:::fixest_fml_rewriter(as.formula(y~ gender + I(Løn^2)))
$fml
y ~ gender + I(LøI(n^2))
<environment: 0x56247917cad0>
$isPanel
[1] FALSE
The problem seems to arise from the following steps:
fml_text = fixest:::deparse_long(as.formula(y~ gender + Løn^2))
fml_text
[1] "y ~ gender + Løn^2"
no_lhs_text = gsub("^[^~]+~", "", fml_text)
no_lhs_text
[1] " gender + Løn^2"
no_lhs_text = gsub("(?<!I\\()(\\b(\\.[[:alpha:]]|[[:alpha:]])[[:alnum:]\\._]*\\^[[:digit:]]+)", "I(\\1)",
no_lhs_text, perl = TRUE)
no_lhs_text
[1] " gender + LøI(n^2)"
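One possible explanation, offered as a guess rather than a confirmed diagnosis: the earlier check with "[[:alnum:]]" used R's default regex engine, while the rewriting step above uses perl = TRUE. On UTF-8 input, PCRE by default treats only ASCII characters as word or alphabetic characters, so "\\b" and "[[:alpha:]]" may see a word boundary between "ø" and "n". The sketch below reruns the same pattern with and without the "(*UCP)" prefix, which the R regex help page describes as the way to turn on Unicode properties. The variable names are only for illustration.

```r
# Same pattern as in the steps above, stored once so both calls share it
pattern <- "(?<!I\\()(\\b(\\.[[:alpha:]]|[[:alpha:]])[[:alnum:]\\._]*\\^[[:digit:]]+)"

# The non-ASCII name is written with an escape so the snippet does not
# depend on the encoding of the source file
rhs_text <- " gender + L\u00f8n^2"

# As in the steps above: PCRE without Unicode properties
# (the result may depend on platform and locale)
res_default <- gsub(pattern, "I(\\1)", rhs_text, perl = TRUE)

# Same call, but asking PCRE to use Unicode properties for \b and POSIX classes
res_ucp <- gsub(paste0("(*UCP)", pattern), "I(\\1)", rhs_text, perl = TRUE)

# ASCII-only names are unaffected by the difference
res_ascii <- gsub(pattern, "I(\\1)", " gender + salary^2", perl = TRUE)

print(res_default)
print(res_ucp)
print(res_ascii)
```

If res_default and res_ucp differ on the affected machine, that would point to the perl = TRUE character classes rather than to the file encoding or the R version.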
Session info:
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS: /opt/R/4.2.1/lib64/R/lib/libRblas.so
LAPACK: /opt/R/4.2.1/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
character(0)
other attached packages:
[1] fixest_0.11.2
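For what it's worth, here is a rough sketch of a rewriting that does not depend on how the regex engine classifies characters: it walks the parsed formula instead of its text. This is not how fixest does it; wrap_powers is a hypothetical helper written only for this illustration, and it handles only the simple case of a bare variable name raised to a numeric power.

```r
# Hypothetical helper, not part of fixest: wrap each `name^number` call in I()
# by walking the parsed expression rather than its deparsed text.
wrap_powers <- function(expr) {
  # Names and constants are returned unchanged
  if (!is.call(expr)) return(expr)
  # Do not descend into terms that are already protected by I()
  if (identical(expr[[1]], as.name("I"))) return(expr)
  # name^number becomes I(name^number)
  if (identical(expr[[1]], as.name("^")) &&
      is.name(expr[[2]]) && is.numeric(expr[[3]])) {
    return(call("I", expr))
  }
  # Otherwise, rewrite each argument of the call in turn
  for (i in seq_along(expr)[-1]) {
    expr[[i]] <- wrap_powers(expr[[i]])
  }
  expr
}

# Build the right-hand side with the non-ASCII name written as an escape,
# so the snippet does not depend on the encoding of the source file.
salary_dk <- as.name("L\u00f8n")
rhs <- bquote(gender + .(salary_dk)^2)

print(wrap_powers(rhs))
print(wrap_powers(quote(gender + salary^2)))
```

Because the variable name is never matched as text here, an already wrapped term such as I(x^2) is also left as it is rather than being wrapped a second time.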