
Unable to replicate did output using staggered

Open fschoner opened this issue 2 years ago • 9 comments

I get an error when trying to replicate results obtained from the did package using the staggered_cs command. More specifically,

sg_attgt <- att_gt(
     yname = outcomes[[1]], gname = "first_treat", idname = "pid",
     tname = "Enrolment", control_group = "notyettreated", panel = FALSE,
     xformla = ~ 1, data = df_reg, est_method = "dr", 
     bstrap = TRUE, cband = TRUE
  )

runs through smoothly whereas

staggered_cs(
  df_reg,
  i = "pid",
  t = "Enrolment",
  g = "first_treat",
  y = outcomes[[1]],
  estimand = "simple"
)

throws the following error.

Error in `dplyr::filter()`:
! Problem while computing `..1 = t >= g`.
Caused by error in `t >= g`:
! comparison (>=) is possible only for atomic and list types

Any ideas?

Unfortunately, I cannot share the data. I can try to provide a reproducible example next week if that's helpful. Let me know please.

Have a nice weekend!

fschoner avatar Nov 25 '22 16:11 fschoner

Hi,

Thanks for the message, and sorry you're having issues.

Is the panel you're using unbalanced? If so, does it work after balancing the panel?

I could potentially help more if there's a working example that you can share. Based on the error message that you reported, my guess is that this has something to do with the combinations of (t,g) pairs in the data, and not the outcome. If you could share a modified version of the data with the same (t,g) combinations but made-up outcomes and ID variables, that would be helpful.

Best, Jon

Issue link: https://github.com/jonathandroth/staggered/issues/13

jonathandroth avatar Nov 28 '22 21:11 jonathandroth

Hi Jon,

thanks for getting back, really appreciated. I'm actually using a repeated cross-section (which can easily be turned into a panel by aggregating).

It was indeed unbalanced, but I received exactly the same error after balancing it (by just omitting the time periods where I lacked observations for some of the states).

I simulated data and was able to replicate the very same behavior: att_gt is fine, but staggered_cs throws the error printed above. Please find the code below.

library(data.table)
library(staggered)
library(did)

n <- 1000
df_staggered <- data.table(
  pid = 1:n,
  state = sample(c(1:16), replace = TRUE, size = n),
  enrolment = sample(c(1992:2009), replace = TRUE, size = n),
  outcome = rnorm(n)
)
# Create group variable.
df_staggered[
  , 
  first_treat := fcase(
    state == 1, 2007,
    state %in% c(4, 10), 2001,
    state == 7, 2003
  ) 
]
# Run staggered estimation 
staggered_cs(
  df_staggered,
  i = "pid",
  t = "enrolment",
  g = "first_treat",
  y = "outcome",
  estimand = "simple"
)
# Run C/S estimation
sg_attgt <- att_gt(
  yname = "outcome", gname = "first_treat", idname = "pid",
  tname = "enrolment", control_group = "notyettreated", panel = FALSE,
  xformla = ~ 1, data = df_staggered, est_method = "dr", 
  bstrap = TRUE, cband = TRUE
)

Best, Florian

fschoner avatar Nov 30 '22 17:11 fschoner

Thanks for the follow-up.

I am confused by the example. It appears to me that each pid shows up only once? If you have T periods, each id variable should show up T times, once for each period.

(As an aside, although I don't think it's causing the issue: for staggered you should use g = Inf for never-treated units, not missing values.)

Best, Jon
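
The two points raised in this comment can be checked before calling staggered_cs(). The following is a minimal base-R sketch, not part of staggered or did: the helper name `check_panel` and the toy data frame `df_rcs` are hypothetical, and the column names simply mirror the simulated example.

```r
# Hypothetical helper (not part of staggered or did): checks whether every
# unit is observed exactly once in every period, i.e. a balanced panel.
check_panel <- function(df, i, t) {
  n_periods <- length(unique(df[[t]]))
  # Rows per (unit, period) cell; a zero means the unit is missing that period.
  cell_counts <- table(df[[i]], df[[t]])
  list(
    n_periods = n_periods,
    balanced  = all(cell_counts == 1)
  )
}

# Toy repeated cross-section in the style of the simulated example:
# each pid appears in one period only.
set.seed(1)
n <- 200
df_rcs <- data.frame(
  pid       = seq_len(n),
  state     = sample(1:16, size = n, replace = TRUE),
  enrolment = sample(1992:2009, size = n, replace = TRUE),
  outcome   = rnorm(n)
)
rcs_check <- check_panel(df_rcs, i = "pid", t = "enrolment")
rcs_check$balanced  # FALSE: each pid shows up once, not once per period

# Code never-treated units as Inf rather than NA, as suggested for staggered.
first_treat <- c(2007, NA, 2001, NA)
first_treat[is.na(first_treat)] <- Inf
```

If `balanced` is FALSE for the chosen id and time columns, the data are not a balanced panel in those columns.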

jonathandroth avatar Nov 30 '22 18:11 jonathandroth

That's true, it's a repeated cross-section. Is the repeated-cross-section type ruled out in your paper? Appendix B in Callaway and Sant'Anna (2021) deals with this case, that is, they explicitly allow for it.

I tried with Inf but it didn't change anything.

Best, Florian

fschoner avatar Nov 30 '22 19:11 fschoner

Yes, the "staggered" paper assumes panel data. If you have repeated cross-sections within a state, you can aggregate to the state level and run everything with the state-level panel. Thanks for flagging this; we should definitely have a better warning in the package if you pass a non-panel dataset.

jonathandroth avatar Nov 30 '22 19:11 jonathandroth

Ah sorry, I was not aware of it being ruled out explicitly.

Could you please elaborate on the aggregating part? You mean taking the average of the outcome in each (state, enrolment) cell? It would help me a lot if you could modify my code example.

fschoner avatar Dec 01 '22 09:12 fschoner

Yes, define Y_{st} to be the average outcome in state s in period t. Then you have a balanced panel with T periods and S states.

I am sorry, but I do not have time at the moment to modify your example, but hopefully my description is sufficient.
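
Written out, the state-level outcome described above is a cell mean. The symbols $N_{st}$, $Y_i$, $s_i$ and $t_i$ are introduced here for illustration only and do not appear in the thread: $Y_i$ is the outcome of individual observation $i$, $s_i$ and $t_i$ are its state and period, and $N_{st}$ is the number of observations in state $s$ and period $t$.

```latex
Y_{st} = \frac{1}{N_{st}} \sum_{i \,:\, s_i = s,\; t_i = t} Y_i
```

The resulting panel is balanced only if every (state, period) cell contains at least one observation, so that $N_{st} > 0$ everywhere.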

jonathandroth avatar Dec 01 '22 14:12 jonathandroth

No worries, thanks for elaborating. I still have issues which I'll explain below.

The data look as follows:

set.seed(123)
n <- 20000
df_staggered <- data.table(
  pid = 1:n,
  state = sample(c(1:16), replace = TRUE, size = n),
  enrolment = sample(c(1992:2009), replace = TRUE, size = n),
  outcome = rnorm(n)
)
# Create group variable.
df_staggered[
  , 
  first_treat := fcase(
    state == 1, 2007,
    state %in% c(4, 10), 2001,
    state == 7, 2003,
    default = Inf
  ) 
]

Everything is as detailed above: staggered_cs() throws an error while att_gt() runs through smoothly.

Aggregating as you suggested and running staggered_cs:

df_agg <- df_staggered[
  ,
  by = c("state", "enrolment"),
  lapply(.SD, mean, na.rm = TRUE)
]

staggered_cs(
  df_agg,
  i = "state",
  t = "enrolment",
  g = "first_treat",
  y = "outcome",
  estimand = "cohort"
)

This produces a result including the following warning:

    estimate         se  se_neyman
1 -0.05937119 0.06668923 0.06927651
Warning message:
In staggered(df = df, estimand = estimand, A_theta_list = A_theta_list,  :
  The treatment cohorts g = 2003, 2007 have a single cross-sectional unit only. We drop these cohorts. 

As reported, only one state each is treated in 2003 and 2007. But why is the 2003 cohort dropped? Thinking in plain diff-in-diff terms, there is still a proper control group, namely the state that is treated only in 2007. Why is it still dropped?

Unfortunately, running att_gt() and aggte() produces a different result:

sg_attgt_agg <- att_gt(
  yname = "outcome", gname = "first_treat", idname = "state",
  tname = "enrolment", control_group = "notyettreated", panel = TRUE,
  xformla = ~ 1, data = df_agg, est_method = "dr", 
  bstrap = TRUE, cband = TRUE
)
aggte(sg_attgt_agg, type = "group", cband = TRUE, bstrap = TRUE)



Call:
aggte(MP = sg_attgt_agg, type = "group", bstrap = TRUE, cband = TRUE)

Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> 


Overall summary of ATT's based on group/cohort aggregation:  
    ATT    Std. Error     [ 95%  Conf. Int.] 
 0.0236        0.0448    -0.0643      0.1115 

Interestingly, running att_gt produces different results depending on whether I set the default value of first_treat to Inf or NA (but none of the estimates coincides with the output of staggered_cs()). This is probably not intended. I can open an issue in the did package if you want.

fschoner avatar Dec 02 '22 08:12 fschoner

In your example, the 2003 cohort is dropped because, with only one unit in the cohort, it is impossible to properly estimate the variance of the potential outcomes for this cohort. We could do a DiD between the 2003 and 2007 cohorts, but with only one treated unit we cannot properly calculate SEs without strong assumptions.

I would guess that the difference with the did package may relate to how they're treating this case, but I am not sure.

Re the discrepancy between using Inf and NA for first_treat, I would flag this as an issue on the did github.

Thanks!
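
The single-unit problem can be illustrated with a minimal base-R sketch. It only counts states per cohort using the assignment from the simulated example and shows that a plain sample variance is undefined for one unit. It does not reproduce the variance estimator used inside staggered, and the variable names are chosen here for illustration.

```r
# Count distinct states per treatment cohort, using the cohort assignment
# from the simulated example (states 4 and 10 -> 2001, 7 -> 2003, 1 -> 2007,
# all other states never treated, coded as Inf).
states <- 1:16
first_treat <- rep(Inf, length(states))
first_treat[states == 1] <- 2007
first_treat[states %in% c(4, 10)] <- 2001
first_treat[states == 7] <- 2003
cohort_sizes <- table(first_treat)
cohort_sizes

# With a single unit, the sample variance divides by n - 1 = 0 and is
# therefore undefined; base R returns NA.
single_unit_var <- var(0.3)
single_unit_var  # NA
```

The 2003 and 2007 cohorts each contain one state, which matches the cohorts named in the warning message.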

jonathandroth avatar Dec 04 '22 21:12 jonathandroth