
What to do when the calibration test results cannot satisfy the rules?

Open LuqianSun opened this issue 3 years ago • 6 comments

Hi, I am using grf to estimate heterogeneous treatment effects (no tune.parameters set). Before clustering, the results seem reasonable in terms of mean.forest.prediction.

>   test_calibration(grf_model)

Best linear fit using forest predictions (on held-out data)
as well as the mean forest prediction as regressors, along
with one-sided heteroskedasticity-robust (HC3) SEs:

                               Estimate Std. Error t value Pr(>t)
mean.forest.prediction          0.93592    0.92198  1.0151 0.1550
differential.forest.prediction -1.41913    1.47637 -0.9612 0.8318

However, after clustering by City, the calibration results are totally different.

>   test_calibration(grf_cluster_model)

Best linear fit using forest predictions (on held-out data)
as well as the mean forest prediction as regressors, along
with one-sided heteroskedasticity-robust (HC3) SEs:

                               Estimate Std. Error t value Pr(>t)
mean.forest.prediction           4.1314     3.6483  1.1324 0.1288
differential.forest.prediction  -6.2057     5.4139 -1.1463 0.8741

Any ideas about that?
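
For reference, the two fits look roughly like this (a minimal sketch; X, Y, W, and City stand in for the covariate matrix, outcome, treatment, and city identifier in my data, with everything else left at the defaults):

library(grf)

# Unclustered fit.
grf_model <- causal_forest(X, Y, W)
test_calibration(grf_model)

# Same data, but trees are grown on city-level samples.
grf_cluster_model <- causal_forest(X, Y, W, clusters = as.integer(factor(City)))
test_calibration(grf_cluster_model)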

LuqianSun · Jun 24 '22

Hi @LuqianSun, you could try tuning the grf_cluster_model by passing tune.parameters = "all" and see if that helps.
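
Something along these lines (a sketch, reusing the placeholder names from the fits above):

grf_cluster_model <- causal_forest(X, Y, W,
                                   clusters = as.integer(factor(City)),
                                   tune.parameters = "all")
test_calibration(grf_cluster_model)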

erikcs · Jul 02 '22

Thanks @erikcs, I tried the orthogonalized causal forest, clustered by City, with tune.parameters = "all", but it still cannot pass the test.

[1] "test_calibration"

Best linear fit using forest predictions (on held-out data)
as well as the mean forest prediction as regressors, along
with one-sided heteroskedasticity-robust (HC3) SEs:

                               Estimate Std. Error t value  Pr(>t)  
mean.forest.prediction           2.3021     1.7255  1.3342 0.09108 .
differential.forest.prediction  -7.8122     7.8204 -0.9990 0.84109  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

LuqianSun · Jul 06 '22

One way to interpret this is that GRF does not detect any HTEs in your clustered setup. Stefan has a lecture covering the intuition behind this kind of check here: https://www.youtube.com/watch?v=fAUmCRgpP6g&t=1240s

Another more "interpretable" way to assess a GRF fit is to look at the TOC/RATE available here: https://grf-labs.github.io/grf/reference/rank_average_treatment_effect.html (an introductory vignette to this kind of metric is here https://grf-labs.github.io/grf/articles/rate.html)

erikcs · Jul 06 '22

@erikcs Thanks. On your second point, after reading the links, the Qini curve does not tell us anything about the GRF fit. It is merely about interpreting HTEs, if there are any in the first place.

LuqianSun · Jul 08 '22

Also, in this situation differential.forest.prediction is bigger than 1. Do we need to tune the parameters of the W and Y forests?

[1] "test_calibration"

Best linear fit using forest predictions (on held-out data)
as well as the mean forest prediction as regressors, along
with one-sided heteroskedasticity-robust (HC3) SEs:

                               Estimate Std. Error t value    Pr(>t)    
mean.forest.prediction          1.12144    0.17416  6.4390  6.47e-11 ***
differential.forest.prediction  2.65782    0.77650  3.4228 0.0003119 ***

For example:

forest.W <- regression_forest(X, W, clusters = Clusters, tune.parameters = "all")
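
That is, roughly this pattern (a sketch of what I mean; Y and Clusters as above, with the out-of-bag nuisance predictions passed on to causal_forest):

W.hat <- predict(forest.W)$predictions

forest.Y <- regression_forest(X, Y, clusters = Clusters, tune.parameters = "all")
Y.hat <- predict(forest.Y)$predictions

cf <- causal_forest(X, Y, W, Y.hat = Y.hat, W.hat = W.hat,
                    clusters = Clusters, tune.parameters = "all")
test_calibration(cf)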

LuqianSun · Jul 08 '22

> @erikcs Thanks. On your second point, after reading the links, the Qini curve does not tell us anything about the GRF fit. It is merely about interpreting HTEs, if there are any in the first place.

RATE can be seen as a calibration metric just like test_calibration, but one that is easier to interpret. The true CATEs are unobserved, so simple "calibration" metrics like MSE that you would use for prediction are not applicable here. That is the motivation behind these new diagnostic measures: to tell you something about the CATEs you just estimated without knowing the ground truth.

erikcs · Jul 21 '22