
[Bug] Problem with features in model_y with SparseDML

Open sami-ka opened this issue 3 years ago • 0 comments

@kbattocchi Following my previous issue (#648), I checked with version 0.13.1 and it seems there is another problem with SparseDML.

Here is a code snippet I used with version 0.13.1:

from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from econml.dml import DML, SparseLinearDML
from econml.sklearn_extensions.linear_model import DebiasedLasso

est_sparsedml = SparseLinearDML(model_y=RandomForestRegressor(),
                                model_t=MultiOutputRegressor(RandomForestRegressor(min_samples_leaf=10)),
                                cv=2)

est_dml = DML(model_y=RandomForestRegressor(),
              model_t=MultiOutputRegressor(RandomForestRegressor(min_samples_leaf=10)),
              model_final=DebiasedLasso(),
              cv=2)

est_sparsedml.fit(Y_econ, T_econ, X=X_econ, W=None)
est_dml.fit(Y_econ, T_econ, X=X_econ, W=None)
print('X dim :', X_econ.shape, 'Y dim :', Y_econ.shape, 'T dim :', T_econ.shape)
print('Features in for first stage model y SparseDML:', est_sparsedml.models_y[0][0].n_features_in_)
print('Features in for first stage model y DML:', est_dml.models_y[0][0].n_features_in_)

and the output:

X dim : (1356, 3) Y dim : (1356,) T dim : (1356, 2)
Features in for first stage model y SparseDML: 12
Features in for first stage model y DML: 3

You can see that when I use SparseDML, the dimension of X is somehow modified. I dug into the code and I think it comes from the _combine method of the _FirstStageWrapper class (dml.py), specifically the cross_product call at line 57.

This is because there is no featurizer, so the variable F is set to X and then used in the cross product: X (dim n x 3) is crossed with hstack([1, F]) (dim n x 4), creating the n x (3*4) = n x 12 input feature matrix.
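For illustration, here is a minimal numpy sketch of this dimension blow-up (random data, shapes matching the example above; it assumes cross_product forms all pairwise column products row by row):

import numpy as np

n = 1356
X = np.random.normal(size=(n, 3))

# With no featurizer, F = X, and _combine crosses X with [1, F]:
# every column of X multiplied by every column of the augmented matrix.
F_aug = np.hstack([np.ones((n, 1)), X])                     # shape (n, 4)
crossed = np.einsum('ni,nj->nij', X, F_aug).reshape(n, -1)
print(crossed.shape)                                        # (1356, 12)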

It comes from the fact that linear_first_stages is set to True. In my opinion, the automatic assignment of this value needs to be addressed, and fixing it would "solve" the problem at hand. Here is the result when linear_first_stages is explicitly set to False:

est_sparsedml = SparseLinearDML(model_y=RandomForestRegressor(),
                                model_t=MultiOutputRegressor(RandomForestRegressor(min_samples_leaf=10)),
                                cv=2)

est_sparsedml_linear_false = SparseLinearDML(model_y=RandomForestRegressor(),
                                             model_t=MultiOutputRegressor(RandomForestRegressor(min_samples_leaf=10)),
                                             linear_first_stages=False,
                                             cv=2)

est_dml = DML(model_y=RandomForestRegressor(),
              model_t=MultiOutputRegressor(RandomForestRegressor(min_samples_leaf=10)),
              model_final=DebiasedLasso(),
              cv=2)

est_sparsedml.fit(Y_econ, T_econ, X=X_econ, W=None)
est_sparsedml_linear_false.fit(Y_econ, T_econ, X=X_econ, W=None)
est_dml.fit(Y_econ, T_econ, X=X_econ, W=None)
print('X dim :', X_econ.shape, 'Y dim :', Y_econ.shape, 'T dim :', T_econ.shape)
print('Features in for first stage model y SparseDML:', est_sparsedml.models_y[0][0].n_features_in_)
print('Features in for first stage model y SparseDML linear stage=False:', est_sparsedml_linear_false.models_y[0][0].n_features_in_)
print('Features in for first stage model y DML:', est_dml.models_y[0][0].n_features_in_)

X dim : (1356, 3) Y dim : (1356,) T dim : (1356, 2)
Features in for first stage model y SparseDML: 12
Features in for first stage model y SparseDML linear stage=False: 3
Features in for first stage model y DML: 3

However, if I had actually used linear models as first stages, I would still have had this problem. In addition, I am not sure I understand why linear first stages come into play in the _combine function at all, so I would have rewritten it this way:

import numpy as np
from sklearn.base import clone
from econml.utilities import cross_product, hstack, shape


class _FirstStageWrapper:
    def __init__(self, model, is_Y, featurizer, linear_first_stages, discrete_treatment):
        self._model = clone(model, safe=False)
        self._featurizer = clone(featurizer, safe=False)
        self._is_Y = is_Y
        self._linear_first_stages = linear_first_stages
        self._discrete_treatment = discrete_treatment

    def _combine(self, X, W, n_samples, fitting=True):
        if X is None:
            # if both X and W are None, just return a column of ones
            return (W if W is not None else np.ones((n_samples, 1)))

        if self._is_Y:
            # expand the features only when an explicit featurizer is given,
            # instead of keying off linear_first_stages
            if self._featurizer is not None:
                F = self._featurizer.fit_transform(X) if fitting else self._featurizer.transform(X)
                X = cross_product(X, hstack([np.ones((shape(X)[0], 1)), F]))
        XW = hstack([X, W]) if W is not None else X
        return XW
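As a quick, hypothetical sanity check of the rewritten _combine (the model argument is unused by _combine, so None is passed here):

import numpy as np

wrapper = _FirstStageWrapper(model=None, is_Y=True, featurizer=None,
                             linear_first_stages=True, discrete_treatment=False)
XW = wrapper._combine(np.random.normal(size=(1356, 3)), None, 1356)
print(XW.shape)  # (1356, 3): with no featurizer, model_y sees the original 3 features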

sami-ka Jul 08 '22 09:07