Understanding on Discrete Treatment (p>2) Inference with CausalForestDML

Open AllardJM opened this issue 3 years ago • 1 comments

Hello!

I couldn't find any examples with a discrete treatment that takes more than two values (>2) and a binary outcome, so I am hoping someone can confirm my understanding.

This data set is from a marketing campaign where customers received one of three treatments (https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html):

  • No Email
  • Email about Womens category products
  • Email about Mens category products

The outcome I chose here is whether or not the customer visited after the campaign.

Let's say the research question is whether the treatment effect depends on the customer's prior purchase categories (Mens and Womens, which are binary indicators in the data).

Here I am encoding the treatment as a numeric value (1, 2, 3) for the three categories and using a regression wrapper class so that a classifier can serve as the outcome model, since econml doesn't natively support classifiers for the outcome.
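As a rough sketch of why the wrapper returns a probability (the notation is my own shorthand, not taken from the EconML docs): DML works with the residual of the outcome after its conditional mean given X and W is removed, and for a 0/1 outcome that conditional mean is the visit probability, which is what predict_proba(X)[:, 1] estimates.

```latex
\tilde{Y} = Y - \hat{q}(X, W),
\qquad
\hat{q}(X, W) \approx \mathbb{E}[Y \mid X, W] = \Pr(Y = 1 \mid X, W)
```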

import econml
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from econml.dml import CausalForestDML
from sklearn.model_selection import train_test_split
import xgboost

import warnings
warnings.filterwarnings("ignore")

from sklearn.base import BaseEstimator, clone

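class RegressionWrapper(BaseEstimator):
    """Expose a classifier as a regressor: predict() returns P(y = 1)."""

    def __init__(self, clf):
        self.clf = clf

    def fit(self, X, y, **kwargs):
        # Fit a fresh clone so the estimator passed in is left unfitted.
        self.clf_ = clone(self.clf)
        self.clf_.fit(X, y, **kwargs)
        return self

    def predict(self, X):
        # Probability of the positive class (visit = 1) as the numeric prediction.
        return self.clf_.predict_proba(X)[:, 1]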
    
# read data and create indicator variables    
dat = pd.read_csv('http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv')

dat['phone'] = np.where(dat.channel == 'Phone',1,0)
dat['web'] = np.where(dat.channel == 'Web',1,0)
dat['multi'] = np.where(dat.channel == 'Multichannel',1,0)

dat['suburban'] = np.where(dat.zip_code == 'Suburban',1,0)
dat['rural'] = np.where(dat.zip_code == 'Rural',1,0)
dat['urban'] = np.where(dat.zip_code == 'Urban',1,0)

# treatment 
dat['test_numeric'] = 3  # womens
dat['test_numeric'] = np.where(dat.segment == 'No E-Mail',1,dat['test_numeric'].values) # control
dat['test_numeric'] = np.where(dat.segment == 'Mens E-Mail',2,dat['test_numeric'].values) # mens

# train / test split
X_train, X_test, y_train, y_test = train_test_split(dat.drop('visit',axis=1), dat[['visit']], test_size=0.50, random_state=42)

# treatment, confounders / nuisance and two variables of interest
T = X_train['test_numeric']
W = X_train[['phone','web','multi','history','recency']]
X = X_train[['mens','womens']]

# outcome
Y = y_train

# model for the treatments (multi:softprob so that predict_proba returns class probabilities)
xgb_model_mc = xgboost.XGBClassifier(objective="multi:softprob", num_class=3, random_state=42)
# model for the outcome
xgb_model = xgboost.XGBClassifier(objective="binary:logistic", random_state=42)


causal_forest = CausalForestDML(criterion='het', 
                                n_estimators=5000,       
                                min_samples_leaf=10, 
                                max_depth=5, 
                                max_samples=0.5,
                                discrete_treatment=True,  # discrete treatments
                                honest=True,
                                inference=True,
                                cv=10,
                                model_t=xgb_model_mc, # model to use for treatments
                                model_y=RegressionWrapper(xgb_model), # model for y
                                )
                      
# fit train data to causal forest model 
causal_forest.fit(Y = Y.values , T = T.values, X = X.values, W = W.values)
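The treatment encoding can also be written as an explicit mapping, which keeps the T0/T1 values passed to effect_inference readable. This is only a sketch: SEGMENT_CODES, encode_segment and contrast are hypothetical helper names, and the "Womens E-Mail" label is an assumption (the code above treats every segment that is not control or Mens as code 3).

```python
# Illustrative sketch only: these helpers are not part of econml or the post.
# Assumption: the third segment label is spelled "Womens E-Mail".
SEGMENT_CODES = {"No E-Mail": 1, "Mens E-Mail": 2, "Womens E-Mail": 3}


def encode_segment(segment):
    """Return the numeric treatment code; raise on an unexpected label."""
    try:
        return SEGMENT_CODES[segment]
    except KeyError:
        raise ValueError(f"unknown segment: {segment!r}") from None


def contrast(control, treated):
    """Build the T0/T1 keyword arguments for a control-vs-treated comparison."""
    return {"T0": encode_segment(control), "T1": encode_segment(treated)}


print(contrast("No E-Mail", "Womens E-Mail"))  # {'T0': 1, 'T1': 3}
print(contrast("No E-Mail", "Mens E-Mail"))    # {'T0': 1, 'T1': 2}
```

With pandas, the same mapping could be applied as dat['segment'].map(SEGMENT_CODES), and a contrast could be unpacked into the inference call, for example causal_forest.effect_inference(X=X, **contrast("No E-Mail", "Mens E-Mail")).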

The inference for the treatment effect of the Womens email versus no email is here (Mens would be similar):

#treatment effect (womens email - no email) when the customers purchased......
# 1) only from womens and not mens 
# 2) both womens and mens 
# 3) only mens
# 4) neither 


X = np.array([[0,1],[1,1],[1,0],[0,0]])
infer_result = causal_forest.effect_inference(X =X,T0 =1 , T1 =3 )
result_pd = infer_result.summary_frame()
result_pd.index=['Only Womens', 'Both Mens and Womens', 'Only Mens', 'Neither']
result_pd

and the result:

[images: output of result_pd]

Is this the proper way to conduct this analysis using Causal Forest?

AllardJM, Jul 12 '21 23:07

@AllardJM, I am also confused about how to interpret the heterogeneous treatment effect point estimates.

In your example, the treatment is categorical:

  • 'No E-Mail' - 1
  • 'Mens E-Mail' - 2
  • 'Womens E-Mail' - 3

In your inference, infer_result = causal_forest.effect_inference(X =X,T0 =1 , T1 =3 )

According to the discussion in issue-676, the treatment effect is the estimated average effect on Y of moving from T=1 to T=3, given X.
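Written out in my own notation (a sketch that assumes the usual potential-outcomes reading of effect_inference), the quantity being estimated for T0=1 and T1=3 would be:

```latex
\tau(x) = \mathbb{E}\left[\, Y(T{=}3) - Y(T{=}1) \mid X = x \,\right]
        = \Pr\left(Y(T{=}3) = 1 \mid X = x\right) - \Pr\left(Y(T{=}1) = 1 \mid X = x\right)
```

The second equality uses the fact that Y is binary (visit or no visit), so each expectation is a visit probability.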

Let's consider the first test sample, X.iloc[[0]], for which the point estimate is 0.074.

If we want to describe the effect on this first test sample: when the treatment is changed from T=1 ('No E-Mail') to T=3 ('Womens E-Mail'), Y ("customer visit") will increase by 0.074.
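To make the scale concrete, here is a small sketch that assumes the estimate is read as a difference in visit probability. The helper name extra_visits and the group size of 1,000 are made up for illustration; only 0.074 comes from the output above.

```python
def extra_visits(effect, n_customers):
    """Expected additional visitors if n_customers with the same X moved from T0 to T1."""
    return effect * n_customers


point_estimate = 0.074  # value quoted above for the first sample

# The same number expressed in percentage points of visit probability.
print(f"{point_estimate * 100:.1f} percentage points")  # 7.4 percentage points

# Expected extra visitors in a hypothetical group of 1,000 similar customers.
print(round(extra_visits(point_estimate, 1000)))  # 74
```

Under this reading, 0.074 is an average over customers with the same X, not a prediction for a single customer.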

Is this the correct understanding?

jaydeepchakraborty, Mar 13 '23 05:03