EconML
Understanding on Discrete Treatment (p>2) Inference with CausalForestDML
Hello!
I couldn't find any examples with a discrete treatment that takes more than two values and a binary outcome, so I'm hoping someone can confirm my understanding.
This data set is from a marketing campaign where customers received one of three treatments (https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html):
- No Email
- Email about Women's category products
- Email about Men's category products
The outcome I chose here is whether or not the customer visited after the campaign.
Let's say the research question is whether the treatment effect depends on the customer's prior purchase categories (Mens and Womens are binary indicator columns in the data).
Here I am coding the treatment numerically (1, 2, 3) for the three categories and using a regression wrapper class so that a classifier can be used as the outcome model, to work around the fact that econml doesn't natively support binary outcomes.
import econml
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from econml.dml import CausalForestDML
from sklearn.model_selection import train_test_split
import xgboost
import warnings
warnings.filterwarnings("ignore")
from sklearn.base import BaseEstimator, clone
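class RegressionWrapper(BaseEstimator):
    """Expose a classifier's P(y = 1) through predict() so it can serve as model_y."""

    def __init__(self, clf):
        self.clf = clf

    def fit(self, X, y, **kwargs):
        self.clf_ = clone(self.clf)
        self.clf_.fit(X, y, **kwargs)
        return self

    def predict(self, X):
        # probability of the positive class (visit = 1)
        return self.clf_.predict_proba(X)[:, 1]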
# read data and create indicator variables
dat = pd.read_csv('http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv')
dat['phone'] = np.where(dat.channel == 'Phone',1,0)
dat['web'] = np.where(dat.channel == 'Web',1,0)
dat['multi'] = np.where(dat.channel == 'Multichannel',1,0)
dat['suburban'] = np.where(dat.zip_code == 'Suburban',1,0)
dat['rural'] = np.where(dat.zip_code == 'Rural',1,0)
dat['urban'] = np.where(dat.zip_code == 'Urban',1,0)
# treatment
dat['test_numeric'] = 3 # womens
dat['test_numeric'] = np.where(dat.segment == 'No E-Mail',1,dat['test_numeric'].values) # control
dat['test_numeric'] = np.where(dat.segment == 'Mens E-Mail',2,dat['test_numeric'].values) # mens
# train / test split
X_train, X_test, y_train, y_test = train_test_split(dat.drop('visit',axis=1), dat[['visit']], test_size=0.50, random_state=42)
# treatment, confounders / nuisance variables, and the two variables of interest
T = X_train['test_numeric']
W = X_train[['phone','web','multi','history','recency']]
X = X_train[['mens','womens']]
# outcome
Y = y_train
# model for the treatments (multi:softprob so that predict_proba returns class probabilities)
xgb_model_mc = xgboost.XGBClassifier(objective="multi:softprob", num_class=3, random_state=42)
# model for the outcome
xgb_model = xgboost.XGBClassifier(objective="binary:logistic", random_state=42)
causal_forest = CausalForestDML(
    criterion='het',
    n_estimators=5000,
    min_samples_leaf=10,
    max_depth=5,
    max_samples=0.5,
    discrete_treatment=True,  # discrete treatments
    honest=True,
    inference=True,
    cv=10,
    model_t=xgb_model_mc,  # model to use for treatments
    model_y=RegressionWrapper(xgb_model),  # model for y
)
# fit the causal forest on the training data
causal_forest.fit(Y=Y.values, T=T.values, X=X.values, W=W.values)
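As an aside on the treatment coding above: the chained np.where calls label every row that is not 'No E-Mail' or 'Mens E-Mail' as 3, so an unexpected or misspelled segment value would silently be counted as the Womens arm. A minimal sketch of a stricter alternative is below, using an explicit lookup that fails on unknown labels. The names SEGMENT_TO_CODE and encode_segment are hypothetical, and the 'Womens E-Mail' label is an assumption about the raw data rather than something taken from the code above.

```python
# Hypothetical helper: explicit segment -> treatment code lookup.
# The 'Womens E-Mail' label is an assumption about the raw data.
SEGMENT_TO_CODE = {
    "No E-Mail": 1,      # control
    "Mens E-Mail": 2,    # mens email
    "Womens E-Mail": 3,  # womens email
}


def encode_segment(label):
    """Return the numeric treatment code, raising on unknown labels."""
    try:
        return SEGMENT_TO_CODE[label]
    except KeyError:
        raise ValueError(f"unexpected segment label: {label!r}") from None


segments = ["Womens E-Mail", "No E-Mail", "Mens E-Mail"]
codes = [encode_segment(s) for s in segments]
print(codes)  # [3, 1, 2]
```

With pandas the same lookup could be applied as dat['segment'].map(SEGMENT_TO_CODE), which leaves unmatched labels as NaN so they can be caught before fitting.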
The inference for the treatment effect of the Womens email versus no email is below (Mens would be similar):
# treatment effect (womens email - no email) when the customer previously purchased...
# 1) only from womens and not mens
# 2) both womens and mens
# 3) only mens
# 4) neither
# columns are [mens, womens], in the same order as the X used for fitting
X = np.array([[0, 1], [1, 1], [1, 0], [0, 0]])
infer_result = causal_forest.effect_inference(X=X, T0=1, T1=3)
result_pd = infer_result.summary_frame()
result_pd.index = ['Only Womens', 'Both Mens and Womens', 'Only Mens', 'Neither']
result_pd
and the result:
Is this the proper way to conduct this analysis using Causal Forest?
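One rough cross-check that may help here, assuming the three segments were assigned at random (this is not stated above, so treat it as an assumption): within a given X subgroup, the difference in raw visit rates between the Womens arm and the control arm should land close to the forest's point estimate for that subgroup. The sketch below shows the idea on simulated data only. The visit rates, the simulate and visit_rate helpers, and the sample size are all made up for illustration.

```python
import random

random.seed(0)

# Simulated customers in a single subgroup (e.g. "Only Womens").
# These visit probabilities are made up, not taken from the real data.
TRUE_VISIT_RATE = {1: 0.10, 3: 0.17}  # treatment code -> P(visit)


def simulate(n_per_arm):
    """Draw (treatment, visit) pairs with a fixed number of rows per arm."""
    rows = []
    for t, p in TRUE_VISIT_RATE.items():
        for _ in range(n_per_arm):
            rows.append((t, 1 if random.random() < p else 0))
    return rows


def visit_rate(rows, t):
    """Share of customers in arm t who visited."""
    outcomes = [y for (arm, y) in rows if arm == t]
    return sum(outcomes) / len(outcomes)


rows = simulate(20_000)
# difference in visit rates: womens email (3) minus no email (1)
naive_effect = visit_rate(rows, 3) - visit_rate(rows, 1)
print(round(naive_effect, 3))
```

On the real data the equivalent would be a groupby on the mens, womens and test_numeric columns followed by a comparison of mean visit between arms. A large disagreement with the forest estimates would be a reason to look more closely at the setup.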
@AllardJM, I am also unsure how to interpret the heterogeneous treatment effect point estimates.
In your example, the treatment is categorical: 'No E-Mail' = 1, 'Mens E-Mail' = 2, 'Womens E-Mail' = 3.
Your inference call is infer_result = causal_forest.effect_inference(X=X, T0=1, T1=3).
According to the discussion in issue-676, the treatment effect is the estimated average effect on Y of moving from T=1 to T=3, given X.
Consider the first test sample, X.iloc[[0]], where the point estimate is 0.074.
To describe the effect for this first test sample: if the treatment is changed from T=1 ('No E-Mail') to T=3 ('Womens E-Mail'), then Y ("customer visit") increases by 0.074.
Is this understanding correct?
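For reference, my understanding of the quantity that effect_inference(X, T0=1, T1=3) reports can be written in potential-outcome notation as below. The symbols Y(t) and \tau are my own notation, not something from this thread, and the second equality relies on Y being binary.

```latex
\begin{aligned}
\tau(x) &= \mathbb{E}\left[\, Y(3) - Y(1) \mid X = x \,\right] \\
        &= \Pr\left(Y(3) = 1 \mid X = x\right) - \Pr\left(Y(1) = 1 \mid X = x\right)
\end{aligned}
```

Under that reading, a point estimate of 0.074 would be an average change of 0.074 in the probability of a visit (7.4 percentage points) among customers who share that X, rather than a change in any single customer's outcome.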