EconML
Error of crossfit folds splits with DynamicDML
Hi,
I am estimating the effect of high levels of particulate matter (PM2.5) on excess deaths, using daily panel data for 25 municipalities. My treatment is binary: T=1 when the PM2.5 level is high and T=0 when it is low. The outcome is also binary: Y=1 for excess deaths and Y=0 for no excess deaths.
I am using the DynamicDML class to fit my model, but I get this error message: "AttributeError: Provided crossfit folds contain training splits that don't contain all treatments". However, 50% of the observations have T=1, which I would expect to be enough to obtain balanced crossfit folds.
Here is my code, with econml version 0.15 and dowhy version 0.10.1. The data file is dataset_pm_deaths.csv.
```python
import dowhy
import econml
from dowhy import CausalModel
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
import scipy.stats as stats
from itertools import product
from econml.utilities import WeightedModelWrapper
from sklearn.model_selection import train_test_split
from econml.panel.dml import DynamicDML

data_all = pd.read_csv("D:/dataset_pm_deaths.csv")
data = data_all[data_all['Year'] >= 2009]

# Binarize the treatment at the median PM2.5 level
median_pm25 = data['PM25'].median()
data['PM25'] = (data['PM25'] >= median_pm25).astype(int)

# Standardize the pollutant controls
data.BC = stats.zscore(data.BC, nan_policy='omit')
data.DMS = stats.zscore(data.DMS, nan_policy='omit')
data.PM = stats.zscore(data.PM, nan_policy='omit')
data.OC = stats.zscore(data.OC, nan_policy='omit')
data.SO2 = stats.zscore(data.SO2, nan_policy='omit')
data.SO4 = stats.zscore(data.SO4, nan_policy='omit')

data0 = data[['excess', 'PM25', 'cod_munici', 'BC', 'DMS', 'PM', 'OC', 'SO2', 'SO4', 'Temperature', 'lead1_PM25']]
data0 = data0.dropna()
Y = data0.excess.to_numpy()
T = data0.PM25.to_numpy()
percentage_high_PM25 = np.mean(T == 1) * 100
W = data0[['BC', 'DMS', 'PM', 'OC', 'SO2', 'SO4', 'Temperature']].to_numpy().reshape(-1, 7)
X = data0[['Temperature', 'lead1_PM25']].to_numpy().reshape(-1, 2)
groups = data0.cod_munici.to_numpy()

estimate0 = DynamicDML(discrete_treatment=True,
                       featurizer=PolynomialFeatures(degree=3),
                       linear_first_stages=False,
                       cv=3,
                       random_state=123)
estimate0.fit(Y=Y, T=T, X=X, W=W, inference='auto', groups=groups)  # HERE IS THE ERROR
```
Have you tried passing a StratifiedKFold object or creating your own CV splitter? That could help you out in the meantime.
Hi @TimCosemans
Thanks for your suggestions!