Better handling of `treated` input in `RegressionDiscontinuity`
When doing regression discontinuity analysis, e.g.
result = cp.RegressionDiscontinuity(
    df,
    formula="y ~ 1 + x + treated + x:treated",
    model=cp.pymc_models.LinearRegression(sample_kwargs={"random_seed": seed}),
    treatment_threshold=0.5,
)
it looks like `treated` has to be of type `bool`. A mysterious error arises if it is instead coded as integer 0s and 1s.
- [ ] Add an extra data validation step
- [ ] Add a test to check that we get an exception if we provide ints
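A minimal sketch of what the two tasks above could look like, assuming the check runs on the dataframe before the model is fit. The names `TreatedDtypeError` and `validate_treated_not_int` are hypothetical and are not part of CausalPy; only `pd.api.types.is_integer_dtype` is an existing pandas function.

```python
import pandas as pd


class TreatedDtypeError(ValueError):
    """Hypothetical exception for this sketch; not part of CausalPy."""


def validate_treated_not_int(df: pd.DataFrame, column: str = "treated") -> None:
    """Raise if the treatment indicator column is integer-coded."""
    if column not in df.columns:
        raise KeyError(f"Column {column!r} not found in the dataframe")
    if pd.api.types.is_integer_dtype(df[column]):
        raise TreatedDtypeError(
            f"Column {column!r} has dtype {df[column].dtype}; "
            "convert it with .astype(bool) or pd.Categorical first."
        )


# Integer-coded indicator, as in the problematic case described above
df_int = pd.DataFrame({"x": [0.2, 0.8], "treated": [0, 1]})
# Same data with a boolean indicator
df_bool = df_int.assign(treated=df_int["treated"].astype(bool))

validate_treated_not_int(df_bool)  # boolean input passes silently

# The shape a test could take: integer input must raise
try:
    validate_treated_not_int(df_int)
    raised = False
except TreatedDtypeError:
    raised = True
```

In a real test suite the `try`/`except` would more likely be written with `pytest.raises`; it is spelled out here only to keep the sketch free of extra dependencies.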
I'd like to solve this. Can you provide a code snippet and the error message? Please include the definition of `df`, and in particular the `treated` column.
Hi @inhandan. Here's an MWE to reproduce the bug:
import causalpy as cp
import pandas as pd
import numpy as np
seed = 42
threshold = 0.5
x = np.random.uniform(0, 1, 100)
treated = np.where(x > threshold, 1, 0) # dtype is int
y = 2 * x + treated + np.random.normal(0, 1, 100)
df = pd.DataFrame({'x': x, 'treated': treated, 'y': y})
assert df["treated"].dtype == "int64"
result = cp.RegressionDiscontinuity(
    df,
    formula="y ~ 1 + x + treated + x:treated",
    model=cp.pymc_models.LinearRegression(sample_kwargs={"random_seed": seed}),
    treatment_threshold=threshold,
)
But the bug disappears if we set `treated` as categorical (`df["treated"] = pd.Categorical(df["treated"])`) or bool (`df["treated"] = df["treated"].astype(bool)`).
I guess the best option is to throw a warning if `treated` is not categorical or boolean. That puts the onus on the user to ensure the data is being entered correctly. This would probably also be safer and less error-prone than trying to coerce `treated` to categorical or int.
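For illustration, the warning approach might look roughly like the sketch below. The helper name `warn_if_treated_not_bool_or_categorical` is made up for this example and is not a CausalPy function; it leaves the data untouched and only tells the user what to fix.

```python
import warnings

import pandas as pd


def warn_if_treated_not_bool_or_categorical(
    df: pd.DataFrame, column: str = "treated"
) -> bool:
    """Warn (without modifying df) when the column is neither bool nor categorical.

    Hypothetical helper for this sketch. Returns True when the dtype is acceptable.
    """
    dtype = df[column].dtype
    ok = pd.api.types.is_bool_dtype(dtype) or isinstance(dtype, pd.CategoricalDtype)
    if not ok:
        warnings.warn(
            f"Column {column!r} has dtype {dtype}; expected bool or categorical. "
            "Convert it with .astype(bool) or pd.Categorical.",
            UserWarning,
            stacklevel=2,
        )
    return ok


# Integer-coded indicator triggers the warning
df = pd.DataFrame({"treated": [0, 1, 1]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ok_int = warn_if_treated_not_bool_or_categorical(df)
n_warnings = len(caught)

# After the user converts the column, the check passes
df["treated"] = df["treated"].astype(bool)
ok_bool = warn_if_treated_not_bool_or_categorical(df)
```

Returning a flag as well as warning is a design choice for the sketch, so that calling code can decide whether to stop early.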
@inhandan have you made any progress with this issue?