formulaic
Column dtypes are not consistent
A model spec does not preserve the column dtypes when it is reused on new data. Example:
import pandas as pd
from formulaic import model_matrix
df1 = pd.DataFrame({
    'a': ['A', 'B', 'C'],
})
df2 = pd.DataFrame({
    'a': ['A', 'A', 'B'],
})
X1 = model_matrix("a", df1)
X1.info()
gives
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Intercept 3 non-null float64
1 a[T.B] 3 non-null uint8
2 a[T.C] 3 non-null uint8
But then
X2 = X1.model_spec.get_model_matrix(df2)
X2.info()
gives
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Intercept 3 non-null float64
1 a[T.B] 3 non-null uint8
2 a[T.C] 3 non-null float64 # <= This is not the same dtype as before!
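One way to surface this kind of mismatch programmatically is to compare the dtypes of the two matrices column by column. The sketch below is only an illustration: `dtype_diff` is a hypothetical helper, not part of formulaic, and the two frames are hand-built stand-ins that mimic the `info()` output above so the snippet runs with pandas alone.

```python
import pandas as pd


def dtype_diff(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return the columns whose dtype differs between two frames (hypothetical helper)."""
    both = pd.DataFrame({
        "left": left.dtypes.astype(str),
        "right": right.dtypes.astype(str),
    })
    # Keep only the rows (column names) where the dtype names disagree.
    return both[both["left"] != both["right"]]


# Stand-ins for X1 and X2, built by hand to match the dtypes reported above.
X1_like = pd.DataFrame({
    "Intercept": pd.Series([1.0, 1.0, 1.0], dtype="float64"),
    "a[T.B]": pd.Series([0, 1, 0], dtype="uint8"),
    "a[T.C]": pd.Series([0, 0, 1], dtype="uint8"),
})
X2_like = pd.DataFrame({
    "Intercept": pd.Series([1.0, 1.0, 1.0], dtype="float64"),
    "a[T.B]": pd.Series([0, 0, 1], dtype="uint8"),
    "a[T.C]": pd.Series([0.0, 0.0, 0.0], dtype="float64"),
})

diff = dtype_diff(X1_like, X2_like)
print(diff)
```

With the real objects from the example, the call would be `dtype_diff(X1, X2)`.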
Good catch. Casting dtypes is not always safe, though. Would you want a warning through here, or an explicit cast? Or maybe the option to do both?
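To make the "warning plus explicit cast" option concrete, here is a minimal user-side sketch. It is an assumption about what such behaviour could look like, not formulaic's implementation: `cast_like` is a hypothetical helper, and the frames are hand-built stand-ins for the matrices in the report. The cast is only applied when a round trip back to the original dtype leaves every value unchanged.

```python
import warnings

import pandas as pd


def cast_like(reference: pd.DataFrame, other: pd.DataFrame, warn: bool = True) -> pd.DataFrame:
    """Cast columns of `other` to the dtypes of same-named columns in `reference`.

    Hypothetical helper: raises ValueError if a cast would change any value.
    """
    result = other.copy()
    for name, target in reference.dtypes.items():
        if name not in result.columns or result[name].dtype == target:
            continue
        original = result[name]
        converted = original.astype(target)
        # Round-trip check: casting back must reproduce the original values.
        if not (converted.astype(original.dtype) == original).all():
            raise ValueError(f"casting column {name!r} to {target} would change values")
        if warn:
            warnings.warn(f"column {name!r} cast from {original.dtype} to {target}")
        result[name] = converted
    return result


# Stand-ins mimicking the dtypes of X1 and X2 in the report.
reference = pd.DataFrame({
    "Intercept": pd.Series([1.0, 1.0, 1.0], dtype="float64"),
    "a[T.B]": pd.Series([0, 1, 0], dtype="uint8"),
    "a[T.C]": pd.Series([0, 0, 1], dtype="uint8"),
})
other = pd.DataFrame({
    "Intercept": pd.Series([1.0, 1.0, 1.0], dtype="float64"),
    "a[T.B]": pd.Series([0, 0, 1], dtype="uint8"),
    "a[T.C]": pd.Series([0.0, 0.0, 0.0], dtype="float64"),
})

fixed = cast_like(reference, other, warn=False)
print(fixed.dtypes)
```

Passing `warn=True` emits a warning for each column that is cast, which covers the "do both" option; a lossy cast (for example 0.5 to uint8) raises instead of silently truncating.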
Hi again @lorentzenchr ! At some point in the past, this seems to have been resolved (at least for categorical codings). If this recurs, let me know!
It is still the case that if a is a numerical vector and its dtype changes between source dataframes, the output dtype will vary as well. I think this is fine.
Given that the main issue appears to be resolved, I'm going to close this one out. Feel free to reopen if this recurs!