
Column dtypes are not consistent

Open lorentzenchr opened this issue 3 years ago • 1 comments

A model spec does not preserve column dtypes when it is reused on new data. Example:

import pandas as pd
from formulaic import model_matrix


df1 = pd.DataFrame({
    'a': ['A', 'B', 'C'],
})

df2 = pd.DataFrame({
    'a': ['A', 'A', 'B'],
})

X1 = model_matrix("a", df1)
X1.info()

gives

 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Intercept  3 non-null      float64
 1   a[T.B]     3 non-null      uint8  
 2   a[T.C]     3 non-null      uint8  

But then

X2 = X1.model_spec.get_model_matrix(df2)
X2.info()

gives

 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Intercept  3 non-null      float64
 1   a[T.B]     3 non-null      uint8  
 2   a[T.C]     3 non-null      float64  # <= This is not the same dtype as before!

lorentzenchr avatar Jan 22 '22 10:01 lorentzenchr

Good catch. Casting dtypes is not always safe, though. Would you want a warning thrown here, or an explicit cast? Or maybe the option to do both?
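To make the two options concrete, here is a rough sketch of what "warn and cast" could look like on the user side. It uses only pandas; `align_dtypes` is a hypothetical helper, not part of formulaic, and the two frames are hand-built stand-ins for the `X1` and `X2` outputs in the report above.

```python
import warnings

import pandas as pd


def align_dtypes(reference: pd.DataFrame, new: pd.DataFrame, warn: bool = True) -> pd.DataFrame:
    """Hypothetical helper: cast columns of `new` to the dtypes found in `reference`."""
    # Collect the shared columns whose dtype differs between the two frames.
    mismatched = {
        col: reference[col].dtype
        for col in new.columns
        if col in reference.columns and new[col].dtype != reference[col].dtype
    }
    if not mismatched:
        return new.copy()
    if warn:
        warnings.warn(f"Casting columns to reference dtypes: {mismatched}", UserWarning)
    # astype accepts a column -> dtype mapping; some unsafe casts
    # (for example NaN into an integer dtype) raise instead of passing silently.
    return new.astype(mismatched)


# Stand-ins with the dtypes shown in the report (not real formulaic output).
X1 = pd.DataFrame(
    {"Intercept": [1.0, 1.0, 1.0], "a[T.B]": [0, 1, 0], "a[T.C]": [0, 0, 1]}
).astype({"a[T.B]": "uint8", "a[T.C]": "uint8"})
X2 = pd.DataFrame(
    {"Intercept": [1.0, 1.0, 1.0], "a[T.B]": [0, 0, 1], "a[T.C]": [0.0, 0.0, 0.0]}
).astype({"a[T.B]": "uint8"})

X2_aligned = align_dtypes(X1, X2)  # warns, then casts a[T.C] back to uint8
```

Passing `warn=False` gives the silent explicit cast. Whether something like this belongs inside the model spec or should be left to the caller is exactly the open question.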

matthewwardrop avatar Mar 06 '22 10:03 matthewwardrop

Hi again @lorentzenchr! At some point in the past, this seems to have been resolved (at least for categorical codings). If this recurs, let me know!

It is still the case that if a is a numerical vector and its dtype changes between source dataframes, the output dtype will vary as well. I think this is fine.

Given that the main issue appears to be resolved, I'm going to close this one out. Feel free to reopen if this recurs!

matthewwardrop avatar Jul 03 '23 22:07 matthewwardrop