woodwork
Series with PostalCode logical type can have `float` or `str` elements.
For example,
import pandas as pd
import woodwork as ww
ser = pd.Series([12345, 67890]).astype('category')
ser = ww.init_series(ser, logical_type='PostalCode')
In the code block above, the elements of the series are numeric (integers, strictly speaking), whereas in the following block they are strings:
ser = pd.Series(["12345", "67890"]).astype('category')
ser = ww.init_series(ser, logical_type='PostalCode')
Both are valid initializations. We should decide whether we want to support both data types for the PostalCode logical type.
This issue was previously discussed in https://github.com/alteryx/featuretools/pull/2365.
To add a little more: I think part of the inconsistent and confusing behavior is that if you take a series that has numeric values but not a `category` dtype, and initialize it with the `PostalCode` logical type, the numeric values are converted to strings:
>>> ser = pd.Series([12345, 67890])
>>> ser = ww.init_series(ser, logical_type='PostalCode')
>>> type(ser[0])
<class 'str'>
But if you start with the same values and set the dtype to `category` before WW init, you end up with numeric values instead of strings:
>>> ser = pd.Series([12345, 67890]).astype("category")
>>> ser = ww.init_series(ser, logical_type='PostalCode')
>>> type(ser[0])
<class 'numpy.int64'>
I believe WW should provide consistent output in this case, so that regardless of the input dtype, the elements have the same type after WW initialization.
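As a rough illustration of what "consistent output" could mean, here is a minimal sketch that normalizes every postal code value to a string regardless of its input type. The helper name `normalize_postal_code` is hypothetical and is not part of Woodwork. It only shows one possible convention (always strings, with missing values preserved as `None`), not how Woodwork behaves or should be implemented.

```python
import math


def normalize_postal_code(value):
    """Return a postal code as a string, or None for missing values.

    Hypothetical helper for illustration only; not part of Woodwork.
    """
    # Treat None and float NaN as missing and leave them unconverted.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    # Render integral floats such as 12345.0 without the ".0" suffix.
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    # Integers, strings, and anything else fall back to str().
    return str(value)


# Mixed input types all end up as strings (or None for missing values).
codes = [12345, 67890.0, "12345", None]
normalized = [normalize_postal_code(code) for code in codes]
print(normalized)
```

Under this assumed convention, a caller could apply the helper to the values of a series before calling `ww.init_series`, so that the category and non-category paths both start from string elements. Whether the conversion belongs in Woodwork itself or in user code is the open question of this issue.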