woodwork
Improve inference of booleans to handle string representations
- To improve the boolean inference, we can say that if all the values in a column's value counts fall into:
  - [1, True, "true", "True", "yes", "t", "T"]
  - [0, False, "false", "False", "no", "f", "F"]
- --> we should infer the Logical Type to be Boolean.
- This would prevent the following weird inference, where both columns are inferred to be categorical.
Relates to
- https://github.com/FeatureLabs/woodwork/issues/52
@gsheni can you explain why the example you gave is currently inferred as categorical? I wonder if this should get classified as a bug instead, and the fix is simply to set np.nan/pd.NA values aside when we do type inference. Because if you did that, the only remaining value in your example is True.
Yeah, if I follow this right, my suggestion is:
- Close this issue
- File a bug to track fixing the specific example you gave and attach it to the type inference epic
- File a new feature issue to track "Use sampling during inference" and attach it to the type inference epic
@dsherry My example wasn't clear enough. Let's say we had some Data Columns like this:
[1, 0, 1, 1]
["true", "false", "true", "true"]
["True", "False", "True", "True"]
["yes", "no", "yes", "yes"]
["t", "f", "t", "t"]
["T", "F", "T", "T"]
All of these DataColumns should be inferred as the Boolean Logical Type and converted to the following representation (pd.BooleanDtype).
[True, False, True, True]
If there is np.nan/pd.NA in the column, it should be ignored when inferring the Logical Type.
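A rough sketch of the behavior described above, not Woodwork's actual implementation: `looks_boolean` and `to_boolean` are hypothetical helper names, and the two value sets are copied from the issue description. It assumes missing values are dropped before checking and are carried through as pd.NA after conversion.

```python
import numpy as np
import pandas as pd

# Value sets taken from the issue description. Note that 1 == True and
# 0 == False in Python, so each pair collapses to one set element, and
# floats 1.0 / 0.0 also match by equality.
TRUE_VALUES = {1, True, "true", "True", "yes", "t", "T"}
FALSE_VALUES = {0, False, "false", "False", "no", "f", "F"}


def looks_boolean(series: pd.Series) -> bool:
    """Hypothetical check: every non-null value is a known boolean representation."""
    values = set(series.dropna().tolist())
    # An all-null (or empty) column is not treated as boolean here.
    return bool(values) and values <= (TRUE_VALUES | FALSE_VALUES)


def to_boolean(series: pd.Series) -> pd.Series:
    """Hypothetical conversion to the nullable boolean dtype, keeping nulls as pd.NA."""
    converted = [
        pd.NA if pd.isna(value) else (value in TRUE_VALUES)
        for value in series.tolist()
    ]
    return pd.Series(
        pd.array(converted, dtype="boolean"),
        index=series.index,
        name=series.name,
    )


column = pd.Series(["yes", "no", np.nan, "yes"])
if looks_boolean(column):
    column = to_boolean(column)
```

This sketch would also accept a column that mixes representations (for example "yes" alongside 0); whether that should count as Boolean is a design decision the issue leaves open.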
Got it. So these
[True, False, True, True, np.nan]
[True, False, True, True, pd.NA]
would also end up as boolean logical type, converted to pd.BooleanDtype, resulting in
[True, False, True, True, pd.NA]
yes?
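For reference, a small pandas-only sketch of the two cases in question. It only shows how pandas' nullable boolean dtype treats the missing value; it does not show how Woodwork performs the conversion.

```python
import numpy as np
import pandas as pd

with_nan = pd.Series([True, False, True, True, np.nan])
with_na = pd.Series([True, False, True, True, pd.NA])

# Casting to the nullable "boolean" dtype keeps the four booleans and
# represents the missing entry as pd.NA in both cases.
converted_nan = with_nan.astype("boolean")
converted_na = with_na.astype("boolean")

print(converted_nan)
print(converted_na)
```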
It occurs to me we'll want the same nan-tolerant behavior when we infer any type, not just booleans, right? Are there other types which we need to address right now? Whoever picks this up, please look into that / add test coverage to look into that :)
@dsherry Yes, we want the NaNs converted properly for Boolean Logical Types.
Though, this issue is more about converting string representations of boolean:
["true", "false", "true", "true"]
["True", "False", "True", "True"]
["yes", "no", "yes", "yes"]
["t", "f", "t", "t"]
If we update the inference of booleans to identify series such as [1, 0, 1, 1] as a Boolean logical type, this series will also be a match for the Integer logical type. Both of these inference functions will call series.isnull().any() separately, which could be inefficient for large datasets. As part of the implementation, it would be good to update so this call only needs to happen once, if possible.
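One possible shape for this, as a sketch only: compute the null mask once and hand the result to each candidate check. The names `infer_logical_type`, `_is_boolean`, and `_is_integer` are hypothetical and are not Woodwork's inference functions; checking Boolean before Integer is also an assumption made for illustration.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def _is_boolean(non_null: pd.Series, has_nulls: bool) -> bool:
    # Hypothetical check: only 0/1 (or True/False) remain once nulls are removed.
    return set(non_null.tolist()) <= {0, 1}


def _is_integer(non_null: pd.Series, has_nulls: bool) -> bool:
    # Hypothetical check: an integer dtype, or a float dtype that only became
    # float because nulls were present and whose values are all whole numbers.
    if is_integer_dtype(non_null.dtype):
        return True
    if is_float_dtype(non_null.dtype) and has_nulls:
        return bool((non_null % 1 == 0).all())
    return False


def infer_logical_type(series: pd.Series) -> str:
    # The null scan happens once here and is shared by every check below.
    null_mask = series.isnull()
    has_nulls = bool(null_mask.any())
    non_null = series[~null_mask]
    if non_null.empty:
        return "Unknown"
    # Assumed precedence: Boolean is tried before Integer.
    for name, check in (("Boolean", _is_boolean), ("Integer", _is_integer)):
        if check(non_null, has_nulls):
            return name
    return "Unknown"
```

With this layout, a series such as [1, 0, 1, 1] matches the first check and the Integer check is never run, so the null scan is not repeated.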
See PR #830 for additional context.