
Improve inference of booleans to handle string representations

Open gsheni opened this issue 4 years ago • 6 comments

  • To improve boolean inference, we can say that if all of a column's distinct values (its value counts) fall into:
    • [1, True, "true", "True", "yes", "t", "T"]
    • [0, False, "false", "False", "no", "f", "F"]
    • --> we should infer the Logical Type to be Boolean.
  • This would prevent the following weird inference, where both columns are inferred to be categorical.

gsheni avatar Sep 25 '20 21:09 gsheni

Relates to

  • https://github.com/FeatureLabs/woodwork/issues/52

gsheni avatar Sep 25 '20 21:09 gsheni

@gsheni can you explain why the example you gave is currently inferred as categorical? I wonder if this should get classified as a bug instead, and the fix is simply to set np.nan/pd.NA values aside when we do type inference. Because if you did that, the only remaining value in your example is True.

Yeah, if I follow this right, my suggestion is:

  • Close this issue
  • File a bug to track fixing the specific example you gave and attach it to the type inference epic
  • File a new feature issue to track "Use sampling during inference" and attach it to the type inference epic

dsherry avatar Nov 05 '20 15:11 dsherry

@dsherry My example wasn't clear enough. Let's say we had some DataColumns like this:

[1, 0, 1, 1]
["true", "false", "true", "true"]
["True", "False", "True", "True"]
["yes", "no", "yes", "yes"]
["t", "f", "t", "t"]
["T", "F", "T", "T"]

All of these DataColumns should be inferred as the Boolean Logical Type and converted to the following representation (pd.BooleanDtype):

[True, False, True, True]

If there is np.nan/pd.NA in the column, it should be ignored when inferring the Logical Type.
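A minimal sketch of the behavior described above, in plain pandas. This is not woodwork's implementation: `TRUE_VALUES`, `FALSE_VALUES`, `MAPPING`, and `to_boolean` are hypothetical names, and the accepted values are simply the ones listed in this issue.

```python
import pandas as pd

# Hypothetical lookup tables, copied from the values listed in this issue.
TRUE_VALUES = [1, True, "true", "True", "yes", "t", "T"]
FALSE_VALUES = [0, False, "false", "False", "no", "f", "F"]
MAPPING = {**{v: True for v in TRUE_VALUES}, **{v: False for v in FALSE_VALUES}}


def to_boolean(series):
    """Return a pd.BooleanDtype Series if every non-null value is a known
    boolean representation; otherwise return None.

    Assumes the values are hashable (strings, numbers, booleans).
    """
    non_null = series.dropna()  # nulls are ignored during inference
    if non_null.empty or not all(v in MAPPING for v in non_null):
        return None
    # na_action="ignore" leaves missing values untouched so they become pd.NA.
    converted = series.map(lambda v: MAPPING[v], na_action="ignore")
    return converted.astype("boolean")


if __name__ == "__main__":
    print(to_boolean(pd.Series(["yes", "no", None, "yes"])))
```

A column containing anything outside the two lists (for example "maybe") returns None here, so other inference rules could still apply to it.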

gsheni avatar Nov 05 '20 21:11 gsheni

Got it. So these

[True, False, True, True, np.nan]
[True, False, True, True, pd.NA]

would also end up as boolean logical type, converted to pd.BooleanDtype resulting in

[True, False, True, True, pd.NA]

yes?

It occurs to me we'll want the same nan-tolerant behavior when we infer any type, not just booleans, right? Are there other types which we need to address right now? Whoever picks this up, please look into that / add test coverage to look into that :)
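For reference, the NaN-tolerant conversion in the two examples above can be checked directly in pandas. This is only a sketch of the expected end result, not the woodwork code path; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Both columns are object dtype because they mix booleans with a missing value.
with_nan = pd.Series([True, False, True, True, np.nan])
with_na = pd.Series([True, False, True, True, pd.NA])

# Casting to the nullable "boolean" dtype (pd.BooleanDtype) keeps the
# booleans and represents either kind of missing value as pd.NA.
converted_nan = with_nan.astype("boolean")
converted_na = with_na.astype("boolean")

if __name__ == "__main__":
    print(converted_nan)
    print(converted_na)
```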

dsherry avatar Nov 06 '20 23:11 dsherry

@dsherry Yes, we want the NaNs converted properly for Boolean Logical Types.

Though, this issue is more about converting string representations of boolean:

["true", "false", "true", "true"]
["True", "False", "True", "True"]
["yes", "no", "yes", "yes"]
["t", "f", "t", "t"]

gsheni avatar Nov 08 '20 16:11 gsheni

If we update the inference of booleans to identify a series such as [1, 0, 1, 1] as a Boolean logical type, this series will also be a match for the Integer logical type. Both of these inference functions will call series.isnull().any() separately, which could be inefficient for large datasets. As part of the implementation, it would be good to restructure the code so that this call only needs to happen once, if possible.

See PR #830 for additional context.
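One possible shape for the "compute the null check once" idea, as a rough sketch. The helper names (`looks_boolean`, `looks_integer`, `infer_logical_type`) and the string labels are hypothetical and are not woodwork's API; checking Boolean before Integer is also an assumption made only for this example.

```python
import pandas as pd


def looks_boolean(non_null):
    # Hypothetical check: every remaining value is 0/1 (or False/True).
    return len(non_null) > 0 and all(v in (0, 1) for v in non_null)


def looks_integer(non_null):
    # Hypothetical check: every remaining value is a whole number.
    return len(non_null) > 0 and all(
        isinstance(v, (int, float))
        and not isinstance(v, bool)
        and float(v).is_integer()
        for v in non_null
    )


def infer_logical_type(series):
    # The null mask is computed a single time and the filtered values are
    # shared, so the individual checks never call isnull() themselves.
    null_mask = series.isnull()
    non_null = series[~null_mask]
    if looks_boolean(non_null):
        return "Boolean"
    if looks_integer(non_null):
        return "Integer"
    return "Unknown"


if __name__ == "__main__":
    print(infer_logical_type(pd.Series([1, 0, 1, 1])))
    print(infer_logical_type(pd.Series([3, 4, None])))
```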

thehomebrewnerd avatar Apr 16 '21 20:04 thehomebrewnerd