
Support converting to Boolean type if column contains yes/no representation

Open gsheni opened this issue 3 years ago • 3 comments

  • As a user, I wish I could use Woodwork to cast a column to Boolean Logical Type, if it contains a yes/no representation (["yes", "no", "no", "yes"] --> [True, False, False, True]).
  • This would save me time as a user and not require me to handle the mapping.
  • Additionally, there are many common ways to represent a yes/no relationship:
[1, "true", "True", "yes", "t", "T"]
[0, "false", "False", "no", "f", "F"]
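
For reference, the manual mapping this request would replace might look like the pandas-only sketch below. The value lists are the ones above; `BOOL_MAP` and `yes_no_to_bool` are illustrative names and are not part of Woodwork.

```python
import pandas as pd

# Representations taken from the two lists above.
TRUE_VALUES = [1, "true", "True", "yes", "t", "T"]
FALSE_VALUES = [0, "false", "False", "no", "f", "F"]

# Illustrative lookup table (not a Woodwork object).
BOOL_MAP = {**{v: True for v in TRUE_VALUES}, **{v: False for v in FALSE_VALUES}}


def yes_no_to_bool(col: pd.Series) -> pd.Series:
    """Map known yes/no representations to booleans, rejecting anything else."""
    mapped = col.map(BOOL_MAP)
    # Values missing from BOOL_MAP come back as NaN; fail loudly instead of guessing.
    if mapped.isna().any():
        raise ValueError("column contains values with no known boolean meaning")
    return mapped.astype(bool)


print(yes_no_to_bool(pd.Series(["yes", "no", "no", "yes"])).tolist())
# [True, False, False, True]
```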

Code Example

import pandas as pd
import woodwork as ww

# Requested behavior: a yes/no column is cast to booleans by set_types.
df = pd.DataFrame({'col1': ["yes", "no", "no", "yes"]})
df.ww.init()
df.ww.set_types(logical_types={
    'col1': 'Boolean'
})
assert df['col1'].equals(pd.Series([True, False, False, True]))

gsheni, Mar 16 '21 18:03

This issue is related to https://github.com/alteryx/woodwork/issues/153 (but this issue is specifically about supporting set_types).

gsheni, Mar 16 '21 18:03

Another option, if we wanted to let users define the boolean values, would be to add parameters to the Boolean logical type, like Boolean(true_value='yes', false_value='no'). When present, they would confirm that only those two values are in the column and then convert them to True and False. Alternatively, the parameters could each take a list of valid true and false values.

This would then mean that you could do this type of conversion in either set_types or at init:

df.ww.init(logical_types={'yes_no': Boolean(true_value='yes', false_value='no')})

vs

df.ww.init()
df.ww.set_types(logical_types={'yes_no': Boolean(true_value='yes', false_value='no')})

Even if we do let Woodwork define the valid boolean values (which would give users the benefit of being able to use the string representation of a logical type), it'll work at both init and set_types as long as we do the check/transformation in _update_column_dtype.
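
To make the check/transformation step concrete, here is a rough pandas-only sketch of what the parameterized version could do. `convert_yes_no` is a hypothetical helper name, and its `true_value`/`false_value` arguments only mirror the proposed Boolean parameters; none of this is existing Woodwork API.

```python
import pandas as pd


def convert_yes_no(col: pd.Series, true_value="yes", false_value="no") -> pd.Series:
    """Hypothetical helper: validate that only the two given values occur, then convert."""
    allowed = {true_value, false_value}
    unexpected = set(col.unique()) - allowed
    if unexpected:
        # Anything other than the two declared values (including missing values) is an error.
        raise ValueError(f"unexpected values for Boolean conversion: {unexpected}")
    # After validation every entry is one of the two values, so equality is enough.
    return col == true_value


yes_no = pd.Series(["yes", "no", "no", "yes"], name="yes_no")
print(convert_yes_no(yes_no).tolist())
# [True, False, False, True]
```

Because the helper only takes a column and the two values, the same function could be called from whichever code path ends up updating the column dtype, which is the point about init and set_types sharing one implementation.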

tamargrey, Mar 17 '21 14:03

Thinking a little more about this: if we eventually wanted to update Woodwork's type inference to infer columns with these kinds of values as Boolean (which is another part of what #153 is getting at), it'd be better to have Woodwork be the one to define the valid Boolean values from the beginning.
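
As a sketch of how a Woodwork-defined value list could feed inference, the check below returns True only when every non-missing value is one of the string representations listed in the issue. `BOOLEAN_STRINGS` and `could_be_boolean` are hypothetical names, and this is not Woodwork's actual inference function; the 1/0 representations are left out here on the assumption that they would otherwise collide with integer inference.

```python
import pandas as pd

# String representations from the issue description; the final list would be Woodwork's to define.
BOOLEAN_STRINGS = {"true", "True", "yes", "t", "T", "false", "False", "no", "f", "F"}


def could_be_boolean(col: pd.Series) -> bool:
    """Hypothetical inference check: are all non-missing values known boolean strings?"""
    values = col.dropna()
    if values.empty:
        # An all-missing column gives no evidence either way.
        return False
    return bool(values.map(lambda v: isinstance(v, str) and v in BOOLEAN_STRINGS).all())


print(could_be_boolean(pd.Series(["yes", "no", "T", None])))
print(could_be_boolean(pd.Series(["yes", "maybe"])))
```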

tamargrey, Mar 17 '21 15:03