woodwork
Support converting to Boolean type if column contains yes/no representation
- As a user, I wish I could use Woodwork to cast a column to Boolean Logical Type, if it contains a yes/no representation (["yes", "no", "no", "yes"] --> [True, False, False, True]).
- This would save me time as a user and not require me to handle the mapping.
- Additionally, there are many common ways to represent a yes/no relationship:
[1, "true", "True", "yes", "t", "T"]
[0, "false", "False", "no", "f", "F"]
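For reference, the manual mapping a user has to write today might look roughly like the sketch below. It uses plain pandas only; `TRUE_VALUES`, `FALSE_VALUES`, and `to_boolean` are hypothetical names for illustration and are not part of Woodwork. Missing values are not handled here and would raise an error.

```python
import pandas as pd

# Hypothetical lookup sets built from the representations listed above.
TRUE_VALUES = {1, "true", "True", "yes", "t", "T"}
FALSE_VALUES = {0, "false", "False", "no", "f", "F"}


def to_boolean(series: pd.Series) -> pd.Series:
    """Map recognized yes/no representations to True/False (illustrative only)."""

    def convert(value):
        if value in TRUE_VALUES:
            return True
        if value in FALSE_VALUES:
            return False
        # Anything outside the two sets (including NaN) is rejected.
        raise ValueError(f"Cannot interpret {value!r} as a boolean")

    return series.map(convert).astype(bool)


converted = to_boolean(pd.Series(["yes", "no", "no", "yes"]))
```

The sketch matches values exactly, so "Yes" or "NO" would be rejected; whether to normalize case is a design decision left open here.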
Code Example
import pandas as pd
import woodwork as ww
df = pd.DataFrame({'col1': ["yes", "no", "no", "yes"]})
df.ww.init()
df.ww.set_types(logical_types={
'col1': 'Boolean'
})
assert df['col1'].equals(pd.Series([True, False, False, True]))
This issue is related to https://github.com/alteryx/woodwork/issues/153 (but this issue is specifically about supporting set_types).
Another option, if we wanted to let users define the boolean values, would be to add parameters to the Boolean logical type, like Boolean(true_value='yes', false_value='no'). When present, they would confirm that only those two values appear in the column and then convert them to True and False. Alternatively, the parameters could define lists of valid true and false values.
This would then mean that you could do this type of conversion in either set_types or at init:
df.ww.init(logical_types={'yes_no': Boolean(true_value='yes', false_value='no')})
vs
df.ww.init()
df.ww.set_types(logical_types={'yes_no': Boolean(true_value='yes', false_value='no')})
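To make the proposed behavior concrete, here is a rough sketch of the "confirm, then convert" step written against plain pandas. The helper name `convert_with_values` is hypothetical, and the `true_value` / `false_value` parameters only mirror the proposal above; none of this exists in Woodwork. Treating missing values as an error is an assumption of the sketch.

```python
import pandas as pd


def convert_with_values(series: pd.Series, true_value, false_value) -> pd.Series:
    """Hypothetical helper: validate the column, then convert it to bool."""
    if series.isna().any():
        # Assumption: this sketch does not try to represent missing values.
        raise ValueError("Missing values are not supported in this sketch")

    # Confirm that only the two configured values are present.
    unexpected = set(series.unique()) - {true_value, false_value}
    if unexpected:
        raise ValueError(
            f"Unexpected values for Boolean conversion: {sorted(map(repr, unexpected))}"
        )

    # Element-wise comparison yields a bool Series: True where the value
    # equals true_value, False where it equals false_value.
    return series == true_value


yes_no = pd.Series(["yes", "no", "no", "yes"])
result = convert_with_values(yes_no, true_value="yes", false_value="no")
```

Because the check runs before the conversion, a stray value such as "maybe" fails loudly instead of silently becoming False.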
Even if we do let Woodwork define the valid boolean values (which would give users the benefit of being able to use the string representation of a logical type), it'll work both at init and in set_types as long as we do the check/transformation in _update_column_dtype.
Thinking a little more about this: if we eventually wanted to update Woodwork's type inference to infer columns with these kinds of values as Boolean (which is another part of what #153 is getting at), it'd be better to have Woodwork be the one to define the valid Boolean values from the beginning.
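As a rough illustration of what such an inference check could look like, the sketch below tests whether every non-null value in a column belongs to a fixed set of known representations. `BOOLEAN_LIKE` and `looks_boolean` are hypothetical names, and this is not how Woodwork's inference is actually implemented.

```python
import pandas as pd

# Hypothetical set of recognized representations, taken from the lists
# earlier in this issue.
BOOLEAN_LIKE = {
    1, 0,
    "true", "false",
    "True", "False",
    "yes", "no",
    "t", "f",
    "T", "F",
}


def looks_boolean(series: pd.Series) -> bool:
    """Return True if every non-null value is a recognized representation."""
    values = series.dropna()
    # An empty (or all-null) column gives no evidence either way, so the
    # sketch declines to call it Boolean.
    return len(values) > 0 and all(value in BOOLEAN_LIKE for value in values)


yes_no_like = looks_boolean(pd.Series(["yes", "no", "no", "yes"]))
free_text_like = looks_boolean(pd.Series(["yes", "maybe"]))
```

One caveat with a shared list like this: a numeric column containing only 0 and 1 would also pass the check, so a real inference rule would probably need to consider the column's dtype as well.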