dfply
Mutate with boolean expressions
Hi, thank you for all the good work here, I like this the best of the dplyr clones.
In R I am able to do something like,
df %>% mutate(newcol = ifelse(x > 3 & lead(y) < 2, 'yes', 'no'))
In Python it seems that I should be using the numpy.where function. I also read enough of your documentation to realize I need to wrap this function in another function with the @make_symbolic decorator. So, I have this:
@make_symbolic
def np_where(bools, val_if_true, val_if_false):
    return list(np.where(bools, val_if_true, val_if_false))
When I call it like this, it works just fine:
df >>= mutate(my_val = np_where(lead(X.CPOS) == 'F', 'Punct', 'Not Punc'))
However, if I make the boolean expression more complex with ands or ors, I get an error:
df >>= mutate(my_val = np_where(lead(X.CPOS) == 'F' & X.CPOS == 'F', 'Punct', 'Not Punc'))
I also tried with:
df >>= mutate(my_val = np_where(lead(X.CPOS) == 'F' and X.CPOS == 'F', 'Punct', 'Not Punc'))
I get this error: TypeError: index returned non-int (type Intention)
I thought that my @make_symbolic decorator took care of this kind of thing. Perhaps I need a logical and that also has the delaying decorator.
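On the `and` attempt specifically: Python's `and` keyword cannot be overloaded. It asks the left operand for its truth value and then returns one of the two operands, so a delayed expression never gets a chance to record it, whereas `&` goes through `__and__`. Here is a minimal sketch with a made-up `Lazy` class (a hypothetical placeholder, not dfply's actual Intention class) whose `__bool__` raises on purpose to make the difference visible:

```python
class Lazy:
    """Hypothetical deferred expression: records operations instead of running them."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        # `&` is dispatched here, so the operation can be recorded for later
        return Lazy(f"({self.text} & {other.text})")

    def __bool__(self):
        # `and` / `or` / `if` ask for a truth value immediately
        raise TypeError("a deferred expression has no truth value yet")


left, right = Lazy("a"), Lazy("b")

combined = left & right   # builds a new deferred expression
print(combined.text)      # (a & b)

try:
    left and right        # calls bool(left) right away
except TypeError as exc:
    and_error = str(exc)
    print(and_error)
```

If a class defines neither `__bool__` nor `__len__`, `left and right` does not fail at all: it simply evaluates to `right`, silently dropping the first condition.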
I believe this is part of a larger problem: standard Python functions that have not been specifically adapted for dfply do not work. Take for example joining multiple str columns together:
df >> mutate(new_col = "_".join([col1, col2, col3]))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-4cac91689aba> in <module>
----> 1 ap_raw >> mutate(cell_name = "_".join([X.patient, X.cellid]))
TypeError: sequence item 0: expected str instance, Intention found
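For what it's worth, the traceback can be reproduced without dfply. `str.join` insists on real `str` instances at call time, while an overloaded operator such as `+` can hand back another delayed object. The sketch below uses a toy `Deferred` class and a `col` helper that I made up for illustration; they are assumptions, not dfply's implementation:

```python
class Deferred:
    """Toy stand-in for a delayed column reference (hypothetical)."""

    def __init__(self, func):
        self.func = func

    def evaluate(self, row):
        return self.func(row)

    def __add__(self, other):
        # Deferred + something: record the addition, run it later
        return Deferred(lambda row: self.evaluate(row) + _resolve(other, row))

    def __radd__(self, other):
        # something + Deferred: same idea with the operands swapped
        return Deferred(lambda row: _resolve(other, row) + self.evaluate(row))


def _resolve(value, row):
    """Evaluate delayed values; pass plain values through unchanged."""
    return value.evaluate(row) if isinstance(value, Deferred) else value


def col(name):
    """Delayed lookup of a field in a row (a plain dict here)."""
    return Deferred(lambda row: row[name])


row = {"patient": "p1", "cellid": "c7"}

try:
    "_".join([col("patient"), col("cellid")])  # needs str objects right now
except TypeError as exc:
    join_error = str(exc)
    print(join_error)

expr = col("patient") + "_" + col("cellid")    # stays delayed
combined = expr.evaluate(row)
print(combined)  # p1_c7
```

By analogy, writing the concatenation with `+` (something like X.patient + "_" + X.cellid) or wrapping the join in a function decorated with @make_symbolic may be what dfply needs here, though I have not verified either against the library.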
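I think the problem is just that operator precedence in Python is different from R: in Python, & binds more tightly than ==, so lead(X.CPOS) == 'F' & X.CPOS == 'F' is grouped as lead(X.CPOS) == ('F' & X.CPOS) == 'F'. If you wrap each condition in parentheses, I believe it will work.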
E.g. this works:
import numpy as np
import pandas as pd
from dfply import *

@make_symbolic
def np_where(bools, val_if_true, val_if_false):
    return np.where(bools, val_if_true, val_if_false)

df = pd.DataFrame({'cond1': [0, 1], 'cond2': [1, 0]})
# Each comparison is parenthesized because & binds more tightly than ==
df >> mutate(result=np_where((X.cond1 == 1) & (X.cond2 == 1), 5, 2))
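If you want to see the grouping for yourself, the standard library's `ast` module shows how Python parses the two spellings. This is only a sketch; `lead_cpos` and `cpos` are placeholder names standing in for lead(X.CPOS) and X.CPOS:

```python
import ast


def root_node(expr):
    """Name of the top-level AST node Python builds for an expression."""
    return type(ast.parse(expr, mode="eval").body).__name__


unparenthesized = "lead_cpos == 'F' & cpos == 'F'"
parenthesized = "(lead_cpos == 'F') & (cpos == 'F')"

# Without parentheses the whole thing is one chained comparison ...
print(root_node(unparenthesized))  # Compare
# ... with parentheses it is a bitwise AND of two comparisons.
print(root_node(parenthesized))    # BinOp

# The middle operand of the chained comparison is the sub-expression 'F' & cpos
tree = ast.parse(unparenthesized, mode="eval").body
middle = tree.comparators[0]
print(type(middle).__name__, type(middle.op).__name__)  # BinOp BitAnd
```

So without parentheses Python never compares the two columns separately; it first tries to compute 'F' & cpos and then chains the two == tests around that.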
Also note that the package seems to have an if_else function built in. See https://github.com/kieferk/dfply/blob/master/dfply/vector.py. It seems to use a list comprehension instead of np.where, though, so it could potentially be slower than needed.