siuba
Series cannot be converted into a numpy array in `mutate` function.
It's embarrassing to admit that I'm struggling with quite a trivial problem.
df = pd.DataFrame({'x': np.arange(10)})
df >> mutate(y = np.asarray(_.x)**2)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last) in
----> 1 df >> mutate(yy = np.asarray(_.x)**2)
~/.local/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
81
82 """
---> 83 return array(a, dtype, copy=False, order=order)
84
85
ValueError: invalid __array_struct__
However, using a lambda function works fine.
df >> mutate(y = lambda d: np.asarray(d.x)**2)
Undoubtedly, you don't have to rely on `np.asarray()` for this type of simple manipulation. But some external library functions use `np.asarray()` internally quite often (in my case, I tried to use `np.digitize()`).
Under certain circumstances, complex mutation will be required with the help of external libraries.
Did I miss something basic in using the siuba package? Please let me know.
Hey @AnselmJeong -- thanks for raising! How to use external functions (rather than pandas methods) is something I'm still puzzling over. The lambda approach seems like the right way to handle it. I'm guessing this will be the most surprising part of siuba that users hit (especially if coming from R). I've added some context and links on using external functions below.
Do you remember where you looked for an answer? I'm thinking about adding a note to the quickstart / somewhere early in the docs.
External functions
In general, if you want siuba to be able to optimize grouped operations (see here for rationale) or produce SQL, then you'll need to use either:
- methods and basic operations (e.g. _.some_col.mean() + 1)
- functions designed to return a symbolic expression
But if you are okay with it not being optimizable in a groupby, then you can use a lambda, like in your example (this is a constraint in pandas). I do this pretty often when I know I won't be using a groupby.
Adding custom functions
I started playing around with siuba today and wondered about this kind of limitation. It seems like it should be possible to do something like this:
df >> mutate(y=_(np.sqrt, _.x))
And indeed, you can get rather close very easily:
def call(fn):
    @symbolic_dispatch
    def _inner(*args):
        return fn(*args)
    return _inner

df >> mutate(y=call(np.sqrt)(_.x))
But is this any better than just calling a lambda? Presumably you'd still lose the ability to optimize grouped operations and emit SQL.
Surprisingly (to me), this doesn't work, though I'm unsure why exactly:
@symbolic_dispatch
def call(fn, *args):
    return fn(*args)
Hey--late to this, but I think the last case doesn't work because...
- `symbolic_dispatch` uses the first argument to decide whether to use your concrete function or create a Symbolic (e.g. what `_.some_col` is)
- when the first argument is a Symbolic, it returns a symbolic (e.g. `call(_)`)
- because `np.sqrt` is not a symbolic, `call(np.sqrt)` tries to execute your function (essentially `np.sqrt()`)
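The first-argument rule in the list above can be mimicked in plain Python. This is only a toy model under stated assumptions, not siuba's implementation: `ToySymbolic`, `DeferredCall`, `toy_symbolic_dispatch`, and `call_data_first` are hypothetical names made up for illustration, and `math.sqrt` stands in for `np.sqrt`.

```python
import math
from functools import wraps


class ToySymbolic:
    """Hypothetical stand-in for siuba's `_`: a placeholder for data supplied later."""


class DeferredCall:
    """Records a function call so it can run once real data is available."""

    def __init__(self, func, args):
        self.func = func
        self.args = args

    def __call__(self, data):
        # Swap each placeholder for the real data, then run the function.
        resolved = [data if isinstance(a, ToySymbolic) else a for a in self.args]
        return self.func(*resolved)


def toy_symbolic_dispatch(func):
    """Toy decorator: only the FIRST argument decides between run-now and defer."""

    @wraps(func)
    def wrapper(first, *rest):
        if isinstance(first, ToySymbolic):
            return DeferredCall(func, (first, *rest))
        return func(first, *rest)

    return wrapper


@toy_symbolic_dispatch
def call(fn, *args):
    # Function-first signature, as in the version that did not work.
    return fn(*args)


@toy_symbolic_dispatch
def call_data_first(x, fn, *args):
    # Data-first signature: the placeholder lands in the dispatching position.
    return fn(x, *args)


_ = ToySymbolic()

# fn comes first and is not a placeholder, so the body runs immediately
# and math.sqrt receives the placeholder itself.
try:
    call(math.sqrt, _)
    fn_first_failed = False
except TypeError:
    fn_first_failed = True

deferred = call_data_first(_, math.sqrt)  # placeholder first: deferred
late_result = deferred(9.0)               # evaluated once data arrives
eager_result = call_data_first(16.0, math.sqrt)  # concrete data: runs now
```

In this sketch the function-first version fails for the same reason described above: the dispatcher only ever looks at the first argument, so the placeholder in second position goes unnoticed.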
Here's an example of how siuba could implement a version of call (`call2`) that always returns a symbolic, using the same pieces `symbolic_dispatch` does under the hood:
from siuba.data import mtcars
from siuba.siu import symbolic_dispatch, FuncArg, create_sym_call, strip_symbolic
from siuba import _
import numpy as np

@symbolic_dispatch
def call(x, fn, *args):
    # this gets called when a non-symbolic form of x is passed
    # e.g. a series, etc..
    # otherwise, it returns a Symbolic
    return fn(x, *args)

def call2(fn, *args):
    # this always returns something symbolic

    # representation of a function being called (e.g. when called returns the function)
    call_func = FuncArg(fn)

    # create a symbolic call. this helper function ensures we strip
    # any existing symbolic expressions (e.g. _.some_col)
    return create_sym_call(call_func, *args)

# Case 1 =======
# executes
call(mtcars.mpg, np.sqrt)
# creates a symbolic
call2(np.sqrt, mtcars.mpg)

# Case 2 =======
# creates symbolic
call(_.mpg, np.sqrt)
# also creates symbolic
call2(np.sqrt, _.mpg)
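To make the contrast with `call2` concrete, here is a toy model in plain Python of a helper that always defers, whatever its arguments are. It is a sketch only and does not reflect how siuba builds calls internally: `Placeholder`, `Deferred`, and `toy_call2` are hypothetical names, and `math.sqrt` stands in for `np.sqrt`.

```python
import math


class Placeholder:
    """Hypothetical stand-in for `_`: marks where the data goes later."""


class Deferred:
    """A recorded call: nothing runs until it is applied to data."""

    def __init__(self, fn, args):
        self.fn = fn
        self.args = args

    def __call__(self, data):
        # Substitute the data for every placeholder, keep other arguments as-is.
        resolved = [data if isinstance(a, Placeholder) else a for a in self.args]
        return self.fn(*resolved)


def toy_call2(fn, *args):
    # Never inspects the arguments, so it never executes fn eagerly.
    return Deferred(fn, args)


_ = Placeholder()

lazy_sqrt = toy_call2(math.sqrt, _)      # deferred, with the function first
lazy_const = toy_call2(math.sqrt, 16.0)  # still deferred, even with concrete input

result_a = lazy_sqrt(9.0)    # placeholder replaced by 9.0 at call time
result_b = lazy_const(None)  # no placeholder, so the data argument is unused
```

Because the toy helper never branches on its arguments, the function can sit in first position without triggering immediate execution, which mirrors the behaviour shown for `call2` in both cases above.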
I hadn't thought much about giving `_(...)` a special meaning, but it seems like it could be very useful!