[Discussion] obtain the result of sel in df[sel] fast via JIT
Is your feature request related to a problem? Please describe.
Recently, we have wanted to use a JIT compiler (#2754) to accelerate DataFrame getitem with boolean expressions. However, Mars currently optimizes these operations into an eval operator, while our JIT compiler does not have a parser like numexpr does. So I want to get the raw boolean expressions directly after Mars fuses the operations.
```python
# raw code (note: boolean indexing needs `&`, not `and`)
df = df[(df.A > 0.5) & (df.B < 0.5)]
# Mars optimized code (eval accepts the `and` spelling)
df = df.eval('A > 0.5 and B < 0.5')
```
Describe the solution you'd like
Mars could split this kind of optimization into two stages:
- analyze the DataFrame getitem operation; if the fusion condition is met, provide a raw structure;
- choose an engine to compute it; for now, three engines could be provided: python, numexpr, and sdc.
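To make the two-stage split concrete, here is a minimal, self-contained sketch with plain-Python stand-ins (none of these names are actual Mars APIs; everything here is hypothetical): stage one keeps the predicate as a raw structure instead of a pre-built eval string, and stage two dispatches that structure to an engine.

```python
import operator

# Stage 1: the "raw structure" -- a list of (column, op, constant) clauses
# combined with logical AND, kept as data rather than as an eval string.
raw_predicate = [("A", operator.gt, 0.5), ("B", operator.lt, 0.5)]

def to_eval_expr(predicate):
    """Render the raw structure as a string for an eval-style engine."""
    ops = {operator.gt: ">", operator.lt: "<"}
    return " and ".join(f"{col} {ops[op]} {val!r}" for col, op, val in predicate)

def python_engine(predicate, rows):
    """Plain-Python engine: evaluate the raw structure row by row."""
    return [all(op(row[col], val) for col, op, val in predicate) for row in rows]

rows = [{"A": 0.9, "B": 0.1}, {"A": 0.2, "B": 0.3}, {"A": 0.8, "B": 0.9}]
print(to_eval_expr(raw_predicate))         # A > 0.5 and B < 0.5
print(python_engine(raw_predicate, rows))  # [True, False, False]
```

The point of the sketch is only that keeping the predicate as data lets each backend (string-based eval, plain Python, or a JIT engine) consume it in its own preferred form.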
Additional context
In which files is this optimization implemented? I would like to try to provide a PR for this if I can follow the related code. Any advice or suggestions are welcome. @qinxuye @wjsi
Perhaps we need to generate an op which is a subclass of mars.core.operand.fuse.Fuse to represent this computation.
During optimization this op is generated; it can then be handed over to pandas_eval or other JIT engines for execution.
Hi, I looked into the Mars code that transforms df[sel] into df.eval(sel_expr).
Right now I am trying to implement a similar transformation, in the form below:
```python
# origin
df2 = df[(df.A > 0.5) & (df.B < 0.5)]
# transformed
bool_series = df.apply(lambda s: (s.A > 0.5) & (s.B < 0.5), axis=1, ...)
df2 = df[bool_series]
```
I expect our JIT compiler (via the parallelization trick in df.apply) can accelerate computing the bool_series, which would ultimately bring a performance boost. Do you have any thoughts on this?
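As a sanity check that the transformed form selects the same rows as the direct getitem, here is a small pandas example (the toy data is my own, and the extra `...` apply arguments from the snippet above are omitted):

```python
import pandas as pd

# Toy data (not from the benchmark) to verify that the apply-based
# transformation selects the same rows as the direct boolean getitem.
df = pd.DataFrame({"A": [0.9, 0.2, 0.8], "B": [0.1, 0.3, 0.9]})

# origin
direct = df[(df.A > 0.5) & (df.B < 0.5)]

# transformed: build the boolean mask row by row via apply(axis=1)
bool_series = df.apply(lambda s: (s.A > 0.5) & (s.B < 0.5), axis=1)
via_apply = df[bool_series]

print(direct.equals(via_apply))  # True
```

The two forms are equivalent in result; the open question in this thread is whether the row-wise apply can be made fast enough to beat the vectorized original.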
This transform-to-apply design is kind of tricky.
I tested the performance of the related code on my local machine.
```python
# float version
# normal pandas
def _test_pandas(df):
    df1 = df[
        (df.A > 0.5)
        & (df.B > 0.5)
        & (df.C < 0.5)
        & (df.D > 0.5)
        & (df.E > 0.5)
        & (df.F < 0.5)
    ]
    return df1

# df.query
def _test_query(df):
    df1 = df.query(
        "(A > 0.5) & (B > 0.5) & (C < 0.5) & (D > 0.5) & (E > 0.5) & (F < 0.5)"
    )
    return df1
```
```python
# string version
# normal pandas
def _test_pandas(df):
    df1 = df[
        (df.A == "A")
        & (df.B == "B")
        & (df.C == "C")
        & (df.D == "D")
        & (df.E == "E")
        & (df.F == "F")
    ]
    return df1

# df.query
def _test_query(df):
    a, b, c, d, e, f = "A", "B", "C", "D", "E", "F"
    df1 = df.query(
        "(A == @a) & (B == @b) & (C == @c) & (D == @d) & (E == @e) & (F == @f)"
    )
    return df1
```
| implementation | 1000k rows × 6 cols, float (s) | 1000k rows × 6 cols, string (s) |
| --- | --- | --- |
| normal pandas | 0.026 | 0.41 |
| df.query | 0.038 | 0.23 |
| jit version of normal pandas | 0.014 | 5.63 |
| jit version of apply+getitem | 2.574 | 11.51 |

All times are in seconds.
From these initial results, my transform-to-apply design does not work. Also, our JIT compiler can only accelerate numerical comparisons, while it performs badly on string comparisons. I am still investigating the underlying reason.
I'm still following this issue... We need to refactor/enhance the related parts of our JIT compiler first... I will sync here when ready.