
[Discussion] obtain the result of sel in df[sel] fast via JIT

Open dlee992 opened this issue 2 years ago • 4 comments

Is your feature request related to a problem? Please describe.

Recently, we have wanted to use our JIT compiler (#2754) to accelerate df getitem with boolean expressions. However, Mars currently optimizes these operations into an eval operator, and our JIT compiler does not have an expression parser like numexpr's. So I would like to obtain the raw boolean expression directly after Mars has fused it.

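# raw code (use & with parentheses; `and` between two Series raises in pandas)
df = df[(df.A > 0.5) & (df.B < 0.5)]
# Mars optimized code (conceptually: the selector is computed by an eval operator)
df = df[df.eval('A > 0.5 and B < 0.5')]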
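In plain pandas, the mask-based selection and the eval-based selection can be checked against each other. The following is a minimal sketch that uses pandas only, not Mars; the frame, its size, and the variable names are assumptions made for illustration. Note that `and` between two Series raises an error in pandas, so the explicit mask is written with `&` and parentheses, while the expression string passed to `eval` may use `and`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names A and B follow the issue text.
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.random(1000), "B": rng.random(1000)})

# Explicit boolean mask built from two comparisons.
mask_explicit = (df.A > 0.5) & (df.B < 0.5)

# The same selector computed from an expression string.
mask_eval = df.eval("A > 0.5 and B < 0.5")

# Three ways to obtain the filtered frame in pandas.
filtered_explicit = df[mask_explicit]
filtered_eval = df[mask_eval]
filtered_query = df.query("A > 0.5 and B < 0.5")
```

All three filtered frames should contain the same rows; the difference is only in how the selector is evaluated.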

Describe the solution you'd like

Mars could split this kind of optimization into two stages:

  1. analyze the df getitem operation and, if the fusion condition is met, provide a raw structure;
  2. choose among different engines to compute it; right now, three engines could be provided: python, numexpr, and sdc.
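The two stages above could be sketched roughly as follows. This is not Mars code: `Comparison`, `FusedFilter`, `ENGINES`, and `select` are hypothetical names invented for this illustration, and only conjunctions of simple column comparisons are handled. The "raw structure" is a small dataclass, and each engine is a function that turns it into a boolean mask.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

import numpy as np
import pandas as pd

_OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


@dataclass(frozen=True)
class Comparison:
    """One term such as `A > 0.5` (hypothetical structure)."""
    column: str
    op: str
    value: float


@dataclass(frozen=True)
class FusedFilter:
    """Stage 1 output: a conjunction of comparisons, independent of any engine."""
    terms: Tuple[Comparison, ...]


def run_python(df: pd.DataFrame, fused: FusedFilter) -> pd.Series:
    """Engine 1: evaluate each term with ordinary pandas operators."""
    mask = np.ones(len(df), dtype=bool)
    for term in fused.terms:
        mask &= _OPS[term.op](df[term.column], term.value).to_numpy()
    return pd.Series(mask, index=df.index)


def run_pandas_eval(df: pd.DataFrame, fused: FusedFilter) -> pd.Series:
    """Engine 2: render an expression string and hand it to DataFrame.eval.

    pandas uses numexpr here when it is installed, otherwise plain Python.
    Column names are assumed to be valid Python identifiers.
    """
    expr = " and ".join(
        f"({t.column} {t.op} {t.value!r})" for t in fused.terms
    )
    return df.eval(expr)


# Stage 2: an engine registry; a JIT engine would be one more entry here.
ENGINES: Dict[str, Callable[[pd.DataFrame, FusedFilter], pd.Series]] = {
    "python": run_python,
    "pandas_eval": run_pandas_eval,
}


def select(df: pd.DataFrame, fused: FusedFilter, engine: str = "python") -> pd.DataFrame:
    """Compute the selector with the chosen engine, then apply it."""
    return df[ENGINES[engine](df, fused)]
```

With this split, a JIT backend would receive the `FusedFilter` structure directly and would not need to parse an expression string.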

Additional context

In which files is this optimization implemented? I would like to try to submit a PR for this if I can follow the related code. Any advice or suggestions are welcome. @qinxuye @wjsi

dlee992, Mar 15 '22 06:03

Perhaps we need to generate an op that subclasses mars.core.operand.fuse.Fuse to represent this computation.

Once this op is generated during optimization, it can be handed over to pandas_eval or other JIT engines for execution.

qinxuye, Mar 15 '22 06:03

Hi, I looked into the Mars code that transforms df[sel] into df.eval(sel_expr).

Right now, I am trying to implement a similar transformation, in the following form:

# origin
df2 = df[(df.A > 0.5) & (df.B < 0.5)]

# transformed
bool_series = df.apply(lambda s: (s.A > 0.5) & (s.B < 0.5), axis=1, ...)
df2 = df[bool_series]

I expect our JIT compiler (via the parallel trick in df.apply) to accelerate the computation of bool_series, which should ultimately bring a performance boost. Do you have any thoughts about this?

This transform-to-apply design is kind of tricky.
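To make the transformation concrete, here is a minimal pandas-only sketch (no JIT compiler involved; the data and variable names are assumptions) showing that the row-wise `apply` form produces the same selector as the vectorized form. Without a JIT, `apply` with `axis=1` calls the lambda once per row in Python, so any speedup from this design would have to come from the compiler.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with the columns used in the comment above.
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.random(200), "B": rng.random(200)})

# Original form: vectorized comparisons over whole columns.
mask_vectorized = (df.A > 0.5) & (df.B < 0.5)

# Transformed form: the lambda receives one row at a time (s is a row Series),
# so s.A and s.B are scalars and the result is one boolean per row.
mask_rowwise = df.apply(lambda s: (s.A > 0.5) & (s.B < 0.5), axis=1).astype(bool)

df2_vectorized = df[mask_vectorized]
df2_rowwise = df[mask_rowwise]
```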

dlee992, Mar 22 '22 08:03

I tested the performance of the related code on my local machine.

# float version 
# normal pandas
def _test_pandas(df):
    df1 = df[
        (df.A > 0.5)
        & (df.B > 0.5)
        & (df.C < 0.5)
        & (df.D > 0.5)
        & (df.E > 0.5)
        & (df.F < 0.5)
    ]
    return df1

# df.query
def _test_query(df):
    df1 = df.query(
        "(A > 0.5) & (B > 0.5) & (C < 0.5) & (D > 0.5) & (E > 0.5) & (F < 0.5)"
    )
    return df1

# string version
# normal pandas
def _test_pandas(df):
    df1 = df[
        (df.A == "A")
        & (df.B == "B")
        & (df.C == "C")
        & (df.D == "D")
        & (df.E == "E")
        & (df.F == "F")
    ]
    return df1

# df.query
def _test_query(df):
    a, b, c, d, e, f = "A", "B", "C", "D", "E", "F"
    df1 = df.query(
        "(A == @a) & (B == @b) & (C == @c) & (D == @d) & (E == @e) & (F == @f)"
    )
    return df1

| implementation | input data (1000k rows * 6 cols, float) | input data (1000k rows * 6 cols, string) |
|---|---|---|
| normal pandas | 0.026 | 0.41 |
| df.query | 0.038 | 0.23 |
| jit version of normal pandas | 0.014 | 5.63 |
| jit version of apply+getitem | 2.574 | 11.51 |

All times are in seconds.

From these initial results, my transform-to-apply design does not work: it is slower than normal pandas for both float and string data. Also, our JIT compiler only accelerates numerical comparisons and performs badly on string comparisons. I am investigating the underlying cause.
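For anyone who wants to reproduce the two non-JIT rows of the float column, a timing harness could look roughly like the sketch below. The helper names (`make_float_frame`, `best_time`) and the default frame size are assumptions, not the script used for the numbers above; the JIT variants are omitted because they depend on the compiler under discussion.

```python
import timeit

import numpy as np
import pandas as pd

EXPR = "(A > 0.5) & (B > 0.5) & (C < 0.5) & (D > 0.5) & (E > 0.5) & (F < 0.5)"


def make_float_frame(n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Build an n_rows x 6 frame of uniform floats with columns A..F."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.random((n_rows, 6)), columns=list("ABCDEF"))


def filter_pandas(df: pd.DataFrame) -> pd.DataFrame:
    """Boolean-mask selection, as in the float `_test_pandas` above."""
    return df[
        (df.A > 0.5) & (df.B > 0.5) & (df.C < 0.5)
        & (df.D > 0.5) & (df.E > 0.5) & (df.F < 0.5)
    ]


def filter_query(df: pd.DataFrame) -> pd.DataFrame:
    """Expression-string selection, as in the float `_test_query` above."""
    return df.query(EXPR)


def best_time(func, df: pd.DataFrame, repeat: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(lambda: func(df), number=1, repeat=repeat))


# A smaller frame than the 1000k rows used above, to keep the sketch quick.
df = make_float_frame(100_000)
timings = {
    "normal pandas": best_time(filter_pandas, df),
    "df.query": best_time(filter_query, df),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Passing `1_000_000` to `make_float_frame` matches the row count in the table; absolute numbers will differ by machine and by whether numexpr is installed.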

dlee992, Mar 23 '22 10:03

I'm still following this issue... We need to refactor/enhance the related parts of our JIT compiler first... I will sync when ready.

dlee992, Apr 01 '22 09:04