`df.apply()` is much faster on DataFrame than Series

Open naren-ponder opened this issue 3 years ago • 5 comments

The error checking in df.apply() on a DataFrame has been improved, but the same has not been done for Series.apply().

naren-ponder avatar Feb 23 '22 21:02 naren-ponder

@dchigarev I remember you did something to improve performance of DataFrame.apply. Can a similar approach be used with Series.apply?

devin-petersohn avatar Feb 23 '22 21:02 devin-petersohn

Sorry for the delayed response.

The only recent change I remember that specifically targeted DataFrame.apply is this PR (#3746); however, it shouldn't affect performance.

I've been able to reproduce the Series.apply performance problem. The implementation has three code paths: string reduction functions, np.ufunc functions, and arbitrary lambdas. Two of these paths (np.ufunc and lambdas) are affected by poor performance.

Reproducer
import modin.pandas as pd
import numpy as np
import timeit

from asv_bench.benchmarks.utils.common import execute

NROWS = 10_000_000
NREPEAT = 30

series_col = pd.Series(np.arange(NROWS))
df_col = pd.DataFrame({"col0": np.arange(NROWS)})

funcs = {
    "reduction": "count",
    "np.ufunc": np.sqrt,
    "lambda": lambda val: val ** 2,
}

def series_apply_no_materialization(fn):
    return lambda: series_col.apply(fn)

def series_apply_with_materialization(fn):
    def func():
        res = series_col.apply(fn)
        if hasattr(res, "_query_compiler"):
            execute(res)
    
    return func

def df_apply_no_materialization(fn):
    return lambda: df_col.apply(fn)

def df_apply_with_materialization(fn):
    return lambda: execute(df_col.apply(fn))

for name, fn in funcs.items():
    print(f"\n===== {name} =====")
    print("Calling 'apply' WITH materialization")
    print("\tDataFrame.apply: ", timeit.timeit(df_apply_with_materialization(fn), number=NREPEAT))
    print("\tSeries.apply: ", timeit.timeit(series_apply_with_materialization(fn), number=NREPEAT))

    print("\nCalling 'apply' WITHOUT materialization")
    print("\tDataFrame.apply: ", timeit.timeit(df_apply_no_materialization(fn), number=NREPEAT))
    print("\tSeries.apply: ", timeit.timeit(series_apply_no_materialization(fn), number=NREPEAT))

Intel Core i7-1185G7 (8 cores), Windows, PandasOnRay execution:

===== reduction =====
Calling 'apply' WITH materialization
        DataFrame.apply:  1.137625400000001
        Series.apply:  0.9146318999999998

Calling 'apply' WITHOUT materialization
        DataFrame.apply:  0.2765999000000008
        Series.apply:  0.9068019000000014 # reduction for Series returns a scalar, so it always materializes the result

===== np.ufunc =====
Calling 'apply' WITH materialization
        DataFrame.apply:  2.9706434999999978
        Series.apply:  5.099930999999998 # `Series.__array_wrap__` defaulting to pandas

Calling 'apply' WITHOUT materialization
        DataFrame.apply:  2.862070100000004
        Series.apply:  4.849657300000004 # `Series.__array_wrap__` defaulting to pandas

===== lambda =====
Calling 'apply' WITH materialization
        DataFrame.apply:  2.1830063000000024
        Series.apply:  105.66017339999999 # all cores are being 100% loaded during these 100+ seconds

Calling 'apply' WITHOUT materialization
        DataFrame.apply:  3.205889300000024
        Series.apply:  110.04947299999998 # all cores are being 100% loaded during these 100+ seconds
  • Reduction: seems to work fine, with no warnings, and the performance is comparable to df.apply. The only difference is that Series.apply cannot skip materialization, since it returns a scalar.
  • np.ufunc: the ufunc is applied directly to the modin.pandas.Series object, which causes a fallback to pandas (NumPy accesses the Series' __array_wrap__ attribute, which defaults to pandas): https://github.com/modin-project/modin/blob/ee2440c53a1e3bd47736776e7c643f05c4a0db70/modin/pandas/series.py#L709-L710
  • Lambdas: the least explored case so far. Execution goes straight to the Series.map branch, a pure map function that should scale perfectly; yet although it utilizes all the cores, the performance is very poor. Still investigating this case: https://github.com/modin-project/modin/blob/ee2440c53a1e3bd47736776e7c643f05c4a0db70/modin/pandas/series.py#L1282-L1288
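
To illustrate the np.ufunc point, here is a minimal sketch of how NumPy consults `__array_wrap__` when a ufunc is called on a container that is not an ndarray. The `Wrapper` class is a hypothetical stand-in written for this example; it is not Modin code and only mimics the general shape of the hook. The idea is that whatever `__array_wrap__` does (in Modin's case, defaulting to pandas) runs on every such call, whereas applying the ufunc to the underlying array skips the hook.

```python
import numpy as np


class Wrapper:
    """Hypothetical Series-like container (NOT Modin code), used only to
    show when NumPy calls the array hooks."""

    calls = []  # class-level log of hook invocations

    def __init__(self, values):
        self.values = np.asarray(values)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this to convert the container into an ndarray.
        Wrapper.calls.append("__array__")
        return self.values if dtype is None else self.values.astype(dtype)

    def __array_wrap__(self, arr, context=None, return_scalar=False):
        # NumPy calls this to wrap the ufunc result back into the container.
        Wrapper.calls.append("__array_wrap__")
        return Wrapper(arr)


w = Wrapper([1.0, 4.0, 9.0])

direct = np.sqrt(w)             # ufunc sees the container: hooks are used
on_values = np.sqrt(w.values)   # ufunc sees a plain ndarray: no hooks

print(type(direct).__name__, type(on_values).__name__)
print(Wrapper.calls)
```

If the hook itself is expensive, as a fallback to pandas would be, the cost is paid on every ufunc call made directly on the container.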

dchigarev avatar Mar 11 '22 15:03 dchigarev

As far as I understand, the issue concerns the poor performance of Series.apply(). However, I do not see any timings comparable to @dchigarev's measurements for this case. My numbers are as follows:

===== lambda =====
Calling 'apply' WITH materialization
        DataFrame.apply:  4.635541757976171 # note that we submit tasks to Ray workers for the first time so there is a slowdown for the first operation
        Series.apply:  0.3690038549830206

Calling 'apply' WITHOUT materialization
        DataFrame.apply:  0.3383786199847236
        Series.apply:  0.11869432602543384

@dchigarev, could you check if the issue persists on your side?

YarShev avatar Feb 07 '23 20:02 YarShev

@dchigarev, note that with number=NREPEAT, timeit returns the cumulative time of NREPEAT runs. I used NREPEAT=1.
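
To make the two sets of numbers comparable, one option is to report per-call time and to run a warm-up call before timing, so that first-call overhead (such as the initial task submission to Ray workers mentioned above) is not counted. The sketch below uses only the standard library; `per_call_seconds` and `work` are hypothetical names introduced for illustration, and `work` is just a placeholder workload.

```python
import timeit

calls = 0


def work():
    """Placeholder workload; stands in for something like series.apply(fn)."""
    global calls
    calls += 1
    sum(range(1000))


def per_call_seconds(fn, number, warmup=1):
    """Hypothetical helper: run `warmup` untimed calls, then return the
    average time per call over `number` timed calls."""
    for _ in range(warmup):
        fn()
    # timeit.timeit returns the TOTAL time for `number` executions.
    total = timeit.timeit(fn, number=number)
    return total / number


avg = per_call_seconds(work, number=30)
print(f"average per call: {avg:.6f} s over 30 runs ({calls} calls incl. warm-up)")

# Normalizing a cumulative figure reported earlier in this thread
# (Series.apply with a lambda, WITH materialization, NREPEAT = 30):
reported_total = 105.66017339999999
per_call_reported = reported_total / 30  # roughly 3.5 s per call
print(f"reported per call: {per_call_reported:.2f} s")
```

Even after dividing by 30, the earlier lambda measurement is roughly 3.5 s per call, so the cumulative timing alone does not account for the whole gap between the two sets of results.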

YarShev avatar Feb 08 '23 20:02 YarShev

@dchigarev, can you revisit this?

YarShev avatar Jan 19 '24 16:01 YarShev