modin
`df.apply()` is much faster on DataFrame than Series
The error checking done on DataFrame has been improved for `df.apply()`, but that is not the case in Series.
@dchigarev I remember you did something to improve the performance of `DataFrame.apply`. Can a similar approach be used with `Series.apply`?
Sorry for the delayed response...
The only thing I remember that was done recently specifically for `DataFrame.apply` is this PR (#3746); however, it shouldn't affect performance in any way.
I've been able to reproduce the `Series.apply` performance problem. The implementation has 3 different code paths: string reduction functions, `np.ufunc` functions, and arbitrary lambdas. 2 of these paths (`np.ufunc` and lambdas) are affected by poor performance.
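To make the three paths concrete, here is a minimal sketch of how an argument passed to `apply` could be sorted into them. `classify_apply_func` is a hypothetical helper written only for illustration; it is not Modin's actual dispatch code, and the real checks may differ.

```python
import numpy as np


def classify_apply_func(func):
    """Hypothetical helper: name the code path a given argument would take."""
    if isinstance(func, str):
        # A string such as "count" names a reduction.
        return "reduction"
    if isinstance(func, np.ufunc):
        # NumPy universal functions such as np.sqrt.
        return "np.ufunc"
    if callable(func):
        # Any other callable, e.g. a user-defined lambda.
        return "lambda"
    raise TypeError(f"unsupported argument for apply: {func!r}")


for func in ("count", np.sqrt, lambda val: val ** 2):
    print(classify_apply_func(func))
```

The order of the checks matters in this sketch: `np.sqrt` is also callable, so the `np.ufunc` test has to come before the generic `callable` test.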
Reproducer
import modin.pandas as pd
import numpy as np
import timeit

from asv_bench.benchmarks.utils.common import execute

NROWS = 10_000_000
NREPEAT = 30

series_col = pd.Series(np.arange(NROWS))
df_col = pd.DataFrame({"col0": np.arange(NROWS)})

funcs = {
    "reduction": "count",
    "np.ufunc": np.sqrt,
    "lambda": lambda val: val ** 2,
}

def series_apply_no_materialization(fn):
    return lambda: series_col.apply(fn)

def series_apply_with_materialization(fn):
    def func():
        res = series_col.apply(fn)
        if hasattr(res, "_query_compiler"):
            execute(res)
    return func

def df_apply_no_materialization(fn):
    return lambda: df_col.apply(fn)

def df_apply_with_materialization(fn):
    return lambda: execute(df_col.apply(fn))

for name, fn in funcs.items():
    print(f"\n===== {name} =====")
    print("Calling 'apply' WITH materialization")
    print("\tDataFrame.apply: ", timeit.timeit(df_apply_with_materialization(fn), number=NREPEAT))
    print("\tSeries.apply: ", timeit.timeit(series_apply_with_materialization(fn), number=NREPEAT))
    print("\nCalling 'apply' WITHOUT materialization")
    print("\tDataFrame.apply: ", timeit.timeit(df_apply_no_materialization(fn), number=NREPEAT))
    print("\tSeries.apply: ", timeit.timeit(series_apply_no_materialization(fn), number=NREPEAT))
Intel Core i7-1185G7 (8 cores), Windows, PandasOnRay execution:
===== reduction =====
Calling 'apply' WITH materialization
DataFrame.apply: 1.137625400000001
Series.apply: 0.9146318999999998
Calling 'apply' WITHOUT materialization
DataFrame.apply: 0.2765999000000008
Series.apply: 0.9068019000000014 # reduction for Series returns a scalar, so it
# always materializes the result
===== np.ufunc =====
Calling 'apply' WITH materialization
DataFrame.apply: 2.9706434999999978
Series.apply: 5.099930999999998 # `Series.__array_wrap__` defaulting to pandas
Calling 'apply' WITHOUT materialization
DataFrame.apply: 2.862070100000004
Series.apply: 4.849657300000004 # `Series.__array_wrap__` defaulting to pandas
===== lambda =====
Calling 'apply' WITH materialization
DataFrame.apply: 2.1830063000000024
Series.apply: 105.66017339999999 # all cores are being 100% loaded during these 100+ seconds
Calling 'apply' WITHOUT materialization
DataFrame.apply: 3.205889300000024
Series.apply: 110.04947299999998 # all cores are being 100% loaded during these 100+ seconds
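As a quick way to read these numbers, the sketch below computes the `Series.apply` / `DataFrame.apply` ratio for each case. The timings are copied from the "WITH materialization" rows above (cumulative seconds over `NREPEAT = 30` runs); the variable names are only illustrative.

```python
# (DataFrame.apply, Series.apply) cumulative timings in seconds,
# copied from the "WITH materialization" rows of the output above.
timings = {
    "reduction": (1.137625400000001, 0.9146318999999998),
    "np.ufunc": (2.9706434999999978, 5.099930999999998),
    "lambda": (2.1830063000000024, 105.66017339999999),
}

# Ratio > 1 means Series.apply is slower than DataFrame.apply.
slowdown = {name: series / df for name, (df, series) in timings.items()}

for name, ratio in slowdown.items():
    print(f"{name}: Series.apply takes {ratio:.2f}x the DataFrame.apply time")
```

Because both timings in each pair cover the same number of runs, the ratio is the same whether it is computed on cumulative or per-call times.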
- Reduction: seems to be working fine, with no warnings, and the perf is the same as for `df.apply`. The only difference is that `Series.apply` does not support non-materialization mode since it returns a scalar.
- np.ufunc: ufuncs are applied directly to a `modin.pandas.Series` object, and this causes a fallback to pandas (NumPy tries to access the `__array_wrap__` Series attribute, which is defaulting to pandas): https://github.com/modin-project/modin/blob/ee2440c53a1e3bd47736776e7c643f05c4a0db70/modin/pandas/series.py#L709-L710
- Lambdas: the least explored case for now. The execution goes straight to the `Series.map` branch, which is a pure map function that should scale perfectly, but although it utilizes all the cores, the performance is really poor. Investigating this case... https://github.com/modin-project/modin/blob/ee2440c53a1e3bd47736776e7c643f05c4a0db70/modin/pandas/series.py#L1282-L1288
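For context on why a pure map is expected to scale, here is a minimal sketch of an element-wise map over independent partitions. It is an illustration under stated assumptions, not Modin's implementation: `partitioned_map` is a hypothetical name, and a thread pool is used only to keep the example self-contained (the runs above used PandasOnRay, which dispatches work to Ray workers).

```python
from concurrent.futures import ThreadPoolExecutor


def partitioned_map(func, values, n_partitions):
    """Apply func to every element, one task per partition (illustrative only)."""
    size = max(1, -(-len(values) // n_partitions))  # ceiling division
    partitions = [values[i:i + size] for i in range(0, len(values), size)]

    def apply_to_partition(partition):
        # Each partition is processed independently of the others.
        return [func(value) for value in partition]

    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        # pool.map yields results in the order the partitions were submitted.
        mapped = pool.map(apply_to_partition, partitions)
        return [item for part in mapped for item in part]


squares = partitioned_map(lambda val: val ** 2, list(range(10)), n_partitions=4)
print(squares)
```

No partition needs data from any other, so in principle the work divides evenly across workers; that is why full core utilization combined with poor throughput looks suspicious.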
As far as I understand, the issue is related to poor perf of Series.apply(
===== lambda =====
Calling 'apply' WITH materialization
DataFrame.apply: 4.635541757976171 # note that we submit tasks to Ray workers for the first time so there is a slowdown for the first operation
Series.apply: 0.3690038549830206
Calling 'apply' WITHOUT materialization
DataFrame.apply: 0.3383786199847236
Series.apply: 0.11869432602543384
@dchigarev, could you check if the issue persists on your side?
@dchigarev, note that using `number=NREPEAT` you get the cumulative time of `NREPEAT` runs. I used `NREPEAT=1`.
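A small sketch of that point, using only the standard library: `timeit.timeit(..., number=N)` returns the total time for all `N` calls, so a per-call figure requires dividing by `N`. The `work` function and the value of `NREPEAT` here are placeholders, not part of the reproducer.

```python
import time
import timeit

NREPEAT = 5


def work():
    time.sleep(0.01)  # stand-in for the operation being measured


# timeit returns the cumulative time of all NREPEAT calls, not an average.
total = timeit.timeit(work, number=NREPEAT)
per_call = total / NREPEAT

print(f"total: {total:.3f}s, per call: {per_call:.3f}s")
```

When comparing results from runs that used different `number` values, compare the per-call figures rather than the raw totals.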
@dchigarev, can you revisit this?