
Performance regression in `lineplot` with `ci=None` (or `errorbar=None`)


Hello, I have noticed that seaborn-0.12.0 takes much more time (see the numbers below) to create a lineplot without error bars than seaborn-0.11.2 used to.

The following code snippet demonstrates the problem:

import numpy as np
import seaborn as sns
from pandas import DataFrame
from matplotlib import pyplot as plt

N = 1000000
x = np.random.randint(0, high=10000000000, size=N)
y = np.random.random(N)
df = DataFrame({"x": x, "y": y})

print(f"Using seaborn-{sns.__version__}:")

plot = sns.lineplot(data=df, x="x", y="y", ci=None)

Measuring the runtime using seaborn-0.11.2 one gets:

$ time python devel/examples/seaborn/slow_lineplot.py 
Using seaborn-0.11.2:

real    0m3,046s
user    0m3,302s
sys     0m1,603s

But with seaborn-0.12.0:

$ time python devel/examples/seaborn/slow_lineplot.py 
Using seaborn-0.12.0:
/home/palao/devel/examples/seaborn/slow_lineplot.py:35: FutureWarning: 

The `ci` parameter is deprecated. Use `errorbar=None` for the same effect.

  plot = sns.lineplot(

real    7m9,201s
user    7m7,049s
sys     0m2,920s

Note: replacing ci=None with errorbar=None has no impact on the performance of seaborn-0.12.0 (and obviously errorbar is not a valid keyword parameter in seaborn-0.11.2).

After inspecting the code, it seems to me that even if ci=None (or errorbar=None) is set in version 0.12.0, some kind of error bars are still computed, and that has a huge impact on performance when the plot has many points. In a real-world plot, the runtime was around one hour, almost 100 times more than it used to be with seaborn-0.11.2.

I'm using CPython-3.9.13 and matplotlib-3.5.1.

My only solution for now is to pin seaborn==0.11.2.

Is this a real bug? Or am I doing/understanding something wrong?

palao avatar Sep 08 '22 12:09 palao

I can reproduce the perf regression here, but I don't think the issue is the errorbar implementation per se (that is, it isn't computing error bars when you've asked it not to). Instead, I think it's the changes to the aggregation logic that support the more generic errorbar interface. If I change your example to N = 10_000 (which is more manageable) I get

%time plot = sns.lineplot(data=df, x="x", y="y", errorbar=None)
CPU times: user 3.78 s, sys: 55.6 ms, total: 3.83 s
Wall time: 3.98 s

But if I do estimator=None instead of errorbar=None I get

%time plot = sns.lineplot(data=df, x="x", y="y", estimator=None)
CPU times: user 72.3 ms, sys: 5.93 ms, total: 78.2 ms
Wall time: 88.1 ms

So my guess is that the new implementation is evaluating the estimator function for each unique x value, whereas before it may have been short-circuiting.
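
To illustrate that guess with a minimal standalone sketch (plain pandas, not seaborn's internals; the sizes, seed, and column names just mirror the N = 10_000 case above), compare a per-group Python callable that wraps its result in a Series with a plain vectorized aggregation:

import time

import numpy as np
import pandas as pd

# ~10,000 rows with mostly unique x values, as in the example above
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.integers(0, 10_000_000_000, 10_000),
                   "y": rng.random(10_000)})

def agg(group):
    # One Python call plus one Series construction per group,
    # even though no error bars are requested.
    return pd.Series({"y": group["y"].mean(), "ymin": np.nan, "ymax": np.nan})

t0 = time.perf_counter()
slow = df.groupby("x", sort=True)[["y"]].apply(agg).reset_index()  # per-group Python work
t1 = time.perf_counter()
fast = df.groupby("x", sort=True)["y"].mean().reset_index()        # vectorized aggregation
t2 = time.perf_counter()
print(f"per-group apply: {t1 - t0:.2f}s  vectorized: {t2 - t1:.3f}s")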

We should try to address the performance regression, but I think you want to put estimator=None either way.
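
Applied to the snippet from the report, that would look like the line below. Note that with estimator=None seaborn draws every observation rather than a per-x aggregate, which is also what you want when the x values are essentially all unique:

# Skip the aggregation machinery entirely: each (x, y) observation is drawn as-is,
# so no per-group estimate or error interval is ever computed.
plot = sns.lineplot(data=df, x="x", y="y", estimator=None)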

mwaskom avatar Sep 08 '22 23:09 mwaskom

Learning from this!

jianghui1229 avatar Sep 09 '22 07:09 jianghui1229

@mwaskom , thank you for the suggestion! And thank you for the great work with seaborn.

palao avatar Sep 12 '22 11:09 palao

Some insight from profiling:

Nearly all of the time spent in _LinePlotter.plot is spent doing the aggregation:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   378                                               def plot(self, ax, kws):
   ...
   430         1          0.0      0.0      0.0              if self.estimator is not None:
   431         1          0.0      0.0      0.0                  if "units" in self.variables:
   432                                                               # TODO eventually relax this constraint
   433                                                               err = "estimator must be None when specifying units"
   434                                                               raise ValueError(err)
   435         1          0.1      0.1      0.0                  grouped = sub_data.groupby(orient, sort=self.sort)
   436                                                           # Could pass as_index=False instead of reset_index,
   437                                                           # but that fails on a corner case with older pandas.
   438         1       2997.2   2997.2     99.0                  sub_data = grouped.apply(agg, other).reset_index()

And then within that method, most of the time is actually spent ... wrapping the outputs in a pandas Series:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   480                                               def __call__(self, data, var):
   481                                                   """Aggregate over `var` column of `data` with estimate and error interval."""
   482     10000        280.8      0.0      9.1          vals = data[var]
   483     10000          3.4      0.0      0.1          if callable(self.estimator):
   484                                                       # You would think we could pass to vals.agg, and yet:
   485                                                       # https://github.com/mwaskom/seaborn/issues/2943
   486                                                       estimate = self.estimator(vals)
   487                                                   else:
   488     10000       1332.1      0.1     43.0              estimate = vals.agg(self.estimator)
   489                                           
   490                                                   # Options that produce no error bars
   491     10000          3.1      0.0      0.1          if self.error_method is None:
   492     10000          3.6      0.0      0.1              err_min = err_max = np.nan
   ...
   516     10000       1476.6      0.1     47.6          return pd.Series({var: estimate, f"{var}min": err_min, f"{var}max": err_max})
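
As a rough standalone check (numbers are machine-dependent, not taken from the issue), wrapping three scalars in a Series has a fixed per-call cost that adds up quickly once it happens for every one of the 10,000 groups:

import timeit

import numpy as np
import pandas as pd

# Approximate per-call cost of the Series wrapping on the profiled line above.
n = 10_000
total = timeit.timeit(
    lambda: pd.Series({"y": 0.5, "ymin": np.nan, "ymax": np.nan}),
    number=n,
)
print(f"{total / n * 1e6:.0f} µs per call, ~{total:.1f}s across {n} groups")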

I think the solution here is just to check whether any aggregation will happen and skip this entire step if it won't. That's implemented in #3081.
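
For reference, the idea could be sketched roughly as follows (a hypothetical helper, not the actual patch in #3081; the real fix presumably also has to account for the error-interval columns that downstream code expects):

import pandas as pd

def aggregate_if_needed(sub_data: pd.DataFrame, orient: str, agg) -> pd.DataFrame:
    # If every value of the orienting column is unique, grouping cannot combine
    # any rows, so the expensive per-group apply can be skipped entirely.
    if sub_data[orient].nunique() == len(sub_data):
        return sub_data
    grouped = sub_data.groupby(orient, sort=True)
    return grouped.apply(agg).reset_index()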

There are probably still some edge cases here ... e.g. if you have many unique x values but also a few repeats, it will get past the check and then probably run significantly slower than in v0.11. But that sounds like a situation where estimator=None is the desired parameterization anyway (and, unlike here, the results would be incorrect if it were not used). So I can live with that edge case, but someone should feel free to ping if they have a compelling application that is not addressed by this fix.

mwaskom avatar Oct 13 '22 00:10 mwaskom