Performance regression in `lineplot` with `ci=None` (or `errorbar=None`)
Hello,
I have noticed that seaborn-0.12.0 takes much more time (see the numbers below) to create a lineplot
without error bars than seaborn-0.11.2 used to.
The following code snippet demonstrates the problem:
import numpy as np
import seaborn as sns
from pandas import DataFrame
from matplotlib import pyplot as plt
N = 1000000
x = np.random.randint(0, high=10000000000, size=N)
y = np.random.random(N)
df = DataFrame({"x": x, "y": y})
print(f"Using seaborn-{sns.__version__}:")
plot = sns.lineplot(data=df, x="x", y="y", ci=None)
Measuring the runtime using seaborn-0.11.2 one gets:
$ time python devel/examples/seaborn/slow_lineplot.py
Using seaborn-0.11.2:
real 0m3,046s
user 0m3,302s
sys 0m1,603s
But with seaborn-0.12.0:
$ time python devel/examples/seaborn/slow_lineplot.py
Using seaborn-0.12.0:
/home/palao/devel/examples/seaborn/slow_lineplot.py:35: FutureWarning:
The `ci` parameter is deprecated. Use `errorbar=None` for the same effect.
plot = sns.lineplot(
real 7m9,201s
user 7m7,049s
sys 0m2,920s
Note: replacing `ci=None` with `errorbar=None` has no impact on the performance of seaborn-0.12.0 (and, obviously, `errorbar` is not a valid keyword parameter in seaborn-0.11.2).
After inspecting the code, it seems to me that even if `ci=None` (or `errorbar=None`) is set in version 0.12.0, some kind of error bars are computed, and that has a huge impact on performance when the plot has many points. In a real-world case, the runtime to produce the plot was around one hour, almost 100 times more than it used to be with seaborn-0.11.2.
I'm using CPython-3.9.13 and matplotlib-3.5.1.
My only solution for now is to pin seaborn==0.11.2.
Is this a real bug? Or am I doing/understanding something wrong?
I can reproduce the perf regression here, but I don't think the issue is the `errorbar` implementation per se (that is, the problem is not that error bars are being computed even though you asked for none). Instead I think it's changes to the aggregation logic that support the more generic errorbar interface. If I change your example to `N = 10_000` (which is more manageable) I get
%time plot = sns.lineplot(data=df, x="x", y="y", errorbar=None)
CPU times: user 3.78 s, sys: 55.6 ms, total: 3.83 s
Wall time: 3.98 s
But if I use `estimator=None` instead of `errorbar=None` I get
%time plot = sns.lineplot(data=df, x="x", y="y", estimator=None)
CPU times: user 72.3 ms, sys: 5.93 ms, total: 78.2 ms
Wall time: 88.1 ms
So my guess is that the new implementation is evaluating the estimator function for each unique `x` value, whereas before it may have been short-circuiting. We should try to address the performance regression, but I think you want to pass `estimator=None` either way.
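To make that guess concrete, here is a standalone sketch (not seaborn code; the size and column names are arbitrary) comparing pandas' vectorized groupby reduction with a per-group Python-level apply, which pays function-call overhead once per unique x value:

import time
import numpy as np
import pandas as pd

N = 10_000
# Every x is unique, so grouping by x yields one single-row group per row,
# the worst case for per-group Python overhead.
df = pd.DataFrame({"x": np.arange(N), "y": np.random.random(N)})

t0 = time.perf_counter()
df.groupby("x")["y"].agg("mean")                # vectorized Cython path
t1 = time.perf_counter()
df.groupby("x")["y"].apply(lambda s: s.mean())  # Python call per group
t2 = time.perf_counter()
print(f"vectorized: {t1 - t0:.4f}s, per-group apply: {t2 - t1:.4f}s")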
Learning!
@mwaskom, thank you for the suggestion! And thank you for the great work with seaborn.
Some insight from profiling:
Nearly all of the time spent in `_LinePlotter.plot` is spent doing the aggregation:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
378 def plot(self, ax, kws):
...
430 1 0.0 0.0 0.0 if self.estimator is not None:
431 1 0.0 0.0 0.0 if "units" in self.variables:
432 # TODO eventually relax this constraint
433 err = "estimator must be None when specifying units"
434 raise ValueError(err)
435 1 0.1 0.1 0.0 grouped = sub_data.groupby(orient, sort=self.sort)
436 # Could pass as_index=False instead of reset_index,
437 # but that fails on a corner case with older pandas.
438 1 2997.2 2997.2 99.0 sub_data = grouped.apply(agg, other).reset_index()
And then within that method, most of the time is actually spent ... wrapping the outputs in a pandas `Series`:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
480 def __call__(self, data, var):
481 """Aggregate over `var` column of `data` with estimate and error interval."""
482 10000 280.8 0.0 9.1 vals = data[var]
483 10000 3.4 0.0 0.1 if callable(self.estimator):
484 # You would think we could pass to vals.agg, and yet:
485 # https://github.com/mwaskom/seaborn/issues/2943
486 estimate = self.estimator(vals)
487 else:
488 10000 1332.1 0.1 43.0 estimate = vals.agg(self.estimator)
489
490 # Options that produce no error bars
491 10000 3.1 0.0 0.1 if self.error_method is None:
492 10000 3.6 0.0 0.1 err_min = err_max = np.nan
...
516 10000 1476.6 0.1 47.6 return pd.Series({var: estimate, f"{var}min": err_min, f"{var}max": err_max})
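To get a feel for how much the `Series` wrapping alone costs, here is a standalone sketch (an illustration, not a proposed patch) that times constructing one `pd.Series` per group, as the profiled `__call__` does on line 516 above, against collecting plain tuples and building a single frame at the end:

import time
import numpy as np
import pandas as pd

n_groups = 10_000
estimates = np.random.random(n_groups)

t0 = time.perf_counter()
# One pd.Series constructed per group, mirroring the profiled return.
rows = [pd.Series({"y": v, "ymin": np.nan, "ymax": np.nan}) for v in estimates]
t1 = time.perf_counter()
# Plain tuples per group, assembled into a DataFrame once at the end.
table = pd.DataFrame(
    [(v, np.nan, np.nan) for v in estimates],
    columns=["y", "ymin", "ymax"],
)
t2 = time.perf_counter()
print(f"Series per group: {t1 - t0:.3f}s, tuples + one DataFrame: {t2 - t1:.3f}s")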
I think the solution here is just to check whether any aggregation will happen and skip this entire step if it won't. That's implemented in #3081.
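Roughly, the idea is something like the following sketch, reusing the names from the profiled `plot` method above (`sub_data`, `orient`, `agg`, `other`). This is a hypothetical illustration of the check, not necessarily the exact code in #3081:

import pandas as pd

def maybe_aggregate(sub_data: pd.DataFrame, orient: str, agg, other) -> pd.DataFrame:
    # Hypothetical helper: only run the slow groupby-apply when some value
    # in the orient column actually repeats.
    grouped = sub_data.groupby(orient, sort=True)
    if len(sub_data) == grouped.ngroups:
        # Every group has exactly one row, so the estimator would just
        # return each row's own value; skip the expensive apply. (The real
        # fix still has to provide the min/max columns downstream code expects.)
        return sub_data
    return grouped.apply(agg, other).reset_index()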
There are probably still some edge cases here ... e.g. if you have many unique `x` values but a few repeats, it will get past the check and then probably run significantly slower than in v0.11. But that sounds like a situation where `estimator=None` is the desired parameterization anyway (and unlike here, results would be incorrect when it is not used). So I can live with that edge case, but someone should feel free to ping if they have a compelling application that is not addressed by this fix.
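For concreteness, a sketch of that edge case (made-up data): a single repeated x value among many unique ones would defeat an "all groups have size one" check and send the data down the slow aggregation path, while `estimator=None` sidesteps aggregation entirely.

import numpy as np
import pandas as pd
import seaborn as sns

N = 10_000
x = np.arange(N)
x[-1] = x[0]  # one duplicate: the x values are no longer all unique
df = pd.DataFrame({"x": x, "y": np.random.random(N)})

# Takes the aggregation path because of the single duplicate:
sns.lineplot(data=df, x="x", y="y", errorbar=None)

# Draws every observation as-is and skips aggregation entirely:
sns.lineplot(data=df, x="x", y="y", estimator=None)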