seaborn
regplot: support confidence intervals with lowess model
Current regplot behavior when lowess=True is to ignore the confidence interval (ci) and bootstrap (n_boot) keyword arguments (example below):
sns.regplot('obs', 'mod', data=data, lowess=True, ci=95, n_boot=1000)
Any thoughts on making it possible to bootstrap the lowess model?
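For reference, "bootstrapping the lowess model" would mean resampling the (x, y) pairs with replacement, refitting the smoother on each resample, and taking percentiles of the fitted curves on a fixed grid. Below is a minimal NumPy-only sketch of that idea. It is not seaborn or statsmodels code: local_linear_smooth is a simplified stand-in for lowess (tricube-weighted local linear fits, no robustness iterations), and both function names, the frac default, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def local_linear_smooth(x, y, x_eval, frac=2 / 3):
    """Simplified lowess stand-in: tricube-weighted local linear fit at each x_eval."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k = max(2, int(np.ceil(frac * x.size)))  # number of neighbours per local fit
    fitted = np.empty(len(x_eval))
    for i, x0 in enumerate(x_eval):
        dist = np.abs(x - x0)
        h = np.partition(dist, k - 1)[k - 1]  # distance to the k-th nearest point
        h = h if h > 0 else 1.0  # guard against a degenerate bandwidth
        w = np.clip(1 - (dist / h) ** 3, 0, None) ** 3  # tricube weights
        sw = np.sqrt(w)
        design = np.column_stack([np.ones_like(x), x - x0])
        coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        fitted[i] = coef[0]  # intercept is the fitted value at x0
    return fitted


def bootstrap_band(x, y, x_eval, n_boot=200, ci=95, seed=0):
    """Percentile bootstrap band: refit the smoother on resampled (x, y) pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    boots = np.empty((n_boot, len(x_eval)))
    for b in range(n_boot):
        idx = rng.integers(0, x.size, size=x.size)  # resample pairs with replacement
        boots[b] = local_linear_smooth(x[idx], y[idx], x_eval)
    lo, hi = np.percentile(boots, [50 - ci / 2, 50 + ci / 2], axis=0)
    return lo, hi


# Synthetic data, purely for illustration
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 150)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
grid = np.linspace(x.min(), x.max(), 50)

fit = local_linear_smooth(x, y, grid)
lo, hi = bootstrap_band(x, y, grid, n_boot=100)
```

The cost scales with n_boot times the cost of one fit, which is why the number of resamples matters so much for a smoother that is refit from scratch each time.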
Hello: First of all I would like to congratulate the authors on a wonderful library. May I ask what your plans regarding this question are? I am using the most up-to-date seaborn version through conda. Thanks again for a beautiful tool, Markus
Is there interest in adding confidence intervals to the lowess fits? I find them informative when looking at data. In the past, I have used the skmisc.loess library to achieve this, but it would be nice to incorporate it into seaborn. After digging around in the issues, it seems like this hasn't been incorporated because of slow bootstrapping performance in statsmodels.
Bootstrapping the loess fits was not performant enough in the testing that was done when this came up previously.
I would strongly prefer not to add another dependency, so the best path forward would be to incorporate the confidence interval code from that library into statsmodels.
Another possibility would be to add smooth regression using statsmodels GAMs rather than loess.
That makes sense to me. I will look into those options and can open this up again if I find anything useful.
See also some related comments here: https://github.com/mwaskom/seaborn/issues/2351
Any path forward on lowess/smoothfit improvements will likely take the form of a new dedicated function that makes it easier to parameterize and use a different default approach to error bars ... too much packed into regplot at the moment.
Is there any update on confidence intervals for loess or another smooth regression in regplot?
Looking through the codebase, it is easy to see how to use the same tools to enable this bootstrap functionality:
import numpy as np
import seaborn as sns
from statsmodels.nonparametric.smoothers_lowess import lowess

def regplot_lowess_ci(data, x, y, ci_level, n_boot, **kwargs):
    x_ = data[x].to_numpy()
    y_ = data[y].to_numpy()
    x_grid = np.linspace(start=x_.min(), stop=x_.max(), num=1000)

    def reg_func(_x, _y):
        return lowess(exog=_x, endog=_y, xvals=x_grid)

    beta_boots = sns.algorithms.bootstrap(
        x_, y_,
        func=reg_func,
        n_boot=n_boot,
    )
    err_bands = sns.utils.ci(beta_boots, ci_level, axis=0)
    y_plt = reg_func(x_, y_)

    ax = sns.lineplot(x=x_grid, y=y_plt, **kwargs)
    sns.scatterplot(x=x_, y=y_, ax=ax, **kwargs)
    ax.fill_between(x_grid, *err_bands, alpha=.15, **kwargs)
    return ax

mpg_df = sns.load_dataset('mpg')
ax = regplot_lowess_ci(mpg_df, x='mpg', y='acceleration', ci_level=99, n_boot=100)
ax.figure.show()
Then, it can also be easily used with FacetGrid to replicate lmplot:
mpg_df.eval('heavy = weight>2803', inplace=True)
grid = sns.FacetGrid(mpg_df, col='heavy', sharex=False)
grid.map_dataframe(regplot_lowess_ci, x='mpg', y='acceleration', ci_level=99, n_boot=100)
grid.figure.show()
@mwaskom on my machine this runs in seconds, while sns.scatterplot regularly chokes on a few thousand data points. I am not sure "users might be waiting for a long time for plotting to finish" is a good argument to disable the ci option for lowess altogether.
Imo, a warning, or a default n_boot=100 if lowess==True would be a better solution.
I'd be willing to incorporate the above workaround into the actual codebase, e.g. adapt this function https://github.com/mwaskom/seaborn/blob/master/seaborn/regression.py#L296
And change this line to a warning or a defaults handler: https://github.com/mwaskom/seaborn/blob/master/seaborn/regression.py#L211
and submit a PR.
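To make the "warning or defaults handler" suggestion concrete, here is a rough sketch of what such a handler could look like. This is not existing seaborn code: resolve_n_boot and the two constants are hypothetical names, and using None as a "not set by the user" sentinel is an assumption. The value 100 comes from the suggestion above, and 1000 matches regplot's documented default for n_boot.

```python
import warnings

# Hypothetical defaults: 100 is the value suggested in this thread,
# 1000 mirrors regplot's documented n_boot default.
LOWESS_N_BOOT = 100
DEFAULT_N_BOOT = 1000


def resolve_n_boot(lowess, ci, n_boot=None):
    """Pick the number of bootstrap resamples, warning instead of dropping ci."""
    if n_boot is None:
        # The user did not set n_boot: use a smaller default for lowess fits
        n_boot = LOWESS_N_BOOT if lowess else DEFAULT_N_BOOT
    if lowess and ci is not None:
        warnings.warn(
            f"Bootstrapping a lowess fit with n_boot={n_boot} may be slow; "
            "pass ci=None to skip the confidence interval.",
            UserWarning,
            stacklevel=2,
        )
    return n_boot
```

With this shape, an explicit n_boot from the user is always respected, and the only behavior change for lowess=True is a smaller default plus a warning.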