lmdiag
Slow for large datasets
Thank you for creating a nice package. It is very handy to be able to install this functionality with pip.
However, I find this package rather slow for large datasets compared with the LinearRegDiagnostic class described in the "Linear regression diagnostics" example in the statsmodels documentation: in the example below, lmdiag takes about 16 s of wall time versus about 2 s for LinearRegDiagnostic. This may indicate some inefficiency in the package.
Example:
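import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import lmdiag
# Load the Ames housing data and fit an OLS model on log-transformed price
df = sm.datasets.get_rdataset("ames", "openintro").data
res = smf.ols("np.log10(price) ~ Q('Overall.Qual') + np.log(area)", df).fit()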
lmdiag
%%time
lmdiag.plot(res)
CPU times: user 15.1 s, sys: 215 ms, total: 15.3 s
Wall time: 16.1 s
LinearRegDiagnostic
%%time
LinearRegDiagnostic(res)()
CPU times: user 2.17 s, sys: 125 ms, total: 2.29 s
Wall time: 2.17 s
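The `%%time` magic used above is only available in IPython/Jupyter. For repeating the comparison in a plain Python script, a minimal sketch could look like the following. The helper name `time_call` is hypothetical (it is not part of lmdiag or statsmodels), and it measures wall time only.

```python
import time


def time_call(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed wall time in seconds).

    Hypothetical helper for illustration; not part of lmdiag or statsmodels.
    """
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs on its own.
total, seconds = time_call(sum, range(1_000_000))
print(f"sum = {total}, wall time = {seconds:.3f} s")
```

Assuming `res` is the fitted model from the example, `time_call(lmdiag.plot, res)` would time the lmdiag call in the same way.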
Hi @tueda , thanks for your feedback!
Interesting to see LinearRegDiagnostic. I had never seen it before; it certainly wasn't there when I implemented lmdiag 4 years ago.
Looking quickly over their code, they do some calculations more cleverly than I did!
To be honest, looking at lmdiag now, with 4 more years of experience, I think its code quality is really poor. For example, the same calculations are repeated for the various plots.
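To illustrate the kind of repetition meant here, the sketch below computes quantities shared by several diagnostic plots (fitted values, residuals, leverage) once and reuses them. It is not lmdiag's or statsmodels' code: the class `SimpleOLSDiagnostics` is hypothetical, it handles only a simple regression with one predictor and an intercept, and the call counter exists purely to show the caching.

```python
from functools import cached_property
from statistics import fmean


class SimpleOLSDiagnostics:
    """Hypothetical helper: caches quantities shared by several diagnostic plots.

    Assumes a simple linear regression y = a + b * x (one predictor, intercept).
    """

    def __init__(self, x, y):
        self.x = list(x)
        self.y = list(y)
        self.n_leverage_calls = 0  # only here to demonstrate the caching

    @cached_property
    def fitted(self):
        x_mean, y_mean = fmean(self.x), fmean(self.y)
        sxx = sum((xi - x_mean) ** 2 for xi in self.x)
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(self.x, self.y))
        slope = sxy / sxx
        intercept = y_mean - slope * x_mean
        return [intercept + slope * xi for xi in self.x]

    @cached_property
    def residuals(self):
        return [yi - fi for yi, fi in zip(self.y, self.fitted)]

    @cached_property
    def leverage(self):
        # Diagonal of the hat matrix for simple regression:
        # h_i = 1/n + (x_i - mean(x))^2 / Sxx
        self.n_leverage_calls += 1
        n = len(self.x)
        x_mean = fmean(self.x)
        sxx = sum((xi - x_mean) ** 2 for xi in self.x)
        return [1 / n + (xi - x_mean) ** 2 / sxx for xi in self.x]

    @cached_property
    def standardized_residuals(self):
        n = len(self.x)
        # Residual variance estimate with n - 2 degrees of freedom
        sigma2 = sum(r ** 2 for r in self.residuals) / (n - 2)
        return [
            r / (sigma2 * (1 - h)) ** 0.5
            for r, h in zip(self.residuals, self.leverage)
        ]


diag = SimpleOLSDiagnostics([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
# A scale-location plot and a residuals-vs-leverage plot would both need these:
_ = diag.standardized_residuals
_ = diag.leverage
print(diag.n_leverage_calls)  # leverage was computed only once
```

With this layout, each plotting function reads from the same object, so adding a plot does not add another pass over the data for quantities already computed.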
Unfortunately, I don't have the spare time for a rewrite (which would be best). I see some low-effort optimizations, which could give a ~20% speed-up and which I might be able to do, but I'm not sure if it's worth it:
I wonder if it makes more sense to archive lmdiag and redirect people to the faster and more feature-rich LinearRegDiagnostic instead?
May I ask why you would prefer to use lmdiag instead of LinearRegDiagnostic?
(It's a shame that they implemented this only as an "example" (without a license?) and did not integrate it into statsmodels...)
> Interesting to see LinearRegDiagnostic. I had never seen it before; it certainly wasn't there when I implemented lmdiag 4 years ago.
Indeed, the example notebook first appeared in v0.14.0, which is rather recent.
> Unfortunately, I don't have the spare time for a rewrite (which would be best). I see some low-effort optimizations, which could give a ~20% speed-up and which I might be able to do, but I'm not sure if it's worth it:
I understand that you are busy and that it is not easy to devote time to maintaining and improving the library. This is not urgent.
> May I ask why you would prefer to use lmdiag instead of LinearRegDiagnostic?
Personally, I prefer the simple syntax lmdiag.plot(res) to LinearRegDiagnostic(res)(). Another important advantage of lmdiag is that it can be easily installed with pip. This is very handy in environments such as Google Colab, and much better than copying long code from the example notebook.