lmdiag icon indicating copy to clipboard operation
lmdiag copied to clipboard

Slow for large datasets

Open tueda opened this issue 1 year ago • 2 comments
trafficstars

Thank you for creating a nice package. It is very handy to install such a functionality with pip.

However, I find this package to be rather slow for large datasets in comparison with the LinearRegDiagnostic class described in the Linear regression diagnostics example of statsmodels. This may indicate some inefficiency of the package.

Example:

df = sm.datasets.get_rdataset("ames", "openintro").data
res = smf.ols("np.log10(price) ~ Q('Overall.Qual') + np.log(area)", df).fit()

lmdiag

%%time
lmdiag.plot(res)
CPU times: user 15.1 s, sys: 215 ms, total: 15.3 s
Wall time: 16.1 s

LinearRegDiagnostic

%%time
LinearRegDiagnostic(res)()
CPU times: user 2.17 s, sys: 125 ms, total: 2.29 s
Wall time: 2.17 s

tueda avatar Dec 05 '23 11:12 tueda

Hi @tueda , thanks for your feedback!

Interesting to see LinearRegDiagnostic. Never seen it before, it certainly wasn't there when I implemented lmdiag 4 years ago.

Looking quickly over their code, they do some calculations smarter than I did!

To be honest, looking at lmdiag now, with 4 years more experience, I think its code quality is really poor. E.g. with the same calculations being repeated for the various plots etc...

Unfortunately, I don't have the sparetime for a rewrite (which would be best). I see some low effort optimizations, which could give ~20% speed-up, and which I might be able to do, but I'm not sure if it's worth it:

I wonder if it makes more sense to archive lmdiag and redirect people to the faster and more feature-rich LinearRegDiagnostic instead?

May I ask, why you would prefer to use lmdiag instead of LinearRegDiagnostic?

(It's a shame, that they implemented this only as an "example" (w/o license?) and not integrated it in statsmodels...)

dynobo avatar Dec 11 '23 00:12 dynobo

Interesting to see LinearRegDiagnostic. Never seen it before, it certainly wasn't there when I implemented lmdiag 4 years ago.

Indeed, the example notebook has appeared in v0.14.0, which is rather new.

Unfortunately, I don't have the sparetime for a rewrite (which would be best). I see some low effort optimizations, which could give ~20% speed-up, and which I might be able to do, but I'm not sure if it's worth it:

I understand that you are busy and that it is not easy to devote time to maintaining and improving the library. This is not urgent.

May I ask, why you would prefer to use lmdiag instead of LinearRegDiagnostic?

Personally, I prefer the simple syntax lmdiag.plot(res) to LinearRegDiagnostic(res)(). Another important advantage of lmdiag is that it can be easily installed by pip. It is very handy for use with Google Colab etc.: much better than copying long code from the example notebook.

tueda avatar Dec 16 '23 12:12 tueda