
Question about performance

Open pgschr opened this issue 4 years ago • 7 comments

Your library is very cool, but do you have any information about performance? For example, how long will MeanMedianImputer take to process a dataset with one million rows? Can I run that on a laptop, or do I need to go to the cloud? What about processing in parallel? Thanks.

pgschr avatar Nov 28 '20 12:11 pgschr

Hey @pgschr, this is a very important question, and we get it a lot. We should measure it. At the moment, we are focusing our efforts on the next release: improving the code and docs, and adding feature selection.

We will pick this one up later.

Thank you!

solegalli avatar Dec 07 '20 08:12 solegalli

Sole, this goes hand in hand with production readiness. Can we make sure that the library doesn't throw strange errors when used under stress?

pgschr avatar Dec 07 '20 17:12 pgschr

@pgschr what sort of strange errors are you referring to?

Re: performance, what benchmark do you suggest? Naturally, this depends on the hardware used.

ChristopherGS avatar Dec 07 '20 19:12 ChristopherGS

@ChristopherGS One of my use cases is to take 100 million rows and fill the missing values in a column with the average or the most frequent value. How many cores and how much memory do you need to run this in 10 minutes? Also, what if you need to run the same process on 100 columns? An example of strange errors is what happens when parallel processes make calculations concurrently.
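For a rough sense of scale, the raw memory footprint can be estimated with simple arithmetic. The sketch below assumes dense float64 columns (8 bytes per value) and ignores index and library overhead; the helper names are made up for illustration.

```python
# Back-of-the-envelope estimate; assumes dense float64 columns (8 bytes per value).
BYTES_PER_FLOAT64 = 8


def column_bytes(n_rows: int, n_columns: int = 1) -> int:
    """Raw size in bytes of n_columns dense float64 columns of n_rows each."""
    return n_rows * n_columns * BYTES_PER_FLOAT64


def to_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (10**9 bytes)."""
    return n_bytes / 1e9


one_column = column_bytes(100_000_000)
hundred_columns = column_bytes(100_000_000, 100)

print(f"1 column:    {to_gb(one_column):.1f} GB")       # 0.8 GB
print(f"100 columns: {to_gb(hundred_columns):.1f} GB")  # 80.0 GB
```

Under these assumptions a single column fits comfortably in laptop memory, while 100 columns at once would not. If the imputer copies the data before filling it, peak usage would be roughly double these figures.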

pgschr avatar Dec 07 '20 20:12 pgschr

@pgschr by any chance, did you test this?

Sounds like you are in a good position to test the speed of feature-engine on your dataset, and maybe even compare it with the transformers from scikit-learn, i.e., compare the speed of MeanMedianImputer from feature-engine vs SimpleImputer from sklearn?

That would be very helpful.
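A comparison like this could start from a small timing harness. The sketch below is only an illustration: `StandInMeanImputer`, `time_fit_transform` and `make_column` are hypothetical names, and the stand-in class is a pure-Python placeholder so the example runs without any third-party package. It is not either library's implementation.

```python
import random
import statistics
import time


class StandInMeanImputer:
    """Hypothetical stand-in with a fit/transform interface; not a library class."""

    def fit(self, column):
        observed = [v for v in column if v is not None]
        self.mean_ = statistics.fmean(observed)
        return self

    def transform(self, column):
        return [self.mean_ if v is None else v for v in column]


def time_fit_transform(make_imputer, data, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` fit+transform runs."""
    timings = []
    for _ in range(repeats):
        imputer = make_imputer()  # fresh object per run, so no fitted state is reused
        start = time.perf_counter()
        imputer.fit(data).transform(data)
        timings.append(time.perf_counter() - start)
    return min(timings)


def make_column(n_rows, missing_rate=0.1, seed=0):
    """Synthetic column with roughly `missing_rate` of the values set to None."""
    rng = random.Random(seed)
    return [None if rng.random() < missing_rate else rng.gauss(0.0, 1.0)
            for _ in range(n_rows)]


column = make_column(100_000)
elapsed = time_fit_transform(StandInMeanImputer, column)
print(f"stand-in imputer, 100,000 rows: {elapsed:.4f} s")
```

To time the real transformers, the same function should accept factories such as `lambda: MeanMedianImputer(imputation_method="mean")` (from `feature_engine.imputation`) and `lambda: SimpleImputer(strategy="mean")` (from `sklearn.impute`), with a pandas DataFrame as `data`. The numbers will depend on the hardware, so they are only comparable when both runs happen on the same machine.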

solegalli avatar Dec 08 '20 08:12 solegalli

Sole, I'm in research mode; I haven't installed or learned feature-engine yet.

It seems that dask is the way to go for parallel processing of large data frames. Is dask integrated with feature-engine? If not, is it on the roadmap?
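The idea behind partitioned processing can be sketched without dask. For mean imputation, each chunk contributes a sum and a count of observed values, these are combined into one mean, and a second pass fills the gaps chunk by chunk. This is a plain-Python illustration with hypothetical function names, not dask or feature-engine code.

```python
from typing import Iterable, Iterator, List, Optional

Chunk = List[Optional[float]]


def chunked_mean(chunks: Iterable[Chunk]) -> float:
    """Pass 1: combine per-chunk sums and counts of observed values into one mean."""
    total, count = 0.0, 0
    for chunk in chunks:
        observed = [v for v in chunk if v is not None]
        total += sum(observed)
        count += len(observed)
    if count == 0:
        raise ValueError("no observed values to average")
    return total / count


def impute_chunks(chunks: Iterable[Chunk], fill_value: float) -> Iterator[List[float]]:
    """Pass 2: fill missing values one chunk at a time."""
    for chunk in chunks:
        yield [fill_value if v is None else v for v in chunk]


chunks = [[1.0, None, 3.0], [None, 5.0], [7.0]]
mean = chunked_mean(chunks)
filled = list(impute_chunks(chunks, mean))

print(mean)    # 4.0
print(filled)  # [[1.0, 4.0, 3.0], [4.0, 5.0], [7.0]]
```

Because the per-chunk sums and counts are independent of each other, they could be computed by separate workers and merged afterwards. A median cannot be combined from per-chunk results this simply, so it would need a different strategy.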

pgschr avatar Dec 12 '20 18:12 pgschr

Dask is not integrated, and it is not on the roadmap at the moment. We will consider it when we do the speed tests. Cheers, Sole

solegalli avatar Dec 14 '20 09:12 solegalli