
Pandas profiling becoming too slow: unusable

Open arita37 opened this issue 4 years ago • 10 comments

With 10k rows and 30 columns, it takes more than 2 minutes to generate a report...

Pandas profiling keeps getting slower and slower...

Can you run benchmark tests?

arita37 avatar Mar 31 '21 08:03 arita37

Could you provide a dataset to reproduce? Did you test against prior versions of this package?

sbrugman avatar Apr 02 '21 09:04 sbrugman

The dataset is simple:

40k rows, 37 columns of floats.

You could add an sklearn random dataset to your regression tests, along the lines of the sketch below.
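Something like this would generate a comparable dataset (make_regression and the column names here are just one possible choice):

```python
from sklearn.datasets import make_regression
import pandas as pd

# Sketch of the dataset described above: 40k rows, 37 float columns.
X, _ = make_regression(n_samples=40_000, n_features=37, random_state=0)
df = pd.DataFrame(X, columns=[f"col_{i}" for i in range(37)])
```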


arita37 avatar Apr 02 '21 10:04 arita37

What settings are you using? There is always a trade-off between performance and which statistics are generated. For instance, on a dataset similar to the one you mention, with the minimal=True setting the full report, including HTML rendering, takes 2.5 seconds.
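For reference, minimal mode is a one-liner (a sketch; in the versions from this thread the import is pandas_profiling, and df is the DataFrame being profiled):

```python
from pandas_profiling import ProfileReport

# minimal=True turns off the expensive computations (correlations,
# interactions, missing diagrams, ...) for a much faster report.
profile = ProfileReport(df, minimal=True)
profile.to_file("report.html")
```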

sbrugman avatar Apr 02 '21 10:04 sbrugman

OK, how much time does it take with minimal=False?

My suggestions:

  1. Run a speed benchmark on each release with nrows=10,000 and ncolumns=50 (i.e. a small dataset).

  2. Find tricks to skip heavy computations or make them optional -> mostly the pairwise ones.

Mainly: ratio = number of unique values within the (5%, 95%) quantile range / len(df); if the ratio is low, the column is nearly constant and the computation might be skipped (see the sketch after this list).
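A rough sketch of that heuristic (the quantile window and the 0.01 threshold are assumptions, not existing package behavior):

```python
import pandas as pd

def should_skip_pairwise(s: pd.Series, threshold: float = 0.01) -> bool:
    """Skip heavy pairwise stats for near-constant columns.

    Counts the unique values between the 5% and 95% quantiles and
    compares that to the column length; the threshold is a guess.
    """
    lo, hi = s.quantile(0.05), s.quantile(0.95)
    ratio = s[s.between(lo, hi)].nunique() / len(s)
    return ratio < threshold
```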


arita37 avatar Apr 02 '21 14:04 arita37

I suggest using pyinstrument to benchmark.

You'll see which functionality to deactivate by default.
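Something along these lines (pyinstrument's Profiler API, wrapped around an example report call):

```python
from pyinstrument import Profiler
from pandas_profiling import ProfileReport

profiler = Profiler()
profiler.start()
ProfileReport(df).to_html()  # the workload to measure
profiler.stop()

# Print a call tree showing where the time went.
print(profiler.output_text(unicode=True, color=True))
```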


arita37 avatar Apr 02 '21 15:04 arita37

@arita37 Sounds good. Would you be interested in contributing a PR and working out the sketched solution?

sbrugman avatar Apr 02 '21 16:04 sbrugman

Sorry, I am too busy... Check my profile and count the number of commits I have made...

It takes 5 lines of code:

Random numpy dataset, 10,000 rows x 50 columns; run pandas profiling under pyinstrument; identify the bottlenecks.

Make any computation that takes more than 40 seconds optional.

Keep the total compute time under 60 seconds.
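As a sketch, the benchmark itself really does fit in a few lines (the 60-second budget check and the random data are the assumptions here):

```python
import time

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

# Random dataset: 10,000 rows x 50 float columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10_000, 50)),
                  columns=[f"f{i}" for i in range(50)])

# Time the full (non-minimal) report generation.
start = time.perf_counter()
ProfileReport(df).to_file("benchmark_report.html")
elapsed = time.perf_counter() - start

# Fail the release check if the 60-second budget is exceeded.
assert elapsed < 60, f"report took {elapsed:.1f}s, budget is 60s"
```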

The more bloated pandas profiling gets, the less usable it becomes...

Don't spend time on useless Jupyter widgets and the like.

Make sure the code runs fast -> people will keep using it because it is faster than working in Jupyter...

The HTML report replaces Jupyter itself.


arita37 avatar Apr 02 '21 18:04 arita37

I ran a 49 million record table with 86 variables. Yeah, it took 30 minutes to run, but I'm good with that because the data insights I got from the report are an incredible help.

I turned off correlations, interactions, and all of the missing-value diagrams.
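Roughly like this (a sketch; the exact kwargs depend on the pandas-profiling/ydata-profiling version, so check the docs for your release):

```python
from pandas_profiling import ProfileReport

# Disable the most expensive sections individually
# instead of using minimal=True.
profile = ProfileReport(
    df,
    correlations=None,       # no correlation matrices
    interactions=None,       # no pairwise interaction plots
    missing_diagrams=None,   # no missing-value diagrams
)
profile.to_file("report.html")
```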

dpnem avatar Apr 13 '21 14:04 dpnem

Are correlations, interactions, and missing diagrams the most computationally expensive parts? Could you share additional insight into which tasks are expensive during report generation?

akshayreddykotha avatar May 26 '21 23:05 akshayreddykotha

@dpnem Would you share the hardware specs of the computer you ran the profiling on?

enesMesut avatar Sep 21 '21 08:09 enesMesut

too slow...

zcfrank1st avatar May 12 '23 02:05 zcfrank1st