
group by (datatable vs pandas)

Open acpguedes opened this issue 3 years ago • 12 comments

Hi,

I tested a group_by operation on a 9.6 GB file, and my results don't agree with this benchmark.

env: 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Ram 256G

Python 3.7.1 (default, Dec 14 2018, 19:28:38) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.2.0 -- An enhanced Interactive Python. Type '?' for help.

Test: Read

import datatable as dt
%timeit dt.fread("/projects/salmonella/data/cgr.liv.ac.uk/parsed/cgr.parsed.features.tsv")
#42.3 s ± 3.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
import pandas as pd
%timeit pd.read_csv("/projects/salmonella/data/cgr.liv.ac.uk/parsed/cgr.parsed.features.tsv", sep = "\t")
#1min 28s ± 347 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I also tried to figure out the cost of converting a datatable Frame to pandas:

%timeit dfp = df.to_pandas()                                                                                                                 
57.8 s ± 1.81 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

The conversion is too expensive; there are other packages that convert an R data.frame to pandas really fast.
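As a rough illustration of what such a conversion involves (a numpy/pandas sketch with made-up columns, not datatable's actual internals): a columnar frame is essentially a set of per-column buffers handed to pandas, and I believe numeric buffers can sometimes be reused, while string columns generally have to be materialized, which tends to be the slow part.

```python
import numpy as np
import pandas as pd

# Hypothetical column store: one numpy array per column, roughly how a
# columnar frame exposes its data to pandas.
columns = {
    "nucleotide": np.array(["A", "C", "G"]),  # string column: usually copied
    "count": np.array([10, 20, 30]),          # numeric column: buffer may be reused
}

# Building the pandas DataFrame from the existing arrays.
dfp = pd.DataFrame(columns)
print(dfp.shape)  # (3, 2)
```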

But the main point is that I tried group_by with a count.

Test: group by

#datatable 
%timeit df[ : , dt.count() , dt.by("nucleotide")]                                                                                                     
6.7 s ± 939 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#panda 
%timeit dfp.groupby("nucleotide").size()                                                                                                     
2.47 s ± 41.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Or is there a better way to perform a count by group with datatable?

acpguedes avatar Jul 23 '20 23:07 acpguedes

dt.count is used to count groupings, but this is missing in your code. Is it a typo?

samukweku avatar Jul 24 '20 02:07 samukweku

Hi @acpguedes as mentioned by @samukweku you can use dt.count in this way:

df[ : , dt.count() , dt.by("nucleotide")]

Hope this helps.
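For a side-by-side view, the pandas expression that matches it (a sketch on toy data, not your real table) would be:

```python
import pandas as pd

# Hypothetical toy data standing in for the real "nucleotide" column.
dfp = pd.DataFrame({"nucleotide": ["A", "A", "C"]})

# pandas equivalent of df[:, dt.count(), dt.by("nucleotide")]:
# group on the key, count rows per group, keep the key as a column.
out = dfp.groupby("nucleotide").size().reset_index(name="count")
print(out)  # two rows: A -> 2, C -> 1
```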

goldentom42 avatar Jul 24 '20 05:07 goldentom42

Thanks, @samukweku and @goldentom42

Actually it was a typo, but...

%timeit df[ : , dt.count() , dt.by("nucleotide")]                                                                                             
6.88 s ± 455 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

These are the versions of the packages I have:

In [8]: pd.__version__                                                                                                                                
Out[8]: '1.0.5'

In [9]: dt.__version__                                                                                                                                
Out[9]: '0.10.1'

And the table has 28383739 rows x 22 columns.

acpguedes avatar Jul 24 '20 16:07 acpguedes

I don't have your hardware specs, so I cannot replicate with such a large file. I will try a file on my PC and see how it goes. It is surprising that groupby is slower in datatable.

samukweku avatar Jul 24 '20 20:07 samukweku

@samukweku the groupby in the benchmark uses a sum, which I didn't try.

Another point is the conversion from datatable to pandas. Even for a large df, and allowing for the different data structures, I don't understand why the conversion takes so long.

acpguedes avatar Jul 24 '20 20:07 acpguedes

Converting to pandas is something that I believe will improve in speed over time; lots of improvements are still possible. If you can, upgrade to the dev version and see if there are any changes for the groupby. With your feedback, I'm sure the datatable team will work to improve the speed.

samukweku avatar Jul 24 '20 20:07 samukweku

@samukweku nice, I will.

I'm not a very advanced programmer, but I'll contribute as far as I can.

acpguedes avatar Jul 24 '20 20:07 acpguedes

Just ran the count test on a file on my PC, and I agree with you: pandas' count after grouping is about 4x faster than datatable's.
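For the record, the pandas-side count is just a grouped size; on synthetic data (hypothetical, not the mortgage file) it can equivalently be written with value_counts, which returns the same per-group counts:

```python
import pandas as pd

# Hypothetical stand-in for the real grouping column.
dfp = pd.DataFrame({"nucleotide": ["A", "C", "A", "G", "A", "C"]})

# The benchmarked expression: count rows per group (sorted by key).
sizes = dfp.groupby("nucleotide").size()

# value_counts() yields the same counts (ordered by frequency by default).
counts = dfp["nucleotide"].value_counts()

print(sizes.to_dict())  # {'A': 3, 'C': 2, 'G': 1}
```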

samukweku avatar Jul 24 '20 21:07 samukweku

I would also suggest you change your title to something more descriptive, say, "count after grouping is slower in datatable than in pandas", and mention that conversion from datatable to pandas is slower compared to some other packages you know of. The premise of datatable is speed; I believe the maintainers would be keen to work on your feedback. You could even test more features and provide feedback; that way the maintainers can identify what to improve.

samukweku avatar Jul 24 '20 21:07 samukweku

Is there a link to the dataset in question? The performance may depend on characteristics of data, such as column types, number of unique values, etc.

st-pasha avatar Jul 25 '20 01:07 st-pasha

@st-pasha, I used one of the files from this dataset: http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000-2001.tgz, specifically Performance_2000Q1.txt. I do not have the hardware specs of @acpguedes, so I did not run the entire 16 GB.

samukweku avatar Jul 25 '20 04:07 samukweku

@st-pasha the column type used in the groupby is a string. I have no way to share the data, but anyway, I only used the df itself for a test; it is a gff-like file.

I can run any other test, as simple as possible, if you send me some df and the commands.

I ran on the same data @samukweku used, but with Performance_2001Q4.txt_1.

%timeit df[ : , dt.count(), dt.by("C0")]                                                                                                     
7.71 s ± 722 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit dfp.groupby("C0").size()                                                                                                             
412 ms ± 4.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It is a huge difference. I also expected datatable to be faster than pandas, as data.table is in R.
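To make the comparison reproducible outside IPython, the same measurement can be scripted with the stdlib timeit module (a sketch on synthetic integer keys; the real C0 column and its cardinality will differ):

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the "C0" grouping column (hypothetical size/cardinality).
rng = np.random.default_rng(0)
dfp = pd.DataFrame({"C0": rng.integers(0, 1000, size=100_000)})

# Mirror %timeit: average wall time over repeated grouped counts.
n = 10
t = timeit.timeit(lambda: dfp.groupby("C0").size(), number=n)
print(f"{t / n:.4f} s per loop")
```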

acpguedes avatar Jul 25 '20 18:07 acpguedes