group by (datatable vs pandas)
Hi,
I tested a group_by operation on a 9.6GB file, and my results don't agree with this benchmark.
env: 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Ram 256G
Python 3.7.1 (default, Dec 14 2018, 19:28:38)
IPython 7.2.0 -- An enhanced Interactive Python. Type '?' for help.
Test: Read
import datatable as dt
%timeit dt.fread("/projects/salmonella/data/cgr.liv.ac.uk/parsed/cgr.parsed.features.tsv")
#42.3 s ± 3.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
import pandas as pd
%timeit pd.read_csv("/projects/salmonella/data/cgr.liv.ac.uk/parsed/cgr.parsed.features.tsv", sep = "\t")
#1min 28s ± 347 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I also tried to figure out the cost of the conversion from a datatable Frame to pandas:
%timeit dfp = df.to_pandas()
57.8 s ± 1.81 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
The conversion is too expensive; there are other packages that convert an R data.frame to pandas really fast.
But the point is, I tried the group_by with count:
test: group by
#datatable
%timeit df[ : , dt.count() , dt.by("nucleotide")]
6.7 s ± 939 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#panda
%timeit dfp.groupby("nucleotide").size()
2.47 s ± 41.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Or is there a better way to perform a count by group with datatable?
dt.count is used to count groupings, but it is missing in your code. Is it a typo?
Hi @acpguedes, as mentioned by @samukweku, you can use dt.count in this way:
df[ : , dt.count() , dt.by("nucleotide")]
Hope this helps.
Thanks, @samukweku and @goldentom42
Actually it was a typo, but...
%timeit df[ : , dt.count() , dt.by("nucleotide")]
6.88 s ± 455 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
These are the versions of the packages I have:
In [8]: pd.__version__
Out[8]: '1.0.5'
In [9]: dt.__version__
Out[9]: '0.10.1'
And the table has 28383739 rows x 22 columns.
I don't have your hardware specs, so I cannot replicate this for such a large file. I will try a file on my PC and see how it goes. Surprising that groupby is slower in datatable.
@samukweku the groupby in the benchmark uses a sum, which I didn't try.
Another point is the conversion from datatable to pandas. Even for a large df, and even considering it is a different data structure, I don't understand why the conversion takes so long.
Converting to pandas is something that I believe will be improved over time with regard to speed; lots of improvements are still possible. If you can, upgrade to the dev version and see if there are any changes for the groupby. With your feedback, I'm sure the datatable team will work to improve the speed.
@samukweku nice, I will. I'm not a very advanced programmer, but I'll contribute as far as I can.
I just ran the count test on a file on my PC and I agree with you: pandas count after grouping is about 4x faster than datatable.
I would also suggest you change your title to something more descriptive, say, "count after grouping is slower in datatable than in pandas", and mention that conversion from datatable to pandas is slower compared to some other packages you know. The premise of datatable is speed; I believe the maintainers would be keen to work on your feedback. You could even test more features and provide feedback; that way the maintainers can identify what to improve on.
Is there a link to the dataset in question? The performance may depend on characteristics of data, such as column types, number of unique values, etc.
@st-pasha, I used one of the files from this dataset: http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000-2001.tgz, specifically Performance_2000Q1.txt. I do not have the hardware specs of @acpguedes, so I did not run the entire 16GB.
@st-pasha the column type used in the groupby is a string.
I had no way to share the file but, anyway, I only used the df itself for a test. It is a GFF-like file.
I can run any other test, as simple as possible, if you send me some df and the commands.
I ran it on the same data @samukweku used, but with Performance_2001Q4.txt_1.
%timeit df[ : , dt.count(), dt.by("C0")]
7.71 s ± 722 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dfp.groupby("C0").size()
412 ms ± 4.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It is a huge difference. I also expected datatable to be faster than pandas, as data.table is in R.