use arrow::read_parquet instead of nanoparquet
I've found in my benchmarks that nanoparquet is much less efficient than arrow in terms of speed and RAM usage:
        expression median mem_alloc  name  size
            <char>  <num>     <num> <char> <char>
 1:     df_parquet  1.153     5.578 write small
 2: df_nanoparquet  0.674   183.986 write small
 3:     dt_parquet  5.172     0.018 write small
 4: dt_nanoparquet  0.656   183.876 write small
 5:     df_parquet 10.878     0.015 write   big
 6: df_nanoparquet 10.182  2068.884 write   big
 7:     dt_parquet 11.461     0.015 write   big
 8: dt_nanoparquet 10.038  2068.947 write   big
 9:     df_parquet  0.088    34.901  read small
10: df_nanoparquet  0.414   183.187  read small
11:     df_parquet  1.187     0.009  read   big
12: df_nanoparquet  5.180  1324.072  read   big
Speed and RAM usage when reading big files are not very good.
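For reference, roughly how a comparison like the table above can be set up with bench::mark (a sketch only, not my actual script, which I share in a comment below; df stands for a plain data.frame and dt for a data.table holding the same data):

library(bench)
library(data.table)

# toy data; the "big" case is the same thing scaled up
df <- as.data.frame(matrix(runif(1e6), ncol = 10))
dt <- as.data.table(df)
f  <- tempfile(fileext = ".parquet")

bench::mark(
  df_parquet     = arrow::write_parquet(df, f),
  df_nanoparquet = nanoparquet::write_parquet(df, f),
  dt_parquet     = arrow::write_parquet(dt, f),
  dt_nanoparquet = nanoparquet::write_parquet(dt, f),
  check      = FALSE,  # the expressions return different things
  iterations = 3       # median is taken over 3 iterations
)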
On the nanoparquet repo they say:
Being single-threaded and not fully optimized,
nanoparquet is probably not suited well for large data sets.
It should be fine for a couple of gigabytes.
Reading or writing a ~250MB file that has 32 million rows
and 14 columns takes about 10-15 seconds on an M2 MacBook Pro.
For larger files, use Apache Arrow or DuckDB.
rio already uses arrow for feather, so I'm not sure why we rely on nanoparquet for parquet.
If you keep nanoparquet as the default, maybe we could have an option to use arrow instead?
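In the meantime, calling arrow directly is the obvious workaround; a minimal sketch (the path is just an example):

library(arrow)

# write and read parquet via arrow, bypassing rio's nanoparquet default
path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)
df <- read_parquet(path)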
@BenoitLondon Thank you for the benchmark.
As you asked the why question: long story short of #315, we wanted Parquet support by default. At first, rio 1.0.0 shipped with arrow, but that was quickly reverted due to installation concerns. Then later, nanoparquet by @gaborcsardi was supposed to be installed by default, because it is dependency free. But it was, again, reverted due to the insufficient support for Big Endian platforms (r-lib/nanoparquet#21). And therefore we have the funny state of reading parquet with nanoparquet but feather with arrow. Going back to pre-1.1, arrow was used for both parquet and feather in the so-called Suggests tier.
We are somewhat reluctant to introduce options for choosing which package to use; we are still cleaning those up from the pre-1.0 era. I don't mind switching back to arrow altogether. At the same time, I also believe that @gaborcsardi is actively developing nanoparquet to make it more efficient.
Can you share the code for the benchmark?
Some notes:
- The dev version of nanoparquet has a completely rewritten read_parquet(), which is much faster. (See below.)
- I suspect that you can't really compare mem_alloc, because it only includes memory allocated within R, and arrow probably allocates most of its memory in C/C++ (see the sketch after this list).
- I am not totally sure how to interpret the results. E.g. does

          expression median mem_alloc  name  size
      3:     dt_parquet  5.172     0.018 write small
      4: dt_nanoparquet  0.656   183.876 write small

  mean that nanoparquet is 8 times faster here? Or 8 times slower?
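To illustrate the mem_alloc point, a rough sketch (assuming a parquet file f already exists; if I remember the arrow API correctly, default_memory_pool() exposes arrow's own C++ allocations, which bench::mark() does not see):

f <- "test.parquet"  # example path

# mem_alloc only counts allocations made through R's allocator
res <- bench::mark(
  nano  = nanoparquet::read_parquet(f),
  arrow = as.data.frame(arrow::read_parquet(f)),
  check = FALSE
)
res[, c("expression", "median", "mem_alloc")]

# arrow keeps most of its memory in C++ buffers, tracked by its own pool
arrow::default_memory_pool()$bytes_allocated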
Not really a good benchmark, but I just ran arrow and nanoparquet on the mentioned 33 million row data set (10x flights from nycflights13), and nanoparquet is about 2 times faster at writing, and about the same at reading. (This is with options(arrow.use_altrep = FALSE), so that arrow actually reads the data.)
It would be great to have a proper benchmark, but nevertheless I'll update the note in the nanoparquet README, because it is actually competitive in terms of speed. I suspect that it is also competitive in terms of memory, but we'd need a better way to measure that.
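A minimal sketch of the comparison described above, with ALTREP disabled so that arrow materializes the data (the replication factor is just an example and gives roughly 3.4 million rows here, not 33 million):

library(nycflights13)

options(arrow.use_altrep = FALSE)  # force arrow to actually read the values

big <- do.call(rbind, replicate(10, flights, simplify = FALSE))

f_nano  <- tempfile(fileext = ".parquet")
f_arrow <- tempfile(fileext = ".parquet")

system.time(nanoparquet::write_parquet(big, f_nano))
system.time(arrow::write_parquet(big, f_arrow))

system.time(nanoparquet::read_parquet(f_nano))
system.time(as.data.frame(arrow::read_parquet(f_arrow)))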
Oh, thanks guys for the explanations, very much appreciated! I guess my benchmarks were not very well designed. I suspected there was some ALTREP magic behind those numbers, and the RAM figures didn't look correct either. I will use some summary after reading to make sure the data is actually loaded into R.
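For example, something like this sketch: touching every column after the read forces ALTREP / lazy vectors to materialize inside the timed expression (read_fully and the file name are made up for illustration):

read_fully <- function(path, reader) {
  df <- reader(path)
  invisible(summary(df))  # summary() visits every value in every column
  df
}

f <- "df_big_test.parquet"  # example file
bench::mark(
  nano  = read_fully(f, nanoparquet::read_parquet),
  arrow = read_fully(f, arrow::read_parquet),
  check = FALSE
)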
median is the median time of 3 iterations, so yes, in the small data set case nanoparquet is 8 times faster than arrow.
I'm very happy to use nanoparquet if there's no downside (my use case is basically writing/reading biggish files (1-5 GB) in R and also reading them in Python or Julia, so I wanted compatibility, speed, and low RAM usage if possible).
Thanks again. I will share my benchmark when fixed ;)
It is a question of how much this generalizes, but nanoparquet does not look bad at all: https://nanoparquet.r-lib.org/dev/articles/benchmarks.html#parquet-implementations-1
Thanks @gaborcsardi, I find similar results, though I'm not sure why I still find arrow faster at reading in my benchmarks, while nanoparquet is faster at writing and at a full read + write cycle.
I think I disabled ALTREP properly; maybe the number of cores (I have 72 there) makes a difference.
Anyway, I'm very happy to use nanoparquet through rio as performance looks on par!
        expression   size      median    mem_alloc  name                        fn filesize
            <char> <char>       <num>        <num> <char>                    <char>    <num>
 1: df_nanoparquet    big 21.28752221 4.548025e+03   full df_big_test_nano.parquet       238
 2:     df_parquet    big 26.38431233 2.466664e+03   full   df_big_test_ar.parquet       240
 3: df_nanoparquet    big  5.89444197 2.967702e+03   read df_big_test_nano.parquet       238
 4:     df_parquet    big  2.52957607 2.466656e+03   read   df_big_test_ar.parquet       240
 5: df_nanoparquet    big  8.89001248 1.580325e+03  write df_big_test_nano.parquet       238
 6:     df_parquet    big 10.45254921 1.748657e-02  write   df_big_test_ar.parquet       240
 7: dt_nanoparquet    big  8.61057447 1.580388e+03  write dt_big_test_nano.parquet       238
 8:     dt_parquet    big 10.74620440 1.768494e-02  write   dt_big_test_ar.parquet       240
 9: df_nanoparquet  small  0.55284996 1.519936e+02   full     df_test_nano.parquet         8
10:     df_parquet  small  0.45033587 8.611523e+01   full       df_test_ar.parquet         8
11: df_nanoparquet  small  0.17726478 9.927212e+01   read     df_test_nano.parquet         8
12:     df_parquet  small  0.09908626 8.806947e+01   read       df_test_ar.parquet         8
13: df_nanoparquet  small  0.24138912 5.308556e+01  write     df_test_nano.parquet         8
14:     df_parquet  small  0.34817094 5.590614e+00  write       df_test_ar.parquet         8
15: dt_nanoparquet  small  0.24777139 5.294093e+01  write     dt_test_nano.parquet         8
16:     dt_parquet  small  0.41581183 1.983643e-02  write       dt_test_ar.parquet         8
Here's my script for info: file_format_benchmark.txt
@BenoitLondon With which versions of the packages?
I am also not sure if you can just run bench::mark(), because Arrow or the OS may reuse the already open memory maps, so reading the same file the second time will not actually read it again.
But yeah, it is also true in general that the results will vary among systems. In particular, Arrow's concurrent I/O will probably take advantage of more advanced I/O architectures.
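One way to reduce that effect, as a sketch (my assumption about what is good enough: copying the file to a fresh path before each read defeats re-use of an already open file, though not the OS page cache entirely, and the copy time is included in the measurement):

read_fresh <- function(src, reader) {
  tmp <- tempfile(fileext = ".parquet")
  file.copy(src, tmp)  # fresh path for every timed read
  on.exit(unlink(tmp))
  reader(tmp)
}

bench::mark(
  nano  = read_fresh("df_big_test_nano.parquet", nanoparquet::read_parquet),
  arrow = read_fresh("df_big_test_ar.parquet",
                     function(p) as.data.frame(arrow::read_parquet(p))),
  check = FALSE,
  min_iterations = 3
)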
> packageVersion("nanoparquet")
[1] ‘0.3.1’
> packageVersion("arrow")
[1] ‘17.0.0.1’
> R.version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 4
minor 3.2
year 2023
month 10
day 31
svn rev 85441
language R
version.string R version 4.3.2 (2023-10-31)
nickname Eye Holes
And I agree, it's likely the reason for arrow looking faster at reading, since the difference does not show anymore when I do a full cycle. :) Thanks for your package!
You need to run the dev version of nanoparquet, from the GitHub repo.