
use arrow::read_parquet instead of nanoparquet

Open BenoitLondon opened this issue 10 months ago • 8 comments

In my benchmarks I've found nanoparquet to be much less efficient than arrow in terms of speed and RAM usage:

        expression median mem_alloc   name   size
            <char>  <num>     <num> <char> <char>
 1:     df_parquet  1.153     5.578  write  small
 2: df_nanoparquet  0.674   183.986  write  small
 3:     dt_parquet  5.172     0.018  write  small
 4: dt_nanoparquet  0.656   183.876  write  small
 5:     df_parquet 10.878     0.015  write    big
 6: df_nanoparquet 10.182  2068.884  write    big
 7:     dt_parquet 11.461     0.015  write    big
 8: dt_nanoparquet 10.038  2068.947  write    big
 9:     df_parquet  0.088    34.901   read  small
10: df_nanoparquet  0.414   183.187   read  small
11:     df_parquet  1.187     0.009   read    big
12: df_nanoparquet  5.180  1324.072   read    big

Speed and RAM usage when reading big files are not very good.

On the nanoparquet repo they say:

Being single-threaded and not fully optimized, nanoparquet is probably not suited well for large data sets. It should be fine for a couple of gigabytes. Reading or writing a ~250MB file that has 32 million rows and 14 columns takes about 10-15 seconds on an M2 MacBook Pro. For larger files, use Apache Arrow or DuckDB.

rio already uses arrow for feather, so I'm not sure why we rely on nanoparquet for parquet.

If you keep nanoparquet as the default, maybe we could have an option to use arrow instead?

BenoitLondon avatar Jan 16 '25 09:01 BenoitLondon

@BenoitLondon Thank you for the benchmark.

As you asked the why question: long story short of #315, we wanted Parquet support by default. At first, rio 1.0.0 shipped with arrow, but that was quickly reverted due to installation concerns. Later, nanoparquet by @gaborcsardi was supposed to be installed by default, because it's dependency free. But it was, again, reverted due to insufficient support for Big Endian platforms (r-lib/nanoparquet#21). And therefore we have the funny state of reading parquet with nanoparquet but feather with arrow. Going back to pre-1.1, arrow was used for both parquet and feather in the so-called Suggests tier.

We are somewhat reluctant to introduce options for choosing which package to use; we are still cleaning those up from the pre-1.0 era. I don't mind switching back to arrow altogether. At the same time, I also believe that @gaborcsardi is actively developing nanoparquet to make it more efficient.

chainsawriot avatar Jan 16 '25 11:01 chainsawriot

Can you share the code for the benchmark?

Some notes:

  • The dev version of nanoparquet has a completely rewritten read_parquet(), which is much faster. (See below)
  • I suspect that you can't really compare mem_alloc, because it only includes memory allocated within R, and arrow probably allocates most of its memory in C/C++.
  • I am not totally sure how to interpret the results. E.g. does
            expression median mem_alloc   name   size
     3:     dt_parquet  5.172     0.018  write  small
     4: dt_nanoparquet  0.656   183.876  write  small
    
    mean that nanoparquet is 8 times faster here? Or 8 times slower?

Not really a good benchmark, but I just ran arrow and nanoparquet on the mentioned 33 million row data set (10x flights from nycflights13), and nanoparquet is about 2 times faster when writing, and about the same when reading. (This is with options(arrow.use_altrep = FALSE), so that arrow actually reads the data.)
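For what it's worth, here is a minimal sketch of the kind of comparison discussed above: ALTREP is disabled so arrow materializes the data eagerly, and summary() is called on the result so that lazily loaded columns cannot inflate the read timings. The file path and iteration count are placeholders; and as noted above, mem_alloc still only covers R-side allocations, so it is not a fair memory comparison.

```r
# Sketch of a read benchmark, assuming "test.parquet" exists on disk.
options(arrow.use_altrep = FALSE)  # make arrow read eagerly, not via ALTREP

res <- bench::mark(
  arrow = {
    df <- as.data.frame(arrow::read_parquet("test.parquet"))
    invisible(summary(df))  # touch every column to force materialization
  },
  nanoparquet = {
    df <- nanoparquet::read_parquet("test.parquet")
    invisible(summary(df))
  },
  check = FALSE,   # the two readers may return slightly different classes
  iterations = 3
)
res[, c("expression", "median", "mem_alloc")]
```

Note that mem_alloc in the output should be read with the caveat above in mind.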

It would be great to have a proper benchmark, but nevertheless I'll update the note in the nanoparquet README, because it is actually competitive in terms of speed. I suspect that it is also competitive in terms of memory, but we'd need a better way to measure that.

gaborcsardi avatar Jan 16 '25 12:01 gaborcsardi

Oh, thanks guys for the explanations, very much appreciated! I guess my benchmarks were not very well designed. I suspected there was some ALTREP magic behind those numbers, and the RAM figures didn't look correct either. I will use summary() after reading to make sure the data is actually loaded into R.

median is the median time of 3 iterations, so yes, in the small dataset case nano is 8 times faster than arrow.

I'm very happy to use nanoparquet if there's no downside. (My use case is basically writing/reading biggish files (1-5 GB) in R and also reading them in Python or Julia, so I wanted compatibility, speed, and low RAM usage if possible.)

Thanks again. I will share my benchmark when fixed ;)

BenoitLondon avatar Jan 17 '25 00:01 BenoitLondon

It is a question how much this generalizes, but nanoparquet does not look bad at all: https://nanoparquet.r-lib.org/dev/articles/benchmarks.html#parquet-implementations-1

gaborcsardi avatar Jan 27 '25 18:01 gaborcsardi

Thanks @gaborcsardi, I find similar results, but I'm not sure why I still find arrow to read faster in my benchmarks, though nanoparquet is faster at writing and at a full read + write cycle.

I think I disabled ALTREP properly; maybe the number of cores (I have 72 there) makes a difference.

Anyway I'm very happy to use nanoparquet through rio as performance looks on par!

        expression   size      median    mem_alloc   name                       fn filesize
            <char> <char>       <num>        <num> <char>                   <char>    <num>
 1: df_nanoparquet    big 21.28752221 4.548025e+03   full df_big_test_nano.parquet      238
 2:     df_parquet    big 26.38431233 2.466664e+03   full   df_big_test_ar.parquet      240
 3: df_nanoparquet    big  5.89444197 2.967702e+03   read df_big_test_nano.parquet      238
 4:     df_parquet    big  2.52957607 2.466656e+03   read   df_big_test_ar.parquet      240
 5: df_nanoparquet    big  8.89001248 1.580325e+03  write df_big_test_nano.parquet      238
 6:     df_parquet    big 10.45254921 1.748657e-02  write   df_big_test_ar.parquet      240
 7: dt_nanoparquet    big  8.61057447 1.580388e+03  write dt_big_test_nano.parquet      238
 8:     dt_parquet    big 10.74620440 1.768494e-02  write   dt_big_test_ar.parquet      240
 9: df_nanoparquet  small  0.55284996 1.519936e+02   full     df_test_nano.parquet        8
10:     df_parquet  small  0.45033587 8.611523e+01   full       df_test_ar.parquet        8
11: df_nanoparquet  small  0.17726478 9.927212e+01   read     df_test_nano.parquet        8
12:     df_parquet  small  0.09908626 8.806947e+01   read       df_test_ar.parquet        8
13: df_nanoparquet  small  0.24138912 5.308556e+01  write     df_test_nano.parquet        8
14:     df_parquet  small  0.34817094 5.590614e+00  write       df_test_ar.parquet        8
15: dt_nanoparquet  small  0.24777139 5.294093e+01  write     dt_test_nano.parquet        8
16:     dt_parquet  small  0.41581183 1.983643e-02  write       dt_test_ar.parquet        8

Here's my script for info: file_format_benchmark.txt


BenoitLondon avatar Jan 28 '25 12:01 BenoitLondon

@BenoitLondon With which versions of the packages?

I am also not sure if you can just run bench::mark() because Arrow or the OS may reuse the already open memory maps, so reading the same file the second time will not actually read it again.
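One way to sidestep the memory-map reuse mentioned above is to give the reader a fresh copy of the file on each iteration, so an already-open map from a previous run cannot be reused. A hedged sketch (the source path is a placeholder, and this does not defeat the OS page cache, only same-file reuse within the process):

```r
# Sketch: copy the parquet file to a new temp path before every read,
# so each iteration opens a file that has not been mapped before.
bench::mark(
  arrow = {
    tmp <- tempfile(fileext = ".parquet")
    file.copy("test.parquet", tmp)                 # fresh file per iteration
    df <- as.data.frame(arrow::read_parquet(tmp))
    unlink(tmp)                                    # clean up the copy
    nrow(df)                                       # touch the result
  },
  iterations = 3,
  check = FALSE
)
```

The copy itself adds I/O time, so this is only useful for comparing readers against each other under identical conditions, not for absolute numbers.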

But yeah, it is also true in general that the results will vary among systems. In particular, the concurrent I/O in Arrow will take advantage of more advanced I/O architectures, probably.

gaborcsardi avatar Jan 28 '25 12:01 gaborcsardi

> packageVersion("nanoparquet")
[1] ‘0.3.1’
> packageVersion("arrow")
[1] ‘17.0.0.1’
> R.version
               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          3.2                         
year           2023                        
month          10                          
day            31                          
svn rev        85441                       
language       R                           
version.string R version 4.3.2 (2023-10-31)
nickname       Eye Holes  

And I agree, that's likely the reason for arrow looking faster at reading, as the difference does not show anymore when I do a full cycle. :) Thanks for your package!

BenoitLondon avatar Jan 28 '25 15:01 BenoitLondon

You need to run the dev version of nanoparquet, from the GitHub repo.

gaborcsardi avatar Jan 28 '25 16:01 gaborcsardi