read_parquet super slow

Open PursuitOfDataScience opened this issue 3 years ago • 8 comments

My computer has 64 GB of RAM and the parquet files are not large at all. However, they take a surprisingly long time to read. I don't know what the potential cause is. If you need more info, please feel free to let me know.

image

PursuitOfDataScience avatar Jul 26 '22 23:07 PursuitOfDataScience

Hi @PursuitOfDataScience. Could you give us a little more information on how you are reading the parquet files? Are you reading them through R, PyArrow, or C++? Maybe you can share a snippet of the read. Thanks.

raulcd avatar Jul 27 '22 09:07 raulcd

Just using arrow::read_parquet() in R. Everything is regular, nothing fancy. Anything else I need to provide?

PursuitOfDataScience avatar Jul 27 '22 12:07 PursuitOfDataScience

Thanks! What version of Arrow are you using?

raulcd avatar Jul 27 '22 13:07 raulcd

arrow_8.0.0

PursuitOfDataScience avatar Jul 27 '22 23:07 PursuitOfDataScience

How were the files produced? Which compression and encoding do they use? How many columns are there?

pitrou avatar Aug 04 '22 07:08 pitrou

They were SAS datasets. I used haven::read_sas() to load them into memory and then used arrow::write_parquet() to write them to disk for future use.

PursuitOfDataScience avatar Aug 04 '22 12:08 PursuitOfDataScience

Thanks, can you answer the other questions as well: which compression and encoding do they use? How many columns are there?

pitrou avatar Aug 04 '22 13:08 pitrou

I am not sure about the first question, but there are 80 columns or so.

PursuitOfDataScience avatar Aug 04 '22 23:08 PursuitOfDataScience
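
For anyone trying to gather the details requested in this thread, the following is a minimal R sketch, not a diagnosis of this particular issue. It reports a file's column, row, and row-group counts, then times the read once as an Arrow Table and once as an R data frame, which may help show whether the time goes into Parquet decoding or into conversion to R. The temporary file and its two columns are hypothetical stand-ins so the example runs on its own; point `path` at a real file instead. The sketch does not attempt to report the compression codec or encoding.

```r
library(arrow)

# Hypothetical stand-in for one of the slow files: a small data frame
# written to a temporary Parquet file so the sketch is self-contained.
path <- tempfile(fileext = ".parquet")
write_parquet(data.frame(id = 1:1000, value = rnorm(1000)), path)

# File-level layout: number of columns, rows, and row groups.
reader <- ParquetFileReader$create(path)
layout <- c(
  columns    = reader$num_columns,
  rows       = reader$num_rows,
  row_groups = reader$num_row_groups
)
print(layout)

# Read as an Arrow Table: Parquet decoding without conversion to R vectors.
t_table <- system.time(tbl <- read_parquet(path, as_data_frame = FALSE))

# Read as an R data frame: decoding plus conversion to R vectors.
t_df <- system.time(df <- read_parquet(path))

cat("Arrow Table:", t_table[["elapsed"]], "seconds\n")
cat("data frame: ", t_df[["elapsed"]], "seconds\n")
```

If the Arrow Table read is fast and the data frame read is slow, the conversion step is the more likely place to look; if both are slow, the file layout or compression is worth checking next.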