
[BUG-REPORT] Performance issues while working with chunked data

Open wckdman opened this issue 2 years ago • 1 comments

Description

The to_pandas_df method is significantly slower on chunked data than on a single file. The experimental data consists of 100k rows and 800 columns, stored in two versions:

  • one parquet file
  • ten parquet files with 10k rows each

I apply a set of operations such as sort and sample, followed by to_pandas_df. These operations run 5 times slower on the chunked data. I also noticed that performance on CSV files is better.

Here's a code snippet:

import vaex
df = vaex.open("data/*.parquet")
df = df.sample(n=20_000, random_state=42)
pdf = df.to_pandas_df()

Software information

  • Vaex version: vaex-core==4.9.1, pyarrow==8.0.0, fastparquet==0.8.1
  • Vaex was installed via: pip
  • OS: MacOS 11.6.6

wckdman, May 25 '22 17:05

Yes, that is expected, I believe, which is why we recommend using a single file for optimal performance.

This is especially true if you do things like sample: you are then randomly accessing rows, which is the least efficient thing to do in vaex. I don't know what your use case is (whether you are exploring and need to see different bits of the data, or whether it is part of your computational process), but you want to avoid that in general.

Sometimes, if we need to shuffle or sort, we do that operation and then export the result to disk, so that subsequent read access is sequential and faster.

Perhaps @maartenbreddels can provide more info here.

JovanVeljanoski, Aug 07 '22 14:08