vaex
[BUG-REPORT] Performance issues while working with chunked data
Description
The method to_pandas_df runs significantly slower on chunked data than on a single file.
The experimental data consist of 100k rows and 800 columns, stored in two versions:
- one parquet file
- ten parquet files with 10k rows each
I apply a set of operations such as sort and sample, followed by to_pandas_df. These operations run about 5 times slower on the chunked data.
I also noticed that performance on CSV files is better.
Here's a code snippet:
import vaex
df = vaex.open("data/*.parquet")
df = df.sample(n=20_000, random_state=42)
pdf = df.to_pandas_df()
Software information
- Vaex version: vaex-core==4.9.1, pyarrow==8.0.0, fastparquet==0.8.1
- Vaex was installed via: pip
- OS: MacOS 11.6.6
Yes, that is expected, I believe, which is why we recommend using a single file for optimal performance.
Especially if you do things like sample - then you are randomly accessing rows, which is the least efficient thing to do in vaex. I don't know what your use case is (whether you are exploring and need to see different bits of the data, or it is part of your computational process), but in general you want to avoid that.
Sometimes if we need to shuffle or sort, we do that operation, and then export the result to disk, so then read access is sequential and faster.
perhaps @maartenbreddels can provide more info here.