
[Question] Large file export slowing to a crawl overnight

Open ljmartin opened this issue 2 years ago • 6 comments

Hi Vaex, I'm concatenating two parquet files using vaex with the script below. The progress bar initially estimated ~13 hours, but overnight the estimate has crept up to 45 hours, which doesn't seem right, particularly with 96 CPUs going like blazes for 12 hours now. Is that expected behaviour? Is there a better way to achieve this?

Code:

import vaex as vx

# Open both files lazily as a single concatenated dataframe.
vdf = vx.open_many(['./parquets/file_one.parquet', './parquets/file_two.parquet'])

# Export the concatenated dataframe to a single parquet file.
vdf.export('./concat.parquet',
           progress=True,
           )

The files themselves have 28 columns: a few columns of small text strings, and ints/NaNs in the rest. file_one.parquet is 185 GB and file_two.parquet is 43 GB. Thanks

ljmartin avatar May 26 '22 21:05 ljmartin

Are you writing to an SSD or an HDD? If you are just concatenating and exporting, I would expect the operation to be much faster than 13 hours; it should basically be limited by the write speed of your disk.

(@maartenbreddels )
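
As a rough way to sanity-check the sustained write speed of the volume, something like the sketch below (standard library only) should do; the test path is a placeholder, not a path from this thread:

import os
import time

# Placeholder path on the target volume; ~1 GiB is written and then removed.
path = '/path/on/volume/_write_test.bin'
chunk = b'\0' * (64 * 1024 * 1024)   # 64 MiB per write
n_chunks = 16                        # ~1 GiB total

start = time.perf_counter()
with open(path, 'wb') as fh:
    for _ in range(n_chunks):
        fh.write(chunk)
    fh.flush()
    os.fsync(fh.fileno())            # make sure data actually hits the disk
elapsed = time.perf_counter() - start

print(f'{len(chunk) * n_chunks / elapsed / 1e6:.0f} MB/s sustained write')
os.remove(path)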

JovanVeljanoski avatar May 26 '22 21:05 JovanVeljanoski

Thanks for your time @JovanVeljanoski - it's a gp2 (general-purpose SSD) volume on EC2.

ljmartin avatar May 26 '22 21:05 ljmartin

Ah, then there is probably some network overhead; I wouldn't know. Can you check whether you are reading from and writing to a resource in the very same zone? That gives optimal performance network-wise.

JovanVeljanoski avatar May 26 '22 22:05 JovanVeljanoski

I'm SSH'd into the EC2 instance and running the code directly on it, so there's no network transfer in this case; the input parquets are stored on the same disk as well. That means I think the same thing would happen if I ran this locally; it's just that I don't have enough disk space at the moment!

ljmartin avatar May 26 '22 23:05 ljmartin

Perhaps it has something to do with the size of the partitions (row groups) in the parquet file? For instance, this script runs in about 50 minutes, single-threaded:

from pyarrow import parquet
from tqdm import tqdm

f = parquet.ParquetFile('./parquets/file_one.parquet')
num_row_groups = f.num_row_groups
for nrg in tqdm(range(num_row_groups)):
    # Read one row group at a time and write it out as its own file.
    rg = f.read_row_group(nrg)
    parquet.write_table(rg, f'./chunks/file_one_{nrg}.parquet')

Each chunk is a shade over 1e6 rows. This doesn't produce a concatenated file, but it at least indicates that network speed isn't the issue.
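
For reference, one way to see how the source row groups are actually laid out is to read the parquet file metadata with pyarrow; a small sketch, reusing the path from above:

import pyarrow.parquet as pq

# Print the row-group layout of the source file.
md = pq.ParquetFile('./parquets/file_one.parquet').metadata
print(f'{md.num_row_groups} row groups, {md.num_rows} rows total')
for i in range(min(md.num_row_groups, 5)):
    rg = md.row_group(i)
    print(f'row group {i}: {rg.num_rows} rows, '
          f'{rg.total_byte_size / 1e6:.1f} MB uncompressed')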

ljmartin avatar May 27 '22 01:05 ljmartin

@ljmartin The size of the row groups definitely affected performance for me, but that was a more extreme case: my ~100 million rows were split into row groups of only 1-10 rows each, and coalescing them sped everything up. 1e6 rows per row group seems moderately large, though it is 10x smaller than the default chunk size vaex uses when writing parquet, and at 185 GB perhaps the overhead of small row groups doesn't scale linearly. If you want some starter code for doing the coalescing, see this gist.
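
This is not the linked gist, but a minimal sketch of that coalescing idea using pyarrow; the output path and the target rows per row group are illustrative assumptions, not values from this thread:

import pyarrow as pa
import pyarrow.parquet as pq

SRC = './parquets/file_one.parquet'            # input, as in the thread
DST = './parquets/file_one_coalesced.parquet'  # hypothetical output path
TARGET_ROWS = 10_000_000                       # assumed rows per output row group

f = pq.ParquetFile(SRC)
buffered, n_buffered = [], 0

with pq.ParquetWriter(DST, f.schema_arrow) as writer:
    # Stream the file in small record batches, accumulate them, and write
    # them back out as much larger row groups.
    for batch in f.iter_batches():
        buffered.append(batch)
        n_buffered += batch.num_rows
        if n_buffered >= TARGET_ROWS:
            writer.write_table(pa.Table.from_batches(buffered),
                               row_group_size=TARGET_ROWS)
            buffered, n_buffered = [], 0
    if buffered:  # flush whatever is left at the end
        writer.write_table(pa.Table.from_batches(buffered),
                           row_group_size=TARGET_ROWS)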

NickCrews avatar Aug 29 '22 04:08 NickCrews