
[FEATURE-REQUEST] Export vaex as HDF5 file while upload/stream upload to S3 Bucket

Open revathik1991 opened this issue 3 years ago • 6 comments

Since vaex supports streaming reads of hdf5 files from an S3 bucket, are there any plans to implement an API to upload (or stream-upload) a vaex DataFrame as an hdf5/arrow file to an S3 bucket?

revathik1991 avatar Nov 06 '20 15:11 revathik1991

Hi,

Yes, current master (and the latest alpha release of vaex-core) supports streaming parquet and arrow to a bucket directly. @JovanVeljanoski is currently working on the docs for this. Feel free to try it and ask if you run into issues. hdf5 to a bucket is not possible, because the hdf5 library assumes the underlying file object can be seeked (so it can go back and forth). We will write down all the options soon, along with their pros and cons.

cheers,

Maarten

maartenbreddels avatar Nov 06 '20 15:11 maartenbreddels
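
To illustrate the seekability point in the reply above, here is a minimal sketch using only the Python standard library. It does not use vaex or the hdf5 library; `NonSeekableSink`, `write_append_only`, and `write_with_backpatch` are hypothetical names standing in for a streaming upload, a streaming-friendly format, and a format that must jump back to patch earlier bytes.

```python
import io


class NonSeekableSink(io.RawIOBase):
    """Hypothetical stand-in for a streaming upload: bytes can only be appended."""

    def __init__(self):
        super().__init__()
        self.received = bytearray()

    def writable(self):
        return True

    def seekable(self):
        return False

    def seek(self, offset, whence=io.SEEK_SET):
        raise io.UnsupportedOperation("stream is not seekable")

    def write(self, data):
        self.received += bytes(data)
        return len(data)


def write_append_only(f, payload):
    """Writes strictly front to back, so a stream is enough."""
    f.write(b"HEAD")
    f.write(payload)


def write_with_backpatch(f, payload):
    """Reserves a header, writes the payload, then seeks back to fill in the size."""
    f.write(b"\x00" * 4)                          # placeholder for the payload size
    f.write(payload)
    f.seek(0)                                     # needs a seekable file object
    f.write(len(payload).to_bytes(4, "little"))


sink = NonSeekableSink()
write_append_only(sink, b"abc")                   # fine on a stream

try:
    write_with_backpatch(NonSeekableSink(), b"abc")
    backpatch_ok = True
except io.UnsupportedOperation:
    backpatch_ok = False                          # the seek is rejected

local = io.BytesIO()
write_with_backpatch(local, b"abc")               # fine on a seekable file
```

The append-only writer succeeds on the stream, while the back-patching writer only succeeds on the seekable in-memory file. This is a simplified picture of why a writer that revisits earlier bytes cannot target a pure upload stream.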

That's great news, @maartenbreddels. I'll try the stream upload using the alpha version. Since the latest alpha supports stream uploading of arrow/parquet files to S3, does it support stream reading of arrow/parquet from S3 as well? (Right now, vaex 3.0.0 supports stream reading of hdf5 only.)

revathik1991 avatar Nov 06 '20 16:11 revathik1991

Yes, it should work. The only thing that doesn't work is caching of those files, so for best performance you should use it directly from the AWS region the bucket is in. See https://twitter.com/maartenbreddels/status/1322261521204977664?s=20

maartenbreddels avatar Nov 06 '20 18:11 maartenbreddels

I'm using the latest alpha version, but I'm running into an issue:

import vaex
df.export_hdf5('s3://xxx/yyy/data.hdf5')

OSError: Unable to create file (unable to open file: name = 's3://xxx/yyy/data.hdf5', errno = 2, error message = 'No such file or directory', flags = 13, o_flags = 242)

parvathtarun avatar Feb 08 '21 21:02 parvathtarun

Same here, I still have the same problem as @parvathtarun. Did you find a solution?

rey-eb avatar Jun 28 '22 14:06 rey-eb

No, it's not possible.

hdf5 to a bucket is not possible, because the hdf5 library assumes the underlying file object can be seeked (so it can go back and forth).

I have experimented with making a stream act like a file to support this, but that work is not done yet.

maartenbreddels avatar Jul 26 '22 15:07 maartenbreddels
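
Since a direct export is not possible, one possible workaround (not proposed in this thread, and only a sketch) is to export to a seekable local file first and then copy the finished file to the bucket sequentially. The helper below uses only the standard library; `export_via_local_file` and `fake_export` are hypothetical names, and `fake_export` merely imitates a writer that seeks backwards.

```python
import io
import os
import shutil
import tempfile


def export_via_local_file(export, sink, filename="data.hdf5"):
    """Run `export(path)` against a local file, then copy the result to `sink`.

    export: callable taking a local path (assumption: something like df.export_hdf5)
    sink:   writable binary file object; it is only ever appended to
    Returns the number of bytes in the staged file.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, filename)
        export(path)                        # free to seek on local disk
        with open(path, "rb") as src:
            shutil.copyfileobj(src, sink)   # sequential copy, no seek on the sink
        return os.path.getsize(path)


def fake_export(path):
    """Hypothetical exporter that patches its header after writing the body."""
    with open(path, "wb") as f:
        f.write(b"\x00" * 4)
        f.write(b"abc")
        f.seek(0)
        f.write(b"SIZE")


buffer = io.BytesIO()                       # stands in for the remote object
n = export_via_local_file(fake_export, buffer)
```

With vaex, the call might look like `export_via_local_file(df.export_hdf5, f)`, where `f` is a file object opened for binary writing on the bucket (for example via the `s3fs` package, assuming it is installed and credentials are configured). That usage is untested here, and it requires enough local disk space to hold the whole file.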