
concat many hdf5 files fast

Open Ben-Epstein opened this issue 2 years ago • 0 comments

Helper function to concatenate many hdf5 files. Tested against hundreds of thousands of files.

I could imagine vaex using this when a user passes a glob to vaex.open: vaex could call this helper to concatenate the matching files (perhaps with the os.remove made optional) and create the final file for vaex to open.
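For readers without the original helper at hand, here is a minimal sketch of the general idea, not the code from this issue. It assumes each input file keeps its columns under `table/columns/<name>/data` (a vaex-style layout, treated here as an assumption) and that all columns are numeric or fixed-width. The name `concat_hdf5` and its parameters are hypothetical. It appends into resizable, chunked datasets and makes the removal of inputs optional, as suggested above.

```python
import os

import h5py

# Assumption: vaex-style layout, one group per column with a "data" dataset.
COLUMNS_GROUP = "table/columns"


def concat_hdf5(paths, out_path, group=COLUMNS_GROUP, remove_inputs=False):
    """Append every column of each file in `paths` into `out_path`."""
    with h5py.File(out_path, "w") as out:
        out_cols = out.require_group(group)
        for path in paths:
            with h5py.File(path, "r") as src:
                for name, col in src[group].items():
                    data = col["data"][...]  # read the whole column into memory
                    if name not in out_cols:
                        # Resizable along axis 0, which requires chunked storage.
                        dset = out_cols.create_group(name).create_dataset(
                            "data",
                            shape=(0,) + data.shape[1:],
                            maxshape=(None,) + data.shape[1:],
                            dtype=data.dtype,
                            chunks=True,
                        )
                    else:
                        dset = out_cols[name]["data"]
                    start = dset.shape[0]
                    dset.resize(start + data.shape[0], axis=0)
                    dset[start:] = data
            if remove_inputs:
                os.remove(path)  # optional cleanup of the consumed input
```

Reading one file at a time keeps peak memory bounded by the largest single input. Whether vaex can memory-map the resulting chunked file is a separate question and is not verified here.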

TODO: You'll see in the code that my handling of string columns is less than ideal. I know that vaex creates a data and an indices group for string columns. I was able to recreate that layout and append to it successfully, but was unable to get vaex to read it properly. I believe that is because vaex cannot mmap string columns from chunked hdf5 files, but that may be incorrect (it is just my best guess from reading the source code).
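To illustrate why string columns are harder to append than numeric ones, here is a small pure-Python sketch of a data-plus-indices layout. It assumes Arrow-style storage (one byte buffer plus n+1 offsets); vaex's exact on-disk layout may differ, and the function names are hypothetical. The point is that appending a second chunk means shifting its offsets by the length of the bytes already stored.

```python
def encode_strings(values):
    """Pack strings into one byte buffer plus n+1 offsets into it."""
    data = bytearray()
    indices = [0]
    for value in values:
        data += value.encode("utf-8")
        indices.append(len(data))
    return bytes(data), indices


def append_strings(data, indices, more_data, more_indices):
    """Append a second encoded chunk, shifting its offsets past the first."""
    shift = indices[-1]
    return data + more_data, indices + [shift + i for i in more_indices[1:]]


def decode_strings(data, indices):
    """Recover the strings by slicing between consecutive offsets."""
    return [data[a:b].decode("utf-8") for a, b in zip(indices, indices[1:])]
```

For example, `append_strings(*encode_strings(["ab", "c"]), *encode_strings(["de"]))` yields a buffer that decodes back to the three original strings.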

So currently the string columns come back as byte arrays and need to be cast like so:

df[col] = df[col].to_arrow().cast(pa.large_string())

I'm sure we can figure out a better solution here.

CC @maartenbreddels

Ben-Epstein, Feb 11 '22 17:02