arrow-julia

Reading only a subset of columns

Open CarlColglazier opened this issue 4 years ago • 2 comments

Please correct me if this is possible already. I looked through the source code and the documentation and did not find a clear way to do this: basically, I want to read a FeatherV2 file, but not mmap every single column. I already know which columns I need and I'd like to tell Arrow.Table the subset of columns I want read into memory.

This is similar to this issue on Feather.jl.

This seems to be possible in the R arrow package using col_select.

CarlColglazier • Dec 15 '20 14:12

Hey @CarlColglazier, thanks for opening an issue. We could probably support keyword arguments like select and drop, but note that it wouldn't change how much memory is "mmapped". Arrow tables are stored in a single memory blob and there isn't really a way to only mmap a few columns. You still have to read the header/metadata to figure out the offsets of specific columns into the data.

So, happy to support select/drop, since it can be convenient to only get back the columns you really need, but I just want to point out that I wouldn't expect there to be any real effect on memory/performance.
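To illustrate the point about a single memory blob, here is a minimal sketch in Python (standard library only) of a toy columnar file. The layout, `write_toy_table`, and `read_toy_table` are hypothetical names made up for this example; this is not Arrow's actual IPC format and not the Arrow.jl API. It only shows the general shape of the argument: the whole file is mapped, the header has to be parsed to find column offsets, and a `select` keyword merely decides which columns are handed back.

```python
import json
import mmap
import os
import struct
import tempfile


def write_toy_table(path, columns):
    """Write {name: [int, ...]} as one blob: length-prefixed JSON header, then int64 data."""
    header, body, offset = {}, b"", 0
    for name, values in columns.items():
        header[name] = [offset, len(values)]  # byte offset into the body, element count
        body += struct.pack(f"<{len(values)}q", *values)
        offset += 8 * len(values)
    meta = json.dumps(header).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(meta)) + meta + body)


def read_toy_table(path, select=None):
    """Map the whole file, parse the header, and decode only the selected columns."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        # The header must always be read: it is the only place the offsets live.
        (meta_len,) = struct.unpack_from("<I", buf, 0)
        header = json.loads(buf[4:4 + meta_len])
        body_start = 4 + meta_len
        names = list(header) if select is None else list(select)
        table = {}
        for name in names:
            offset, count = header[name]
            # Only the bytes of a selected column are touched here.
            table[name] = struct.unpack_from(f"<{count}q", buf, body_start + offset)
        return table


path = os.path.join(tempfile.mkdtemp(), "toy.bin")
write_toy_table(path, {"id": [1, 2, 3], "score": [10, 20, 30], "year": [2019, 2020, 2021]})
subset = read_toy_table(path, select=["id", "year"])
print(subset)  # {'id': (1, 2, 3), 'year': (2019, 2020, 2021)}
```

In this sketch the mapping covers the entire file whether or not `select` is given, which mirrors the caveat above: a column filter changes what is returned, not how much is mapped.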

quinnj • Dec 15 '20 15:12

I went through the Feather C++ source code, and it seems this hasn't been fixed yet in the upstream C++ API. Am I right?

JayjeetAtGithub • Jun 19 '21 13:06