DTables.jl
How does one make DTable construction lazy?
I tried
tbl = Dagger.DTable(Parquet.read_parquet, my_files)
where "my_files" is an array of paths to Parquet files that were written from a Dask dataframe. It seems to be loading everything into memory. I'd like a way to process the data out-of-core, similar to Dask; I was under the impression this was a goal for DTable. Thanks.
You can give https://github.com/JuliaData/MemPool.jl/pull/60 a try, which is my new WIP approach to swap-to-disk (just set the environment variable JULIA_MEMPOOL_EXPERIMENTAL_FANCY_ALLOCATOR=1 to enable it). I will warn you that it's not ready yet:
- Performance of swapped-out data reads is currently bad, due to not properly migrating data back to memory (instead reading from disk for every read)
- The memory usage limit is not yet tunable, and defaults to 8GB
- The disk usage limit is currently unbounded, and will use all of your disk space if you allocate too much (everything will be stored in .mempool relative to your current working directory, if you need to manually delete those files)
I plan to begin DTable testing of that PR soon but haven't had the chance to get to it yet. Do feel free to give it a spin in the meantime! I'll let you know once I've fixed the above issues.
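For anyone trying this out, here is a rough Julia sketch of the setup described above. It assumes the flag has to be set before MemPool/Dagger is loaded (setting it in the shell before starting Julia should work equally well), and that the PR branch of MemPool.jl is installed. The helpers dir_size_bytes and clear_swap_dir are hypothetical names written for this example, not part of MemPool or Dagger; they only help keep an eye on the unbounded .mempool directory and remove it afterwards.

```julia
# Enable the experimental allocator.
# Assumption: the flag must be set before MemPool/Dagger is loaded.
ENV["JULIA_MEMPOOL_EXPERIMENTAL_FANCY_ALLOCATOR"] = "1"

# using Dagger   # load only after the flag is set (needs the MemPool.jl PR branch)

# Hypothetical helper: total size in bytes of all files under `dir`.
# Returns 0 if the directory does not exist.
function dir_size_bytes(dir::AbstractString)
    isdir(dir) || return 0
    total = 0
    for (root, _, files) in walkdir(dir)
        for f in files
            total += filesize(joinpath(root, f))
        end
    end
    return total
end

# Hypothetical helper: delete leftover swap files.
# Defaults to `.mempool` relative to the current working directory.
function clear_swap_dir(dir::AbstractString = joinpath(pwd(), ".mempool"))
    isdir(dir) && rm(dir; recursive = true, force = true)
    return nothing
end
```

Calling dir_size_bytes(".mempool") from time to time during a run gives a rough idea of how much disk the swap files are taking, and clear_swap_dir() removes them once the session is finished. Only call clear_swap_dir() after the Julia processes using the data have exited.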