DTables.jl How does one make DTable construction lazy?

Open salbert83 opened this issue 3 years ago • 1 comment

I tried:

tbl = Dagger.DTable(Parquet.read_parquet, my_files)

where "my_files" is an array of paths to Parquet files written from a dask dataframe. This seems to load everything into memory. I'd like a way to process the data out-of-core, similar to dask; I was under the impression this was a goal for DTable. Thanks.

Jan 05 '22 03:01 salbert83

You can give https://github.com/JuliaData/MemPool.jl/pull/60 a try, which is my new WIP approach to swap-to-disk (just set the environment variable JULIA_MEMPOOL_EXPERIMENTAL_FANCY_ALLOCATOR=1 to enable it). I will warn you that it's not ready yet:

- Performance of swapped-out data reads is currently bad, due to not properly migrating data back to memory (instead reading from disk for every read).
- The memory usage limit is not yet tunable, and defaults to 8GB.
- The disk usage limit is currently unbounded, and will use all of your disk space if you allocate too much (everything will be stored in .mempool relative to your current working directory, if you need to manually delete those files).
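
Since the disk usage is unbounded for now, it may help to keep an eye on the .mempool directory while experimenting. The following is only a rough sketch that uses nothing but Julia's standard library: `mempool_usage` and `clear_mempool` are hypothetical helper names, not part of MemPool.jl, and the idea that the flag has to be set before Dagger/MemPool is loaded is an assumption rather than something the PR states.

```julia
# Sketch only: `mempool_usage` and `clear_mempool` are hypothetical helpers,
# not MemPool.jl API.

# Assumption: the flag is read when MemPool is loaded, so set it before
# `using Dagger` (or export it in the shell before starting Julia).
ENV["JULIA_MEMPOOL_EXPERIMENTAL_FANCY_ALLOCATOR"] = "1"

# Total size in bytes of all files under `dir` (0 if the directory is missing).
function mempool_usage(dir::AbstractString = joinpath(pwd(), ".mempool"))
    isdir(dir) || return 0
    total = 0
    for (root, _, files) in walkdir(dir)
        for f in files
            total += filesize(joinpath(root, f))
        end
    end
    return total
end

# Remove the swap directory. Only call this once no Julia process is still
# using the swapped-out data.
function clear_mempool(dir::AbstractString = joinpath(pwd(), ".mempool"))
    isdir(dir) && rm(dir; recursive = true)
    return nothing
end
```

Calling `mempool_usage()` from the working directory where the job was started reports the bytes currently swapped to disk, and `clear_mempool()` deletes them after the session is finished.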

I plan to begin DTable testing of that PR soon but haven't had the chance to get to it yet. Do feel free to give it a spin! I'll let you know once I've fixed the above issues.

Jan 05 '22 14:01 jpsamaroo
