db-benchmark

not enough memory to read 1e9 data

Open jangorecki opened this issue 6 years ago • 4 comments

Despite the csv file being < 50 GB, pandas and dask are unable to read it on a 125 GB machine; both run out of memory. As a result, the pandas and dask groupby tasks run only for the 1e7 (0.5 GB) and 1e8 (5 GB) data sizes. My understanding is that the root cause is likely the same: the memory-inefficient way DataFrames store strings.
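To illustrate why string storage could explain the gap, here is a rough stdlib-only sketch. It assumes a string column is held as one Python `str` object per value plus an 8-byte pointer (how an object-dtype column behaves, with no sharing of equal strings), and it uses made-up `id001`-style values; the helper names `csv_bytes` and `object_bytes` are hypothetical, and the numbers are not a measurement of the benchmark data.

```python
import sys


def csv_bytes(values):
    """Bytes the values occupy in a csv file: the text plus one delimiter each."""
    return sum(len(v.encode("utf-8")) + 1 for v in values)


def object_bytes(values, pointer_size=8):
    """Rough in-memory cost if every value is its own Python str object
    referenced through a pointer (assumption: no sharing of equal strings)."""
    return sum(sys.getsizeof(v) + pointer_size for v in values)


# Hypothetical short identifier strings, for illustration only.
values = [f"id{i:03d}" for i in range(1, 101)]

on_disk = csv_bytes(values)
in_memory = object_bytes(values)
ratio = in_memory / on_disk

print(f"csv: {on_disk} bytes, as Python objects: {in_memory} bytes, ratio: {ratio:.1f}x")
```

Under these assumptions a short string costs several times more in memory than in the csv file, which would be enough to push a file of < 50 GB past 125 GB of RAM. The exact ratio depends on the Python version and on how much the reader reuses equal strings.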

jangorecki avatar Oct 17 '19 05:10 jangorecki

dask addresses this issue by using on-disk data storage https://github.com/h2oai/db-benchmark/issues/126

jangorecki avatar Dec 01 '19 05:12 jangorecki

This is still an issue for pandas 1.0.3. For dask, I will check once #144 is merged.

jangorecki avatar May 13 '20 13:05 jangorecki

dask is now able to load the 1e9 data after #144.

jangorecki avatar Jun 22 '20 12:06 jangorecki

Unfortunately, it cannot complete any of the groupby queries, so I am reverting to the on-disk format again.

jangorecki avatar Jun 22 '20 20:06 jangorecki