
datashader memory consumption on very large dataset

Open · tropicalpb opened this issue on Aug 27, 2018 · 6 comments

We are dealing with a total of 140 million points from ~500 files. It is great that datashader can draw it out, but it also took 100GB of peak memory and about 40 minutes on a couple of cores to get the data into a dataframe, while overall system CPU utilization stayed quite low. Is there any way to both reduce the memory requirement and increase/parallelize the CPU usage?

Thanks

tropicalpb · Aug 27 '18 12:08

Are you already using dask? It will let you do all the computation out-of-core if necessary. That said, 140 million points shouldn't take 100GB of peak memory; how are you loading the files, and are you loading columns you're not actually using? We generally recommend saving your data out to Parquet files, which are much faster to load and support chunked reads. That way we routinely process 1-billion-point datasets on our 16GB laptops.
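For reference, a minimal sketch of a column-selective Parquet read with dask; the path and column names below are just placeholders, not your actual schema:

```python
import dask.dataframe as dd

# Placeholder path and column names. Parquet lets dask read only the
# columns you ask for, in parallel chunks, instead of parsing every
# column of every CSV.
frame = dd.read_parquet('/data/dir/points.parquet',
                        columns=['x', 'y', 'category'])
```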

philippjfr · Aug 27 '18 12:08

We are already using dask. Is there a setting to enable out-of-core computation? We support custom filtering on the other columns, so reading them in is a must.

import dask.dataframe as dd

fullpath = '/data/dir/*.csv'
frame = dd.read_csv(fullpath, dtype=dtype, parse_dates=datetimeop,
                    storage_options=None, assume_missing=True)

tropicalpb · Aug 27 '18 13:08

Dask works out of core by default, but will force everything into memory if there is a call to df.persist(). Most of our example notebooks use .persist() for speed, so if you want to work out of core, make sure you haven't copied that bit.
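A rough sketch of the difference, assuming a Parquet file with x/y columns (the path and column names are placeholders, not from your setup):

```python
import dask.dataframe as dd
import datashader as ds

df = dd.read_parquet('/data/dir/points.parquet')  # placeholder path

canvas = ds.Canvas(plot_width=800, plot_height=600)

# Out of core: leave the dataframe lazy, so each aggregation streams
# through the partitions and peak memory stays bounded.
agg = canvas.points(df, 'x', 'y')

# In core: persist() pulls every partition into RAM up front, which is
# much faster for interactive zoom/pan but requires the memory.
df = df.persist()
agg = canvas.points(df, 'x', 'y')
```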

That said, working out of core with CSV files seems like a really bad idea -- you'd end up reprocessing the CSV file every single time you zoom or pan in any plot, so CSV-file processing will dwarf any computation actually done by Datashader. Converting to something more efficient like Parquet seems like a must unless you only ever want to do a single, static visualization per file.
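A one-off conversion could look roughly like this (paths are placeholders; reuse the dtype and parse_dates options from your read_csv call):

```python
import dask.dataframe as dd

# Read the CSVs once (placeholder path; add back your dtype and
# parse_dates settings), then write Parquet so every later read is
# chunked and column-selective.
frame = dd.read_csv('/data/dir/*.csv', assume_missing=True)
frame.to_parquet('/data/dir/parquet/')
```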

jbednar · Aug 27 '18 13:08

Thanks, we changed to Parquet and memory usage came down significantly, but it still only uses a couple of cores for data loading, and most of the system's other 60 cores sit idle. We need to call persist() for interactive speed; is there any way to parallelize the loading?

tropicalpb · Aug 27 '18 14:08

How did you chunk the parquet data? As long as the data is sufficiently chunked I would expect the loading to happen in parallel.
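One way to check the chunking and, if needed, increase it; a sketch with placeholder paths and an arbitrary partition count:

```python
import dask.dataframe as dd

frame = dd.read_parquet('/data/dir/parquet/')  # placeholder path
print(frame.npartitions)  # loading parallelizes across partitions

# If there are only a handful of partitions, split into smaller chunks
# and rewrite so ~60 cores have work to do.
frame = frame.repartition(npartitions=200)
frame.to_parquet('/data/dir/parquet_rechunked/')
```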

philippjfr · Aug 27 '18 16:08

The data is in Hadoop HDFS with the default block size of 512MB. There are several hundred files, so even without further chunking, the number of files alone should allow parallel loading, but that doesn't appear to be the case.

tropicalpb · Aug 28 '18 12:08