canvas.points memory observation

Open apiszcz opened this issue 7 years ago • 7 comments

I am seeing memory double for the datashader process during the call

agg = cvs.points(df, 'x', 'y', ds.count('t'))

When the call finishes, memory returns to the level prior to the aggregation call. I have not reviewed the code, but I am wondering whether the user should have control over the data type used for the aggregation, or whether a 'hint' could be added so it can conserve memory. I suspect the count is stored as a double or a 64-bit integer; ideally an unsigned int would be an option.
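
A minimal sketch of how one might watch process memory around this call (synthetic data, psutil, and the canvas size here are placeholder assumptions, not my actual script):

    # Sketch only: synthetic data and psutil-based RSS readings.
    import datashader as ds
    import numpy as np
    import pandas as pd
    import psutil

    df = pd.DataFrame({'x': np.random.rand(1000000).astype('float32'),
                       'y': np.random.rand(1000000).astype('float32'),
                       't': np.arange(1000000, dtype='float64')})

    proc = psutil.Process()
    print('before:', proc.memory_info().rss // 2**20, 'MB')
    cvs = ds.Canvas(plot_width=3564, plot_height=982)
    agg = cvs.points(df, 'x', 'y', ds.count('t'))
    print('after: ', proc.memory_info().rss // 2**20, 'MB')
    print('agg dtype:', agg.dtype)   # shows which integer width the count actually uses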

apiszcz avatar Jul 17 '17 10:07 apiszcz

There will normally be copies of the partitioned aggregate array for each core/cpu, and then all of these will be gathered into a single array, so yes, the peak memory for the aggregate array will be double the final size. This is normally fine, since we assume the aggregate array will be much smaller than the original data, but that's an assumption that may not hold in all cases. In any case, we'd be happy to accept a PR allowing the aggregate datatype to be chosen, if that can be done cleanly.
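
To illustrate, a plain-numpy sketch of partition-wise counting (not datashader's actual code): each chunk fills its own partial aggregate, and the partials are then combined, so both generations of the array exist at once at the peak.

    # Plain-numpy sketch, not datashader's implementation.
    import numpy as np

    def aggregate_counts(index_chunks, shape):
        partials = []
        for yi, xi in index_chunks:                    # one partial aggregate per chunk/core
            partial = np.zeros(shape, dtype=np.int64)  # dtype not user-selectable today (the point of this issue)
            np.add.at(partial, (yi, xi), 1)            # unbuffered in-place counting
            partials.append(partial)
        return np.sum(partials, axis=0)                # gather partials into the final array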

jbednar avatar Jul 18 '17 21:07 jbednar

Some thoughts, not tested with datashader. Create bins for the canvas cells based on one of the following (a rough sketch of the sparse-matrix idea is shown after this list):

  • A sparse-matrix count.
  • A hash code using geohash or s2sphere.

I'm seeing some very large sizes for the count, so ideally there would be some form of hint for choosing the count variable type; uint32 would be ideal. With numpy via pandas, it seems we only have the 64-bit choice, which doubles memory and halves the usable count range since it is signed.
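
A rough sketch of the sparse-matrix idea, assuming scipy is available and that (yi, xi) are precomputed canvas bin indices; illustration only, not tested with datashader:

    # Sparse-count sketch: duplicate (row, col) pairs are summed, and the
    # count dtype can be chosen explicitly (uint32 here).
    import numpy as np
    from scipy import sparse

    def sparse_counts(yi, xi, ypixels, xpixels):
        data = np.ones(len(xi), dtype=np.uint32)
        counts = sparse.coo_matrix((data, (yi, xi)), shape=(ypixels, xpixels))
        counts.sum_duplicates()   # collapse repeated bins into per-cell counts
        return counts.tocsr()     # memory scales with occupied cells, not canvas size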

apiszcz avatar Jul 20 '17 22:07 apiszcz

Here are some measurements:

The aggregate for this image should be about 13.3 MB if grayscale / 32-bit unsigned values could be used for the count.

canvas 'xpixels': 3564, 'ypixels': 982
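
As a back-of-the-envelope check, the 13.3 MB figure follows from the canvas size and a 4-byte count per cell:

    # Expected aggregate array size for this canvas.
    xpixels, ypixels = 3564, 982
    cells = xpixels * ypixels        # 3,499,848 cells
    print(cells * 4 / 2**20)         # ~13.3 MiB with uint32 counts
    print(cells * 8 / 2**20)         # ~26.7 MiB with int64/float64 counts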

2017-07-22T16:29:24.048683> Prior cvs.points aggregation <<MEMORY>> 187 MB Working, 620 MB PeakWorking
2017-07-22T16:29:47.700369> Post cvs.points aggregation <<MEMORY>> 1,308 MB Working, 2,403 MB PeakWorking

apiszcz avatar Jul 22 '17 20:07 apiszcz

I'm not sure how to interpret those numbers; is there a reason to think that the changes in memory are primarily due to the canvas size? At least with dask dataframes, the amount of memory used depends greatly on how much of the original data is being accessed, not just the canvas.

jbednar avatar Jul 22 '17 23:07 jbednar

Hello. No to your question about the canvas size having a direct impact; I was providing it for context.

The input dataframe collection is 3,453,610 rows; each record is epoch (double), lat, lon, with lat/lon converted to 32-bit float. Measured with the following line, the dataframe is 55,264,316 bytes:

memory = df.memory_usage(deep=True).compute()
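
For context, the float32 conversion was along these lines (a sketch; the file name and column names are placeholders, not my exact script):

    # Sketch of the dtype downcast described above; 'records.csv' and the
    # column names are placeholders.
    import dask.dataframe as dd

    df = dd.read_csv('records.csv')                       # epoch (float64), lat, lon
    df = df.astype({'lat': 'float32', 'lon': 'float32'})  # halve lat/lon storage
    memory = df.memory_usage(deep=True).compute()         # per-column byte counts
    print(memory.sum())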

apiszcz avatar Jul 23 '17 07:07 apiszcz

Attaching dask profiler views

[dask profiler screenshot]

apiszcz avatar Jul 25 '17 18:07 apiszcz

I am testing the 2DH (2D histogram) method with the same input data as the datashader agg/count approach. The 2DH approach uses much less memory (~4 GB in this case) and completes, whereas the datashader aggregation runs out of memory on a 32 GB system.

The 2DH implementation is shown at: https://github.com/bokeh/datashader/issues/457

The input data count is just over 3.4 million rows.

{'bytes': 1399511520, 'gbytes': 1.3033966720104218, 'mbytes': 1334.6781921386719, 'xpixels': 35640, 'ypixels': 9817}
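
For reference, a rough sketch of the 2DH idea with a uint32 accumulator (the actual implementation is in the issue linked above; the ranges and the chunk iterable are placeholders):

    # Chunked 2-D histogram count into a preallocated uint32 array.
    import numpy as np

    def hist2d_counts(chunks, x_range, y_range, xpixels, ypixels):
        counts = np.zeros((ypixels, xpixels), dtype=np.uint32)   # ~1.3 GB at 35640 x 9817
        (x0, x1), (y0, y1) = x_range, y_range
        for x, y in chunks:                                      # one partition at a time
            xi = ((x - x0) / (x1 - x0) * (xpixels - 1)).astype(np.intp)
            yi = ((y - y0) / (y1 - y0) * (ypixels - 1)).astype(np.intp)
            keep = (xi >= 0) & (xi < xpixels) & (yi >= 0) & (yi < ypixels)
            np.add.at(counts, (yi[keep], xi[keep]), 1)           # in-place count increment
        return counts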

apiszcz avatar Sep 10 '17 12:09 apiszcz