canvas.points memory observation
I am seeing memory double for the datashader process during the call
agg = cvs.points(df, 'x', 'y', ds.count('t'))
When the call finishes, memory returns to the amount prior to the aggregate call. I have not reviewed the code, but I wonder whether the user should have control of the variable type used for the aggregation, or be able to pass a 'hint' if that can conserve memory. I'm speculating this is using a double or a 64-bit integer; ideally an unsigned int would be an option.
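For reference, here is roughly what that call looks like end to end. The column names, canvas size, and synthetic data below are placeholders, and the astype downcast is only a post-hoc workaround that shrinks the array you keep, not the peak memory used during aggregation:

```python
import datashader as ds
import numpy as np
import pandas as pd

# Placeholder data with the same column layout as the report: x, y, and a value column t.
n = 1_000_000
df = pd.DataFrame({
    'x': np.random.uniform(-180, 180, n).astype('float32'),
    'y': np.random.uniform(-90, 90, n).astype('float32'),
    't': np.random.uniform(0, 1, n),
})

cvs = ds.Canvas(plot_width=3564, plot_height=982)
agg = cvs.points(df, 'x', 'y', ds.count('t'))  # the aggregate dtype is chosen by the reduction, not the caller

# Workaround: downcast the finished aggregate to a smaller unsigned type.
agg_small = agg.astype('uint32')
```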
There will normally be copies of the partitioned aggregate array for each core/cpu, and then all of these will be gathered into a single array, so yes, the peak memory for the aggregate array will be double the final size. This is normally fine, since we assume the aggregate array will be much smaller than the original data, but that's an assumption that may not hold in all cases. In any case, we'd be happy to accept a PR allowing the aggregate datatype to be chosen, if that can be done cleanly.
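To illustrate that pattern (a sketch of the general split-aggregate-combine idea, not datashader's actual implementation): each partition fills its own full-size grid, and the partition grids plus the combined result all exist at once during the gather step, which is where the peak comes from.

```python
import numpy as np

def aggregate_partition(xs, ys, width, height, x_range, y_range):
    """Count the points of one partition into a full-size (height, width) grid."""
    grid, _, _ = np.histogram2d(ys, xs, bins=[height, width],
                                range=[y_range, x_range])
    return grid

# Four partitions -> four full-size intermediate grids exist simultaneously,
# plus the combined result, hence the higher peak memory.
partitions = [np.random.rand(2, 250_000) for _ in range(4)]
per_partition = [aggregate_partition(p[0], p[1], 3564, 982, (0, 1), (0, 1))
                 for p in partitions]
final = np.sum(per_partition, axis=0)
```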
Some thoughts, not tested with datashader. Create bins for canvas cells based on the following.
- Sparse matrix test (a rough sketch of this idea follows the list).
- Hash code using geohash or s2sphere.

I'm seeing some very large sizes for the count; ideally there would be some form of hint for choosing the count variable type. uint32 would be ideal, but with numpy via pandas it seems we only have the 64-bit choice, which doubles memory and, since the range is signed, halves the maximum count.
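A rough sketch of the sparse-count idea from the first bullet, using a SciPy COO matrix with an explicit uint32 dtype. The binning arithmetic, column layout, and canvas size here are assumptions for illustration, not datashader code:

```python
import numpy as np
from scipy import sparse

def sparse_count(x, y, width, height, x_range, y_range):
    """Count points per canvas cell into a sparse matrix with a uint32 dtype."""
    # Map coordinates to integer bin indices within the canvas.
    cols = np.clip(((x - x_range[0]) / (x_range[1] - x_range[0]) * width).astype(np.int64),
                   0, width - 1)
    rows = np.clip(((y - y_range[0]) / (y_range[1] - y_range[0]) * height).astype(np.int64),
                   0, height - 1)
    # Duplicate (row, col) pairs are summed when converting COO -> CSR,
    # so a data vector of ones becomes per-cell counts.
    return sparse.coo_matrix(
        (np.ones(len(x), dtype=np.uint32), (rows, cols)),
        shape=(height, width)).tocsr()

x = np.random.uniform(-180, 180, 1_000_000)
y = np.random.uniform(-90, 90, 1_000_000)
counts = sparse_count(x, y, 3564, 982, (-180, 180), (-90, 90))
```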
Here are some measurements:
This image should be 13.3 MB if we could use grayscale or 32-bit unsigned values for the count.
canvas 'xpixels': 3564, 'ypixels': 982
2017-07-22T16:29:24.048683> Prior cvs.points aggregation <<MEMORY>> 187 MB Working, 620 MB PeakWorking
2017-07-22T16:29:47.700369> Post cvs.points aggregation <<MEMORY>> 1,308 MB Working, 2,403 MB PeakWorking
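That 13.3 MB figure is consistent with one 32-bit value per canvas cell, and would double at 64 bits:

```python
xpixels, ypixels = 3564, 982
print(xpixels * ypixels * 4 / 2**20)  # ~13.35 MB with a 32-bit count per cell
print(xpixels * ypixels * 8 / 2**20)  # ~26.70 MB with a 64-bit count per cell
```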
I'm not sure how to interpret those numbers; is there a reason to think that the changes in memory are primarily due to the canvas size? At least with dask dataframes, the amount of memory used depends greatly on how much of the original data is being accessed, not just the canvas.
Hello. No to your question about a direct impact from the canvas size; I was providing that for context.
The input dataframe collection is 3,453,610 rows; a record is epoch (double) plus lat and lon, which are converted to 32-bit float. Using the following line, the dataframe is 55,264,316 bytes:
memory = df.memory_usage(deep=True).compute()
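Note that memory_usage(deep=True) returns per-column byte counts, so the 55,264,316 total is their sum (plus the index). As a rough check, assuming one float64 epoch and two float32 coordinates per row:

```python
rows = 3_453_610
bytes_per_row = 8 + 4 + 4      # epoch (float64) + lat, lon (float32)
print(rows * bytes_per_row)    # 55,257,760 bytes, close to the reported 55,264,316
```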
Attaching dask profiler views
I am testing the 2DH method with the same input data as the datashader agg-and-count approach. The 2DH is using much less memory, ~4 GB in this case, and completing, whereas the datashader agg runs out of memory on a 32 GB system.
The 2DH implementation is shown at: https://github.com/bokeh/datashader/issues/457
The input data count is just over 3.4 million rows.
{'bytes': 1399511520, 'gbytes': 1.3033966720104218, 'mbytes': 1334.6781921386719, 'xpixels': 35640, 'ypixels': 9817}
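For comparison, here is a sketch of a chunked 2D-histogram count, assuming '2DH' refers to a numpy.histogram2d-style approach as in the linked issue; the chunking, dtype, and canvas size below are illustrative rather than the actual code from #457:

```python
import numpy as np

def histogram2d_count(chunks, width, height, x_range, y_range):
    """Accumulate per-chunk 2D histograms into a single uint32 count grid."""
    counts = np.zeros((height, width), dtype=np.uint32)
    for x, y in chunks:                       # chunks yields (x, y) coordinate arrays
        h, _, _ = np.histogram2d(y, x, bins=[height, width],
                                 range=[y_range, x_range])
        counts += h.astype(np.uint32)         # only one chunk-sized grid is alive at a time
    return counts

# Example: ~3.4M points processed in chunks of 500k, on an illustrative canvas size.
rng = np.random.default_rng(0)
chunks = ((rng.uniform(-180, 180, 500_000), rng.uniform(-90, 90, 500_000))
          for _ in range(7))
counts = histogram2d_count(chunks, 3564, 982, (-180, 180), (-90, 90))
```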