
fix: Improve `compute_chunksize` for downsampled data

Open · hoxbro opened this issue 3 months ago · 8 comments

Previously, the following code would fill all of my memory and then freeze, with the system either killing the process or forcing me to do a manual shutdown.

I haven't profiled whether this change affects performance, but at least it no longer crashes my computer. If it does hurt performance, we could consider moving the functionality into resample_2d_distributed.

import numpy as np
import dask.array as da
import datashader as ds
import xarray as xr
import dask

print(dask.__version__)

# create large dask array
N = 100_000
dask_array = da.random.random((N, N), chunks=(1000, 1000))  # .compute()
# convert to a dask-backed xarray DataArray
dask_xarray = xr.DataArray(
    dask_array,
    dims=["x", "y"],
    coords={"x": np.arange(N), "y": np.arange(N)},
    name="example_data",  # Name of the data variable
)
# create plot using Datashader
arr = ds.Canvas(plot_height=300, plot_width=300).raster(dask_xarray)
arr.compute()

First reported here: https://stackoverflow.com/questions/79753007/high-ram-usage-when-using-datashader-with-dasked-xarray
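To illustrate why the chunk-size computation matters here, below is a minimal stdlib-only sketch of the kind of heuristic such a computation could use. This is illustrative only, not Datashader's actual `compute_chunksize`; the helper name `suggested_chunksize` and the 128 MiB per-chunk target are assumptions for the example.

```python
# Hedged sketch (not Datashader's implementation): when downsampling a
# 100_000 x 100_000 float64 array to 300 x 300, each output pixel covers
# roughly a 334 x 334 window of input cells.  A chunk edge aligned to that
# window, capped by a per-chunk byte budget, keeps each task's working set
# bounded instead of letting a rechunk step materialize huge intermediates.
import math

def suggested_chunksize(src_size, out_size, target_bytes=128 * 2**20, itemsize=8):
    """Pick a square chunk edge that is a whole multiple of the downsampling
    window and whose chunk fits within target_bytes (names are illustrative)."""
    window = math.ceil(src_size / out_size)           # input cells per output pixel
    max_edge = int(math.sqrt(target_bytes / itemsize))  # largest edge within budget
    # Round down to a whole number of windows, but keep at least one window.
    edge = max(window, (max_edge // window) * window)
    return edge

edge = suggested_chunksize(100_000, 300)
print(edge, edge * edge * 8 / 2**20, "MiB per chunk")  # 4008, ~122.6 MiB
```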

hoxbro avatar Sep 04 '25 18:09 hoxbro

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 88.34%. Comparing base (f44670c) to head (3863e28).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1448   +/-   ##
=======================================
  Coverage   88.33%   88.34%           
=======================================
  Files          96       96           
  Lines       18901    18905    +4     
=======================================
+ Hits        16696    16701    +5     
+ Misses       2205     2204    -1     


codecov[bot] avatar Sep 04 '25 19:09 codecov[bot]

Hi, I am the original poster of the Stack Overflow question you linked. Thank you for looking into this. I installed Datashader from the branch you created. I ran the same code, but I am now getting a different error; see below.

PS: Sorry, not sure how to properly format the error message here.

Traceback (most recent call last)
Cell In[3], line 2
      1 # create plot using Datashader
----> 2 tf.shade(ds.Canvas(plot_height=300, plot_width=300).raster(dask_xarray))

File D:\ProgramData\environments\test_ds\lib\site-packages\datashader\core.py:1155, in Canvas.raster(self, source, layer, upsample_method, downsample_method, nan_value, agg, interpolate, chunksize, max_mem)
   1151         data = resample_2d_distributed(
   1152             source_window, chunksize=chunksize, max_mem=max_mem,
   1153             **kwargs)
   1154     else:
-> 1155         data = resample_2d(source_window, **kwargs)
   1156     layers = 1
   1157 else:

File D:\ProgramData\environments\test_ds\lib\site-packages\datashader\resampling.py:347, in resample_2d(src, w, h, ds_method, us_method, fill_value, mode_rank, x_offset, y_offset, out)
    344 if isinstance(src, np.ma.MaskedArray):
    345     src = src.data
--> 347 resampled = _resample_2d(src, mask, use_mask, ds_method, us_method,
    348                          fill_value, mode_rank, x_offset, y_offset, out)
    349 return _mask_or_not(resampled, src, fill_value)

File D:\ProgramData\environments\test_ds\lib\site-packages\datashader\resampling.py:499, in _resample_2d(src, mask, use_mask, ds_method, us_method, fill_value, mode_rank, x_offset, y_offset, out)
    497 def _resample_2d(src, mask, use_mask, ds_method, us_method, fill_value,
    498                  mode_rank, x_offset, y_offset, out):
--> 499     src_w, src_h, out_w, out_h = _get_dimensions(src, out)
    500     x0_off, x1_off = x_offset
    501     y0_off, y1_off = y_offset

File D:\ProgramData\environments\test_ds\lib\site-packages\numba\core\dispatcher.py:424, in _DispatcherBase._compile_for_args(self, *args, **kws)
    420         msg = (f"{str(e).rstrip()} \n\nThis error may have been caused "
    421                f"by the following argument(s):\n{args_str}\n")
    422         e.patch_message(msg)
--> 424     error_rewrite(e, 'typing')
    425 except errors.UnsupportedError as e:
    426     # Something unsupported is present in the user code, add help info
    427     error_rewrite(e, 'unsupported_error')

File D:\ProgramData\environments\test_ds\lib\site-packages\numba\core\dispatcher.py:365, in _DispatcherBase._compile_for_args.<locals>.error_rewrite(e, issue_type)
    363     raise e
    364 else:
--> 365     raise e.with_traceback(None)

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type pyobject
During: typing of argument at D:\ProgramData\environments\test_ds\lib\site-packages\datashader\resampling.py (488)

File "D:\ProgramData\environments\test_ds\lib\site-packages\datashader\resampling.py", line 488:
def _get_fill_value(fill_value, src, out):
    <source elided>

@ngjit
^

During: Pass nopython_type_inference 

This error may have been caused by the following argument(s):
- argument 0: Cannot determine Numba type of <class 'dask.array.core.Array'>

Nanoputian628 avatar Sep 05 '25 05:09 Nanoputian628

I cannot recreate.

[image]

Looking at the line numbers, it could be that you haven't installed it correctly, haven't installed it into the right environment, or haven't restarted the notebook.

For example, the text at resampling.py:499 in your traceback does not match up with this branch.

hoxbro avatar Sep 05 '25 08:09 hoxbro

Sorry, I copied the wrong error message. I had created a new environment where I installed the latest version of all the required packages. I then manually copied over your changes to the resampling.py script (for some reason I couldn't install Datashader directly from your GitHub branch). When I ran the code, I got the same @ngjit decorator error as above. The error message from that run is at the end of this comment. I also created another new environment with the latest packages and without your change, ran the same code, and got the same @ngjit decorator error (that error message is the one I mistakenly copied in my earlier message). So this seems to be a separate issue? I will open a separate ticket for it.

Anyway, I created another new environment, and this time I installed Datashader 0.16.1, which is the version I was using in my original environment. I then manually applied your changes to the resampling.py script. I can now run the code without any errors. RAM increases by about 8 GB, and it takes about 1-2 minutes to run. So I'm happy to confirm that the issue is solved. Thanks for your help!

I just also have some related general questions for my own understanding. Would be much appreciated if you had the time to answer them.

  1. Does the chunk size that I set on my dask array determine in any way the chunk size used in Datashader? Based on the error I had and the changes you have made, it seems like Datashader determines its own suitable chunk size?
  2. Depending on the answer above, is there a rough guideline for how much RAM I need available when creating a plot in Datashader using dask? For example, would I need 5 times the memory size of a single chunk?
  3. Are you aware of any tutorials that explain how to use Datashader and dask together (even better if they also use HoloViews/GeoViews)? I have seen a couple of guides, but they are all quite minimal examples and don't go into any detail about how dask is being used. I am very fuzzy on dask, so it would be great to better understand things such as how many workers and threads to use, how to set various memory limits, etc.

Many thanks!

TypingError                               Traceback (most recent call last)
Cell In[3], line 2
      1 # create plot using Datashader
----> 2 tf.shade(ds.Canvas(plot_height=300, plot_width=300).raster(dask_xarray))

File D:\ProgramData\environments\ds_test\lib\site-packages\datashader\core.py:1155, in Canvas.raster(self, source, layer, upsample_method, downsample_method, nan_value, agg, interpolate, chunksize, max_mem)
   1151         data = resample_2d_distributed(
   1152             source_window, chunksize=chunksize, max_mem=max_mem,
   1153             **kwargs)
   1154     else:
-> 1155         data = resample_2d(source_window, **kwargs)
   1156     layers = 1
   1157 else:

File D:\ProgramData\environments\ds_test\lib\site-packages\datashader\resampling.py:353, in resample_2d(src, w, h, ds_method, us_method, fill_value, mode_rank, x_offset, y_offset, out)
    350 if isinstance(src, np.ma.MaskedArray):
    351     src = src.data
--> 353 resampled = _resample_2d(src, mask, use_mask, ds_method, us_method,
    354                          fill_value, mode_rank, x_offset, y_offset, out)
    355 return _mask_or_not(resampled, src, fill_value)

File D:\ProgramData\environments\ds_test\lib\site-packages\datashader\resampling.py:505, in _resample_2d(src, mask, use_mask, ds_method, us_method, fill_value, mode_rank, x_offset, y_offset, out)
    503 def _resample_2d(src, mask, use_mask, ds_method, us_method, fill_value,
    504                  mode_rank, x_offset, y_offset, out):
--> 505     src_w, src_h, out_w, out_h = _get_dimensions(src, out)
    506     x0_off, x1_off = x_offset
    507     y0_off, y1_off = y_offset

File D:\ProgramData\environments\ds_test\lib\site-packages\numba\core\dispatcher.py:424, in _DispatcherBase._compile_for_args(self, *args, **kws)
    420         msg = (f"{str(e).rstrip()} \n\nThis error may have been caused "
    421                f"by the following argument(s):\n{args_str}\n")
    422         e.patch_message(msg)
--> 424     error_rewrite(e, 'typing')
    425 except errors.UnsupportedError as e:
    426     # Something unsupported is present in the user code, add help info
    427     error_rewrite(e, 'unsupported_error')

File D:\ProgramData\environments\ds_test\lib\site-packages\numba\core\dispatcher.py:365, in _DispatcherBase._compile_for_args.<locals>.error_rewrite(e, issue_type)
    363     raise e
    364 else:
--> 365     raise e.with_traceback(None)

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type pyobject
During: typing of argument at D:\ProgramData\environments\ds_test\lib\site-packages\datashader\resampling.py (494)

File "D:\ProgramData\environments\ds_test\lib\site-packages\datashader\resampling.py", line 494:
def _get_fill_value(fill_value, src, out):
    <source elided>

@ngjit
^

During: Pass nopython_type_inference 

This error may have been caused by the following argument(s):
- argument 0: Cannot determine Numba type of <class 'dask.array.core.Array'>

Nanoputian628 avatar Sep 06 '25 00:09 Nanoputian628

This pull request has been mentioned on HoloViz Discourse. There might be relevant details there:

https://discourse.holoviz.org/t/error-when-creating-plot-from-dask-xarray-using-latest-package-versions/8926/1

holovizbot avatar Sep 07 '25 18:09 holovizbot

I also created another new environment with the latest packages and without your change. I ran the same code and got the same @ngjit decorator error (that error message is the one I mistakenly copied in my earlier message). So this seems to be a separate issue? I will open a separate ticket for it.

I can recreate the issue with uv and will investigate; I'm not entirely sure what is causing it. Can you open an issue so we don't forget about it? It is out of scope for this PR.

uv venv --python 3.13
uv pip install dask datashader

For your questions: dask should generally be smart enough not to max out your memory. I'm not sure whether the problem lies in Datashader or dask, but what I'm doing so far in this PR is changing the chunk size to avoid the memory blow-up.
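As a rough illustration for questions 1 and 2, here is the back-of-envelope memory arithmetic for the array in the original reproducer (100_000 × 100_000 float64, 1000 × 1000 chunks). These are estimates, not a Datashader guarantee; actual peak RAM also depends on how many chunks the scheduler holds in flight and on any intermediates created during resampling.

```python
# Back-of-envelope memory math for the reproducer above (illustrative only).
N = 100_000     # array edge length
chunk = 1_000   # chunk edge length, as set in the reproducer
itemsize = 8    # bytes per float64 element

total_gb = N * N * itemsize / 1e9            # full array if materialized
per_chunk_mb = chunk * chunk * itemsize / 1e6  # one chunk's footprint
n_chunks = (N // chunk) ** 2                 # number of chunks in the graph

print(f"full array: {total_gb:.0f} GB in {n_chunks} chunks of {per_chunk_mb:.0f} MB")
# Peak RAM is roughly (chunks held concurrently) x (chunk size) plus any
# intermediates from a rechunk/merge step, which is why a badly chosen chunk
# size for the resampling step can blow far past a single chunk's footprint.
```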

hoxbro avatar Sep 08 '25 06:09 hoxbro

CodSpeed Instrumentation Performance Report

Merging #1448 will improve performance by 11.6%

Comparing fix_dont_crash (3863e28) with main (f44670c)

Summary

⚡ 2 improvements
✅ 41 untouched benchmarks

Benchmarks breakdown

Benchmark BASE HEAD Change
test_quadmesh_raster[256] 15.9 ms 14.3 ms +11.57%
test_dask_raster[8192] 3.9 s 3.5 s +11.6%

codspeed-hq[bot] avatar Sep 08 '25 06:09 codspeed-hq[bot]

I can recreate the issue with uv and will investigate; I'm not entirely sure what is causing this problem. Can you open an issue, so we don't forget about it, as it is out of scope for this PR.

It is because you also need to install pyarrow. I will consider how to improve this information, as it is not obvious.
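For anyone hitting the same TypingError, the workaround for now is to install pyarrow alongside the other packages (shown here for a uv environment like the one above):

```shell
# Workaround for the Numba TypingError: the dask raster path also needs pyarrow
uv venv --python 3.13
uv pip install dask datashader pyarrow
```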

hoxbro avatar Sep 08 '25 07:09 hoxbro