fix: Improve `compute_chunksize` for downsampled data
Previously, the following code would fill up all available memory and then freeze, ending with the system either killing the process or requiring a manual shutdown.
I haven't done any profiling to see whether this change affects performance, but at least it no longer crashes my computer. If it does hurt performance, we can maybe move the functionality into `resample_2d_distributed`.
```python
import numpy as np
import dask.array as da
import datashader as ds
import xarray as xr
import dask

print(dask.__version__)

# Create a large dask array
N = 100_000
dask_array = da.random.random((N, N), chunks=(1000, 1000))  # .compute()

# Convert to a dask-backed xarray DataArray
dask_xarray = xr.DataArray(
    dask_array,
    dims=["x", "y"],
    coords={"x": np.arange(N), "y": np.arange(N)},
    name="example_data",  # Name of the data variable
)

# Create plot using Datashader
arr = ds.Canvas(plot_height=300, plot_width=300).raster(dask_xarray)
arr.compute()
```
First reported here: https://stackoverflow.com/questions/79753007/high-ram-usage-when-using-datashader-with-dasked-xarray
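For scale, some quick arithmetic (mine, not from the PR): the array above is 100,000 × 100,000 float64 values, about 80 GB in total, while each (1000, 1000) input chunk is only 8 MB, so the blow-up comes from the resampling step rather than from the input chunking. As a possible workaround on released versions, `Canvas.raster` also accepts `chunksize` and `max_mem` parameters (visible in its signature in the tracebacks below, which route through `resample_2d_distributed` when they are set); the following sketch is an untested assumption on my part, not something verified in this PR:

```python
# Hypothetical workaround sketch: bound the memory used by the resampling
# step via Canvas.raster's max_mem parameter (assumed to be in bytes).
arr = ds.Canvas(plot_height=300, plot_width=300).raster(
    dask_xarray,
    max_mem=2 * 1024**3,  # assumption: ~2 GiB budget for resampling
)
arr.compute()
```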
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 88.34%. Comparing base (f44670c) to head (3863e28).
Additional details and impacted files
|          | main   | #1448  | +/- |
|----------|--------|--------|-----|
| Coverage | 88.33% | 88.34% |     |
| Files    | 96     | 96     |     |
| Lines    | 18901  | 18905  | +4  |
| Hits     | 16696  | 16701  | +5  |
| Misses   | 2205   | 2204   | -1  |
Hi, I am the original poster of the Stack Overflow question you linked. Thank you for looking into this. I installed Datashader from the branch you created and ran the same code, but I am now getting a different error; see below.
PS: Sorry, not sure how to properly format the error message here.
TypingError Traceback (most recent call last)
Cell In[3], line 2
1 # create plot using Datashader
----> 2 tf.shade(ds.Canvas(plot_height=300, plot_width=300).raster(dask_xarray))
File D:\ProgramData\environments\test_ds\lib\site-packages\datashader\core.py:1155, in Canvas.raster(self, source, layer, upsample_method, downsample_method, nan_value, agg, interpolate, chunksize, max_mem)
1151 data = resample_2d_distributed(
1152 source_window, chunksize=chunksize, max_mem=max_mem,
1153 **kwargs)
1154 else:
-> 1155 data = resample_2d(source_window, **kwargs)
1156 layers = 1
1157 else:
File D:\ProgramData\environments\test_ds\lib\site-packages\datashader\resampling.py:347, in resample_2d(src, w, h, ds_method, us_method, fill_value, mode_rank, x_offset, y_offset, out)
344 if isinstance(src, np.ma.MaskedArray):
345 src = src.data
--> 347 resampled = _resample_2d(src, mask, use_mask, ds_method, us_method,
348 fill_value, mode_rank, x_offset, y_offset, out)
349 return _mask_or_not(resampled, src, fill_value)
File D:\ProgramData\environments\test_ds\lib\site-packages\datashader\resampling.py:499, in _resample_2d(src, mask, use_mask, ds_method, us_method, fill_value, mode_rank, x_offset, y_offset, out)
497 def _resample_2d(src, mask, use_mask, ds_method, us_method, fill_value,
498 mode_rank, x_offset, y_offset, out):
--> 499 src_w, src_h, out_w, out_h = _get_dimensions(src, out)
500 x0_off, x1_off = x_offset
501 y0_off, y1_off = y_offset
File D:\ProgramData\environments\test_ds\lib\site-packages\numba\core\dispatcher.py:424, in _DispatcherBase._compile_for_args(self, *args, **kws)
420 msg = (f"{str(e).rstrip()} \n\nThis error may have been caused "
421 f"by the following argument(s):\n{args_str}\n")
422 e.patch_message(msg)
--> 424 error_rewrite(e, 'typing')
425 except errors.UnsupportedError as e:
426 # Something unsupported is present in the user code, add help info
427 error_rewrite(e, 'unsupported_error')
File D:\ProgramData\environments\test_ds\lib\site-packages\numba\core\dispatcher.py:365, in _DispatcherBase._compile_for_args.<locals>.error_rewrite(e, issue_type)
363 raise e
364 else:
--> 365 raise e.with_traceback(None)
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type pyobject
During: typing of argument at D:\ProgramData\environments\test_ds\lib\site-packages\datashader\resampling.py (488)
File "D:\ProgramData\environments\test_ds\lib\site-packages\datashader\resampling.py", line 488:
def _get_fill_value(fill_value, src, out):
<source elided>
@ngjit
^
During: Pass nopython_type_inference
This error may have been caused by the following argument(s):
- argument 0: Cannot determine Numba type of <class 'dask.array.core.Array'>
I cannot reproduce this.
Looking at the line numbers, it could be that you haven't installed the branch correctly, haven't installed it into the right environment, or haven't restarted the notebook.
For example, the source text at resampling.py:499 in your traceback does not match this branch.
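One quick way to check which copy of datashader a notebook is actually importing (my suggestion; `inspect.getsourcefile` is standard library, nothing datashader-specific):

```python
import inspect
import datashader

print(datashader.__version__)             # installed version string
print(inspect.getsourcefile(datashader))  # path the notebook is importing from
```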
Sorry, I copied the wrong error message. I had created a new environment with the latest version of all the required packages. I then manually copied your changes into the resampling.py script (for some reason I couldn't install Datashader directly from your GitHub branch). When I ran the code, I got the same @ngjit decorator error as above. The error message from that run is at the end of this comment. I also created another new environment with the latest packages and without your change. I ran the same code and got the same @ngjit decorator error (that error message is the one I mistakenly copied in my earlier message). So this seems to be a separate issue? I will open a separate ticket for this.
Anyway, I created another new environment, and this time I installed Datashader 0.16.1, which is the version I was using in my original environment. I then manually applied your changes to the resampling.py script. I can now run the code without any errors. RAM usage increases by about 8 GB and it takes about 1-2 minutes to run. So I am happy to confirm that the issue is solved. Thanks for your help!
I also have some related general questions for my own understanding. I would much appreciate it if you had the time to answer them.
- Does the chunk size that I set on my dask array determine in any way the chunk size used in Datashader? Based on the error I had and the changes you have made, it seems like Datashader determines its own suitable chunk size?
- Depending on the answer above, is there a rough guideline for how much RAM I need available when creating a plot in Datashader using dask? For example, would I need five times the memory size of a single chunk?
- Are you aware of any tutorials that explain how to use Datashader and dask together (even better if they also use HoloViews/GeoViews)? I have seen a couple of guides, but they are all quite minimal examples and don't go into any detail about how dask is being used. I am very fuzzy on dask, so it would be great to better understand things such as how many workers and threads to use, how to set various memory limits, etc.
Many thanks!
TypingError Traceback (most recent call last)
Cell In[3], line 2
1 # create plot using Datashader
----> 2 tf.shade(ds.Canvas(plot_height=300, plot_width=300).raster(dask_xarray))
File D:\ProgramData\environments\ds_test\lib\site-packages\datashader\core.py:1155, in Canvas.raster(self, source, layer, upsample_method, downsample_method, nan_value, agg, interpolate, chunksize, max_mem)
1151 data = resample_2d_distributed(
1152 source_window, chunksize=chunksize, max_mem=max_mem,
1153 **kwargs)
1154 else:
-> 1155 data = resample_2d(source_window, **kwargs)
1156 layers = 1
1157 else:
File D:\ProgramData\environments\ds_test\lib\site-packages\datashader\resampling.py:353, in resample_2d(src, w, h, ds_method, us_method, fill_value, mode_rank, x_offset, y_offset, out)
350 if isinstance(src, np.ma.MaskedArray):
351 src = src.data
--> 353 resampled = _resample_2d(src, mask, use_mask, ds_method, us_method,
354 fill_value, mode_rank, x_offset, y_offset, out)
355 return _mask_or_not(resampled, src, fill_value)
File D:\ProgramData\environments\ds_test\lib\site-packages\datashader\resampling.py:505, in _resample_2d(src, mask, use_mask, ds_method, us_method, fill_value, mode_rank, x_offset, y_offset, out)
503 def _resample_2d(src, mask, use_mask, ds_method, us_method, fill_value,
504 mode_rank, x_offset, y_offset, out):
--> 505 src_w, src_h, out_w, out_h = _get_dimensions(src, out)
506 x0_off, x1_off = x_offset
507 y0_off, y1_off = y_offset
File D:\ProgramData\environments\ds_test\lib\site-packages\numba\core\dispatcher.py:424, in _DispatcherBase._compile_for_args(self, *args, **kws)
420 msg = (f"{str(e).rstrip()} \n\nThis error may have been caused "
421 f"by the following argument(s):\n{args_str}\n")
422 e.patch_message(msg)
--> 424 error_rewrite(e, 'typing')
425 except errors.UnsupportedError as e:
426 # Something unsupported is present in the user code, add help info
427 error_rewrite(e, 'unsupported_error')
File D:\ProgramData\environments\ds_test\lib\site-packages\numba\core\dispatcher.py:365, in _DispatcherBase._compile_for_args.<locals>.error_rewrite(e, issue_type)
363 raise e
364 else:
--> 365 raise e.with_traceback(None)
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type pyobject
During: typing of argument at D:\ProgramData\environments\ds_test\lib\site-packages\datashader\resampling.py (494)
File "D:\ProgramData\environments\ds_test\lib\site-packages\datashader\resampling.py", line 494:
def _get_fill_value(fill_value, src, out):
<source elided>
@ngjit
^
During: Pass nopython_type_inference
This error may have been caused by the following argument(s):
- argument 0: Cannot determine Numba type of <class 'dask.array.core.Array'>
This pull request has been mentioned on HoloViz Discourse. There might be relevant details there:
https://discourse.holoviz.org/t/error-when-creating-plot-from-dask-xarray-using-latest-package-versions/8926/1
> I also created another new environment with the latest packages and without your change. I ran the same code and got the same @ngjit decorator error (that error message is the one I mistakenly copied in my earlier message). So this seems to be a separate issue? I will open a separate ticket for this.

I can reproduce the issue with uv and will investigate; I'm not entirely sure what is causing this problem. Can you open an issue so we don't forget about it? It is out of scope for this PR.
```bash
uv venv --python 3.13
uv pip install dask datashader
```
On your questions: dask should generally be smart enough not to exhaust your memory. I'm not sure whether the problem lies in datashader or dask, but what I'm doing so far in this PR is changing the chunk size to avoid the memory blow-up.
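To illustrate the general idea (a minimal sketch of my own, not the actual `compute_chunksize` implementation in this PR): from a memory budget you can derive a chunk size that no single resampling task can exceed, independent of the chunks set on the input array.

```python
import math

def bounded_chunksize(shape, itemsize, max_mem):
    """Illustrative only: pick a square chunk whose in-memory size stays
    under max_mem bytes, regardless of the input array's own chunks."""
    side = math.isqrt(max_mem // itemsize)  # largest square side within budget
    return (min(shape[0], side), min(shape[1], side))

# E.g. a 100_000 x 100_000 float64 array with a 128 MiB budget:
print(bounded_chunksize((100_000, 100_000), 8, 128 * 1024**2))  # (4096, 4096)
```

Which is roughly in line with your first guess above: the resampling step determines a suitable chunk size itself rather than inheriting yours.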
CodSpeed Instrumentation Performance Report
Merging #1448 will improve performance by 11.6%
Comparing fix_dont_crash (3863e28) with main (f44670c)
Summary
⚡ 2 improvements
✅ 41 untouched benchmarks
Benchmarks breakdown
|   | Benchmark | BASE | HEAD | Change |
|---|---|---|---|---|
| ⚡ | test_quadmesh_raster[256] | 15.9 ms | 14.3 ms | +11.57% |
| ⚡ | test_dask_raster[8192] | 3.9 s | 3.5 s | +11.6% |
> I can reproduce the issue with uv and will investigate; I'm not entirely sure what is causing this problem. Can you open an issue so we don't forget about it? It is out of scope for this PR.
It is because you also need to install pyarrow. I will consider how to improve this information, as it is not obvious.
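(With the uv environment above, that is `uv pip install pyarrow`.)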