
Add daskified example to bay_trimesh?

Open · rsignell-usgs opened this issue on Feb 28, 2020 · 2 comments

I see from the table on the datashader performance page that trimesh has support for dask, and there is also a datashader trimesh notebook with a dask example.

It would be great to add a similar dask example after the last cell in https://github.com/pyviz-topics/examples/blob/master/bay_trimesh/bay_trimesh.ipynb to dask-enable this call:

datashade(hv.TriMesh((tris, points)), aggregator=ds.mean('z'), precompute=True)

When I try it, I get:

NotImplementedError: Dask dataframe does not support assigning non-scalar value.

My full notebook is here:
https://nbviewer.jupyter.org/gist/rsignell-usgs/cf853d43fd5e53ba90fbd4e0b9eb3da7
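
Roughly, the attempt looks like this (a sketch, not the exact notebook code; it assumes tris and verts are pandas DataFrames holding the triangle indices and the vertex x/y/z columns, and npartitions=4 is an arbitrary choice):

import dask.dataframe as dd
import datashader as ds
import holoviews as hv
from holoviews.operation.datashader import datashade

# wrap the vertex and triangle tables in dask dataframes
verts_ddf = dd.from_pandas(verts, npartitions=4)
tris_ddf = dd.from_pandas(tris, npartitions=4)

points = hv.Points(verts_ddf, vdims=['z'])
datashade(hv.TriMesh((tris_ddf, points)), aggregator=ds.mean('z'), precompute=True)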

rsignell-usgs · Feb 28 '20

It's failing at https://github.com/holoviz/holoviews/blob/4dcddcc3a8a278dea550147def62f46ecd3d5d1d/holoviews/core/data/dask.py#L272-L277, where holoviews is trying to insert a numpy array into a dask dataframe.

df['index'] = np.arange(n)

but assigning a plain NumPy array to a Dask DataFrame fails. We need to wrap the NumPy array in a Dask Array before inserting it. The tricky part is making sure that the chunks of the array match the partitions of the dataframe.
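
A minimal reproduction of that assignment failure, outside of holoviews (the toy frame is made up; the behaviour matches dask releases from around early 2020, and newer dask versions may handle this differently):

import numpy as np
import pandas as pd
import dask.dataframe as dd

# toy stand-in for the dataframe holoviews builds internally
ddf = dd.from_pandas(pd.DataFrame({'z': np.random.rand(10)}), npartitions=2)

# mirrors the df['index'] = np.arange(n) line in holoviews' dask interface;
# with dask from that era this raises:
# NotImplementedError: Dask dataframe does not support assigning non-scalar value.
ddf['index'] = np.arange(len(ddf))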

TomAugspurger · Feb 28 '20

Here's the basic idea

In [1]: import dask.dataframe as dd

In [2]: import dask

In [4]: import dask.array as da

In [5]: import numpy as np

In [6]: ts = dask.datasets.timeseries(end='2000-01-05')

In [7]: ts
Out[7]:
Dask DataFrame Structure:
                  id    name        x        y
npartitions=4
2000-01-01     int64  object  float64  float64
2000-01-02       ...     ...      ...      ...
2000-01-03       ...     ...      ...      ...
2000-01-04       ...     ...      ...      ...
2000-01-05       ...     ...      ...      ...
Dask Name: make-timeseries, 4 tasks

In [8]: index = np.arange(len(ts))

In [9]: chunks = tuple(ts.map_partitions(len).compute())

In [10]: ts['index'] = da.from_array(index, chunks)

In [11]: ts
Out[11]:
Dask DataFrame Structure:
                  id    name        x        y  index
npartitions=4
2000-01-01     int64  object  float64  float64  int64
2000-01-02       ...     ...      ...      ...    ...
2000-01-03       ...     ...      ...      ...    ...
2000-01-04       ...     ...      ...      ...    ...
2000-01-05       ...     ...      ...      ...    ...
Dask Name: assign, 21 tasks

The unfortunate part is the chunks = tuple(ts.map_partitions(len).compute()), which forces an eager pass over the data to get the partition lengths. I think that's unavoidable, though, since we need the array chunks to align with the dataframe partitions. Hopefully that's not too expensive (or maybe holoviews already gets the chunk sizes elsewhere?)
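
One possible way to sidestep the eager map_partitions(len).compute() step (an untested idea, not something from this thread) is to build the 0..n-1 index lazily with a cumulative sum over a constant column:

import dask

ts = dask.datasets.timeseries(end='2000-01-05')

# assign a constant, then cumsum across partitions to get 0, 1, ..., n-1
# without computing the per-partition lengths up front
ts['index'] = 1
ts['index'] = ts['index'].cumsum() - 1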

TomAugspurger · Feb 28 '20