
DaskGeoDataFrame and datashader incompatibility


Hello, I am running into an issue where using datashader on a DaskGeoDataFrame raises an error. To reproduce, I have the following poetry environment on Ubuntu 22.04.5 LTS:

python = ">=3.12,<3.13"
spatialpandas = "0.5.0"
dask = "2025.3.0"
datashader = "0.17.0"
numpy = "2.1.3"

I followed this blog post from HoloViz to set up the DaskGeoDataFrame; the code that triggers the error is below:

from pathlib import Path

from datashader import Canvas
from spatialpandas.dask import DaskGeoDataFrame
from spatialpandas.io import read_parquet_dask


def run():
    pq_file = Path(__file__).parent / "data" / "test.parq"

    gdf = read_parquet_dask(pq_file)
    assert isinstance(gdf, DaskGeoDataFrame)

    canvas = Canvas()
    # this points() call raises the ValueError shown below under the newer package set
    canvas.points(gdf, geometry="geometry")


if __name__ == "__main__":
    run()

This gives the following error:

Traceback (most recent call last):
  File "2025-03-27_minimal.py", line 54, in <module>
    run()
  File "2025-03-27_minimal.py", line 50, in run
    canvas.points(gdf, geometry="geometry")
  File "/home/titanium/.cache/pypoetry/virtualenvs/sandbox-datashader2-_RrFaDUd-py3.12/lib/python3.12/site-packages/datashader/core.py", line 229, in points
    return bypixel(source, self, glyph, agg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titanium/.cache/pypoetry/virtualenvs/sandbox-datashader2-_RrFaDUd-py3.12/lib/python3.12/site-packages/datashader/core.py", line 1351, in bypixel
    return bypixel.pipeline(source, schema, canvas, glyph, agg, antialias=antialias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titanium/.cache/pypoetry/virtualenvs/sandbox-datashader2-_RrFaDUd-py3.12/lib/python3.12/site-packages/datashader/utils.py", line 121, in __call__
    return lk[cls](head, *rest, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titanium/.cache/pypoetry/virtualenvs/sandbox-datashader2-_RrFaDUd-py3.12/lib/python3.12/site-packages/datashader/data_libraries/dask.py", line 42, in dask_pipeline
    return da.compute(dsk, scheduler=scheduler)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titanium/.cache/pypoetry/virtualenvs/sandbox-datashader2-_RrFaDUd-py3.12/lib/python3.12/site-packages/dask/base.py", line 656, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titanium/.cache/pypoetry/virtualenvs/sandbox-datashader2-_RrFaDUd-py3.12/lib/python3.12/site-packages/dask/local.py", line 455, in get_async
    raise ValueError("Found no accessible jobs in dask")
ValueError: Found no accessible jobs in dask

Process finished with exit code 1
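
As a stopgap, computing the Dask frame up front and aggregating the resulting in-memory GeoDataFrame should sidestep datashader's dask code path entirely. This is only a sketch and it loads the whole dataset into RAM, so it won't scale, but it does suggest the dask pipeline itself is the problem:

gdf_local = read_parquet_dask(pq_file).compute()  # DaskGeoDataFrame -> in-memory GeoDataFrame
canvas = Canvas()
# with a plain GeoDataFrame, datashader takes its pandas pipeline,
# so no dask scheduler is involved
agg = canvas.points(gdf_local, geometry="geometry")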

To get the code to work, I had to revert the packages to the following:

python = ">=3.12,<3.13"
spatialpandas = "0.4.10"
dask = "2024.12.1"
datashader = "0.17.0"
numpy = "1.26.4"

With these versions the script runs, and the only output is a batch of repeated FutureWarnings:

/home/titanium/.cache/pypoetry/virtualenvs/sandbox-datashader2-_RrFaDUd-py3.12/lib/python3.12/site-packages/dask/dataframe/__init__.py:49: FutureWarning: 
Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.

  warnings.warn(msg, FutureWarning)
/home/titanium/.cache/pypoetry/virtualenvs/sandbox-datashader2-_RrFaDUd-py3.12/lib/python3.12/site-packages/spatialpandas/io/parquet.py:353: FutureWarning: Passing 'use_legacy_dataset' is deprecated as of pyarrow 15.0.0 and will be removed in a future version.
  d = ParquetDataset(
/home/titanium/.cache/pypoetry/virtualenvs/sandbox-datashader2-_RrFaDUd-py3.12/lib/python3.12/site-packages/spatialpandas/io/parquet.py:137: FutureWarning: Passing 'use_legacy_dataset' is deprecated as of pyarrow 15.0.0 and will be removed in a future version.
  dataset = ParquetDataset(
# the same warning is repeated many times

Process finished with exit code 0
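
Those repeated pyarrow deprecation warnings can be quieted with a standard warnings filter if they get noisy; a minimal sketch (this hides only the 'use_legacy_dataset' message, so other FutureWarnings still show):

import warnings

# suppress only the repeated deprecation warning raised from
# spatialpandas/io/parquet.py; the message argument is a regex
# matched against the start of the warning text
warnings.filterwarnings(
    "ignore",
    message="Passing 'use_legacy_dataset' is deprecated",
    category=FutureWarning,
)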


I wasn't sure how to create an empty DaskGeoDataFrame for a self-contained example, so instead I generated the parquet file by downloading one of the CSV files mentioned in the HoloViz blog post above and running the script below:

from pathlib import Path

import dask.dataframe as dd
import numpy as np
from dask.diagnostics import ProgressBar
from spatialpandas import GeoDataFrame
from spatialpandas.geometry import PointArray


def lon_lat_to_easting_northing(longitude, latitude):
    # Web Mercator (EPSG:3857) projection, copied here to avoid a dependency on holoviews
    origin_shift = np.pi * 6378137
    easting = longitude * origin_shift / 180.0
    with np.errstate(divide="ignore", invalid="ignore"):
        northing = (
            np.log(np.tan((90 + latitude) * np.pi / 360.0)) * origin_shift / np.pi
        )
    return easting, northing


def convert_partition(df):
    east, north = lon_lat_to_easting_northing(
        df["LON"].astype("float32"), df["LAT"].astype("float32")
    )
    return GeoDataFrame({"geometry": PointArray((east, north))})


def convert_csv_to_gdf():
    base_dir = Path(__file__).parent / "data"
    csv_files = base_dir / "AIS_2020_01*.csv"

    pq_file = base_dir / "test.parq"
    # empty GeoDataFrame used as the `meta` schema for map_partitions
    example = GeoDataFrame({"geometry": PointArray([], dtype="float32")})

    with ProgressBar():
        print("Reading csv files")
        gdf = dd.read_csv(csv_files, assume_missing=True)
        gdf = gdf.map_partitions(convert_partition, meta=example)

        print("Writing parquet file")
        gdf = gdf.pack_partitions_to_parquet(pq_file, npartitions=64)

    return gdf


if __name__ == "__main__":
    convert_csv_to_gdf()

using the following versions:

python = ">=3.12,<3.13"
spatialpandas = "0.4.10"
dask = "2024.12.1"
datashader = "0.17.0"
numpy = "1.26.4"

This is not exactly a blocker, but it would be nice to be able to use up-to-date packages. Thank you!

ahnsws · Mar 27 '25 17:03