geocube icon indicating copy to clipboard operation
geocube copied to clipboard

Support dask-geopandas

Open davidbrochart opened this issue 2 years ago • 3 comments

@snowman2 mentioned using dask-geopandas to rasterize in chunks using Dask. I just tried and it doesn't seem to work out of the box. Here is an example:

from functools import partial

import geopandas
import dask_geopandas
from rasterio.enums import MergeAlg
from geocube.api.core import make_geocube
from geocube.rasterize import rasterize_image

# download: wget https://raw.githubusercontent.com/drei01/geojson-world-cities/master/cities.geojson

df = geopandas.read_file(
    "cities.geojson",
    crs="epsg:4326",
)

ddf = dask_geopandas.from_geopandas(df, npartitions=4)
ddf["mask"] = 1

geo_grid = make_geocube(
    vector_data=ddf,
    resolution=(-0.01, 0.01),
    measurements=["mask"],
    rasterize_function=partial(rasterize_image, merge_alg=MergeAlg.add, all_touched=True),
    fill=0,
)
Traceback
---------------------------------------------------------------------------
VectorDataError                           Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 geo_grid = make_geocube(
      2     vector_data=ddf,
      3     resolution=(-0.05, 0.05),
      4     measurements=['mask'],
      5     rasterize_function=partial(rasterize_image, merge_alg=MergeAlg.add, all_touched=True),
      6     fill=0,
      7 )
      8 geo_grid

File ~/mambaforge/envs/xarray_leaflet/lib/python3.10/site-packages/geocube/api/core.py:99, in make_geocube(vector_data, measurements, datetime_measurements, output_crs, resolution, align, geom, like, fill, group_by, interpolate_na_method, categorical_enums, rasterize_function)
     35 """
     36 Rasterize vector data into an ``xarray`` object.  Each attribute will be a data
     37 variable in the :class:`xarray.Dataset`.
   (...)
     95 
     96 """
     97 geobox_maker = GeoBoxMaker(output_crs, resolution, align, geom, like)
---> 99 return VectorToCube(
    100     vector_data=vector_data,
    101     geobox_maker=geobox_maker,
    102     fill=fill,
    103     categorical_enums=categorical_enums,
    104 ).make_geocube(
    105     measurements=measurements,
    106     datetime_measurements=datetime_measurements,
    107     group_by=group_by,
    108     interpolate_na_method=interpolate_na_method,
    109     rasterize_function=rasterize_function,
    110 )

File ~/mambaforge/envs/xarray_leaflet/lib/python3.10/site-packages/geocube/vector_to_cube.py:79, in VectorToCube.__init__(self, vector_data, geobox_maker, fill, categorical_enums)
     53 def __init__(
     54     self,
     55     vector_data: Union[str, os.PathLike, geopandas.GeoDataFrame],
   (...)
     58     categorical_enums: Optional[Dict[str, List]],
     59 ):
     60     """
     61     Initialize the GeoCube class.
     62 
   (...)
     77 
     78     """
---> 79     self._vector_data = load_vector_data(vector_data)
     80     self._geobox = geobox_maker.from_vector(self._vector_data)
     81     self._grid_coords = affine_to_coords(
     82         self._geobox.affine, self._geobox.width, self._geobox.height
     83     )

File ~/mambaforge/envs/xarray_leaflet/lib/python3.10/site-packages/geocube/geo_utils/geobox.py:73, in load_vector_data(vector_data)
     71     raise VectorDataError("Empty GeoDataFrame.")
     72 if "geometry" not in vector_data.columns:
---> 73     raise VectorDataError(
     74         "'geometry' column missing. Columns in file: "
     75         f"{vector_data.columns.values.tolist()}"
     76     )
     78 # make sure projection is set
     79 if not vector_data.crs:

VectorDataError: 'geometry' column missing. Columns in file: [0, 1, 2]

davidbrochart avatar Jul 29 '22 14:07 davidbrochart

Sounds like a good idea. A MR adding support for dask-geopandas is welcome.

snowman2 avatar Aug 02 '22 17:08 snowman2

I question whether this is the best approach as my (admittedly very limited) understanding is that a dask-geopandas dataframe is chunked along an index rather than into pieces of equivalent (spatial) size, so a single large feature could blow things up.

jessjaco avatar Sep 21 '22 20:09 jessjaco

I think that's a fair concern, @jessjaco . I suspect an important preliminary step would be to perform a spatial_shuffle on the dask-geopandas GeoDataframe. I think it's not so much an issue of single large features, but more how your features are distributed over the larger region. Of course a single large feature could still trip you up, but perhaps that's then down to the user to manage as only they would know how they'd want to subdivide their features.

gtmaskall avatar Jan 04 '23 14:01 gtmaskall