
Geopandas vs pandas for points

Open · LucaMarconato opened this issue 2 years ago · 2 comments

Points now use lazy dataframes (Dask DataFrame). We discussed allowing points to be kept in memory both as plain dataframes and as lazy dataframes. https://github.com/scverse/spatialdata/issues/153

What about using GeoDataFrame and Dask GeoDataFrame instead? This would allow for:

  • lazy loading
  • unifying points and circles (circles are in fact just points to which a radius column is added via the schema; see the sketch after this list) https://github.com/scverse/spatialdata/issues/46
  • spatial index
  • the user has to convert to GeoDataFrame anyway to exploit the geopandas functions; the conversion functions are cumbersome (see below), and even worse, the user may be tempted, or may expect, to be able to save points in the SpatialData object as a GeoDataFrame.
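A minimal sketch of the unification idea, using plain geopandas rather than spatialdata APIs (the data is made up; "radius" is the extra column the schema would add):

import geopandas
import numpy as np

# points are just a GeoDataFrame of 2D point geometries
coords = np.random.rand(1000, 2)
points_gdf = geopandas.GeoDataFrame(geometry=geopandas.points_from_xy(coords[:, 0], coords[:, 1]))

# circles are the same table with an extra "radius" column
circles_gdf = points_gdf.copy()
circles_gdf["radius"] = 5.0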

Drawbacks:

  • performance may not be good enough, since GeoDataFrame seems to create one Python object per row, and we should aim at handling 500M-1B points.
  • more complex type for the user
  • 3D is only partially supported by geopandas, and we need to understand the implications. Maybe it is fine if the user can only run geopandas queries on the 2D component of the data and has to use our APIs for 3D queries. In the end, if we use Dask dataframes we still need to implement these queries ourselves, so at worst we get the extra 2D query APIs for free from geopandas (see the sketch below).
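A sketch of the kind of 2D query that would come for free from geopandas (plain geopandas, illustrative data):

import geopandas
import numpy as np
from shapely.geometry import box

coords = np.random.rand(100_000, 2)
gdf = geopandas.GeoDataFrame(geometry=geopandas.points_from_xy(coords[:, 0], coords[:, 1]))

# bounding box query on the 2D component via the coordinate indexer
subset = gdf.cx[0.2:0.4, 0.2:0.4]

# the same query going through the spatial index explicitly
idx = gdf.sindex.query(box(0.2, 0.2, 0.4, 0.4), predicate="intersects")
subset2 = gdf.iloc[idx]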

Functions to convert points back and forth between Dask dataframes and geopandas dataframes (edit: improved versions of these functions are now available in the library, but their use is still cumbersome: func 1, func 2).

import dask.dataframe as dd
import geopandas
from dask.dataframe import DataFrame as DaskDataFrame
from geopandas import GeoDataFrame

from spatialdata.models import PointsModel


def points_dask_dataframe_to_geopandas(points: DaskDataFrame) -> GeoDataFrame:
    # let's ignore the z component here
    df = points.compute()
    points_gdf = GeoDataFrame(df, geometry=geopandas.points_from_xy(df["x"], df["y"]))
    return points_gdf


def points_geopandas_to_dask_dataframe(gdf: GeoDataFrame) -> DaskDataFrame:
    # extract the coordinates from the geometry column and drop the geometry
    df = gdf.drop(columns="geometry").assign(x=gdf.geometry.x, y=gdf.geometry.y)
    # convert to a (single-partition) Dask DataFrame
    ddf = dd.from_pandas(df, npartitions=1)
    # parse with the points schema
    ddf = PointsModel.parse(ddf, coordinates={"x": "x", "y": "y"})
    return ddf
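For illustration, a round trip with the two functions above on a toy table (the gene column is made up):

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [0.0, 1.0, 4.0], "gene": ["a", "b", "a"]})
ddf = dd.from_pandas(df, npartitions=1)

gdf = points_dask_dataframe_to_geopandas(ddf)
ddf_back = points_geopandas_to_dask_dataframe(gdf)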

LucaMarconato · Apr 27 '23 11:04

Comment for the Basel hackathon

I would write some benchmarks that test the performance of loading the points into memory and doing some basic operations, comparing pandas vs dask dataframes vs purely in-memory geopandas dataframes. In particular, it could be that one can deal with hundreds of millions of points with pandas but not with geopandas; either way, it would be good to know when things break.

Note: #359 aims to investigate this further by comparing dask dataframe vs dask-geopandas (in particular when spatial partitioning is used).

In practice, I would write the benchmarks without even needing to import spatialdata, since making a PR to add geopandas support for points etc. would take more time.

Operations to benchmark would be loading/computing the data, selecting points by a categorical value (e.g. all the points for a gene), and selecting points by spatial location (e.g. a bounding box query).
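A rough benchmark sketch along these lines, using only pandas, dask and geopandas (sizes, column names and the gene encoding are illustrative):

import time

import dask.dataframe as dd
import geopandas
import numpy as np
import pandas as pd

n = 1_000_000  # scale up towards 500M-1B to see where each representation breaks
df = pd.DataFrame({
    "x": np.random.rand(n),
    "y": np.random.rand(n),
    "gene": pd.Categorical(np.random.randint(0, 500, n).astype(str)),
})
ddf = dd.from_pandas(df, npartitions=16)
gdf = geopandas.GeoDataFrame(df, geometry=geopandas.points_from_xy(df["x"], df["y"]))


def timeit(label, fn):
    t0 = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - t0:.2f}s")


# selection by a categorical value (all the points for a gene)
timeit("pandas categorical", lambda: df[df["gene"] == "42"])
timeit("dask categorical", lambda: ddf[ddf["gene"] == "42"].compute())
timeit("geopandas categorical", lambda: gdf[gdf["gene"] == "42"])

# selection by spatial location (bounding box query)
x0, y0, x1, y1 = 0.25, 0.25, 0.5, 0.5
timeit("pandas bbox", lambda: df[df["x"].between(x0, x1) & df["y"].between(y0, y1)])
timeit("dask bbox", lambda: ddf[ddf["x"].between(x0, x1) & ddf["y"].between(y0, y1)].compute())
timeit("geopandas bbox", lambda: gdf.cx[x0:x1, y0:y1])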

LucaMarconato · Oct 31 '24 16:10

dask-geopandas does not have feature parity with geopandas (https://github.com/geopandas/dask-geopandas/issues/130). It has not implemented much beyond IO yet; in particular, it lacks nearest-neighbor search (which was my first and most obvious candidate for comparing the performance of the two; see the sketch after the list below).

  • We would have to check whether the geopandas features we currently use are already implemented. Even so, it could limit us when adding features in the future.
  • Otherwise, we would have to help dask-geopandas implement the missing features.
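For reference, the nearest-neighbor search mentioned above is available in plain geopandas (via sjoin_nearest or the sindex.nearest tree query), just not in dask-geopandas at the time of writing; a sketch with illustrative data:

import geopandas
import numpy as np

left = geopandas.GeoDataFrame(geometry=geopandas.points_from_xy(np.random.rand(1000), np.random.rand(1000)))
right = geopandas.GeoDataFrame(geometry=geopandas.points_from_xy(np.random.rand(100), np.random.rand(100)))

# nearest-neighbor join (geopandas >= 0.10)
joined = geopandas.sjoin_nearest(left, right, distance_col="distance")

# lower-level alternative through the spatial index
left_idx, right_idx = right.sindex.nearest(left.geometry, return_all=False)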

aeisenbarth · Mar 25 '25 14:03