Geopandas vs pandas for points
Points are now backed by lazy dataframes (Dask DataFrame). We discussed allowing both in-memory dataframes and lazy dataframes. https://github.com/scverse/spatialdata/issues/153
What about using `GeoDataFrame` and Dask `GeoDataFrame` instead? This would allow for:
- lazy loading
- unify points and circles (circles are in reality points to which we add a radius column via the schema) https://github.com/scverse/spatialdata/issues/46
- spatial index
- the user has to convert to `GeoDataFrame` anyway to exploit the `geopandas` functions; these functions are cumbersome to call (see below), but even worse, the user may be tempted to, or may expect to, save points in the `SpatialData` object as a `GeoDataFrame`.
Drawbacks:
- performance may not be good enough: `GeoDataFrame` seems to create one Python object per row, and we should aim at 500M-1B points.
- more complex type for the user
- 3D is only partially supported by `geopandas`; we need to understand the implications. Maybe it's fine if the user can only run `geopandas` queries on the 2D component of the data and uses our APIs for 3D queries. In the end, if we use Dask dataframes we still need to implement these queries ourselves, so at worst we get the extra query APIs from `geopandas` for free.
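The point/circle unification mentioned above can be sketched with plain pandas; this is only an illustration of the idea (the column names here are illustrative, not the actual schema):

```python
import pandas as pd

# Sketch: circles are points plus a "radius" column added via the schema.
points = pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [0.0, 1.0, 0.5]})
circles = points.assign(radius=[0.5, 0.7, 0.2])

# Same element type underneath, with one extra column for circles.
print(list(circles.columns))
```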
Functions to convert back and forth between points represented as Dask dataframes and geopandas dataframes (edit: an improved version of these functions is now available in the library, but their use is still cumbersome: func 1, func 2).
```python
import dask.dataframe as dd
import geopandas
from dask.dataframe import DataFrame as DaskDataFrame
from geopandas import GeoDataFrame
from spatialdata.models import PointsModel


def points_dask_dataframe_to_geopandas(points: DaskDataFrame) -> GeoDataFrame:
    # let's ignore the z component here; compute to pandas before building the geometry
    points = points.compute()
    points_gdf = GeoDataFrame(geometry=geopandas.points_from_xy(points["x"], points["y"]))
    for c in points.columns:
        points_gdf[c] = points[c].values
    return points_gdf


def points_geopandas_to_dask_dataframe(gdf: GeoDataFrame) -> DaskDataFrame:
    # extract the coordinates from the geometry column into plain columns
    df = gdf[gdf.columns.drop("geometry")].copy()
    df["x"] = gdf.geometry.x
    df["y"] = gdf.geometry.y
    # convert to a Dask DataFrame and parse into a valid points element via the schema
    ddf = dd.from_pandas(df, npartitions=1)
    ddf = PointsModel.parse(ddf, coordinates={"x": "x", "y": "y"})
    return ddf
```
Comment for the Basel hackathon
I would write some benchmarks that test the performance of loading the points into memory and of basic operations, comparing pandas vs Dask dataframes vs purely in-memory geopandas dataframes. In particular, it could be that one can handle hundreds of millions of points with pandas but not with geopandas; either way, it would be good to know when things break.
Note: #359 aims to investigate this further by comparing Dask dataframes vs dask-geopandas (in particular when spatial partitioning is used).
Practically, I would write the benchmarks without even needing to import spatialdata, as making a PR to support geopandas for points etc would take more time.
Operations to benchmark would be loading/computing the data, selection of points by a categorical value (e.g. all points for a gene), and selection of points by spatial location (e.g. bounding box query).
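A hypothetical skeleton for those benchmarks, using only pandas and NumPy as suggested (no spatialdata import; `n`, the column names, and the timing setup are all placeholders to be scaled up towards the 500M-1B target):

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000_000  # scale this up on real hardware
df = pd.DataFrame(
    {
        "x": rng.uniform(0, 100, n),
        "y": rng.uniform(0, 100, n),
        "gene": pd.Categorical(rng.integers(0, 500, n).astype(str)),
    }
)

t0 = time.perf_counter()
by_gene = df[df["gene"] == "42"]  # selection by categorical value
t1 = time.perf_counter()
bbox = df[df["x"].between(10, 20) & df["y"].between(10, 20)]  # bounding-box query
t2 = time.perf_counter()

print(f"categorical: {t1 - t0:.3f}s, bbox: {t2 - t1:.3f}s")
```

The same two selections would then be repeated on a Dask DataFrame and on a GeoDataFrame built from the same data, to see where each representation breaks down.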
dask-geopandas does not have feature parity with geopandas (https://github.com/geopandas/dask-geopandas/issues/130). Little beyond IO has been implemented so far; in particular it lacks nearest-neighbor search (which was my first, most obvious candidate for comparing the performance of the two).
- We would have to check whether the geopandas features we currently use are already implemented. Even so, it could limit us when adding features in the future.
- Otherwise, we would have to help `dask-geopandas` implement the missing features.