
`dataframe_to_arrow` Returns a table that doesn't convert geopandas index correctly

Open petern48 opened this issue 5 months ago • 3 comments

There is a lot of text below, so I'll highlight the main difference first: our version has an extra level of nested [ ], one per row.

# Our dataframe_to_arrow returns the following column
geometry: [[0101...F03F],[0101...0040]]

# But geopandas returns this.
geometry: [[0101...F03F,0101...0040]]

This happens for the index column (__index_level_0__) too, which leads to it being misinterpreted as a regular column instead of being read in as the index when calling gpd.GeoDataFrame.from_arrow().

# Sedona returns
   __index_level_0__     geometry
0                  1  POINT (1 1)
1                  2  POINT (2 2) 

# Geopandas returns this
      geometry
1  POINT (1 1)
2  POINT (2 2)

Full script and output below.

import geopandas as gpd
import pandas as pd
import pyarrow as pa
import sedona.geopandas as sgpd
from sedona.spark.geoarrow.geoarrow import dataframe_to_arrow
from shapely.geometry import Point

# Build the Arrow table through Sedona
sgpd_df = sgpd.GeoDataFrame({"geometry": [Point(1, 1), Point(2, 2)]}, index=pd.Index([1, 2]))
spark_df = sgpd_df._internal.spark_frame.drop("__natural_order__")  # don't worry about this drop
sgpd_arrow = dataframe_to_arrow(spark_df)

# Build the reference Arrow table through geopandas
gpd_df = gpd.GeoDataFrame({"geometry": [Point(1, 1), Point(2, 2)]}, index=pd.Index([1, 2]))
gpd_arrow = pa.table(gpd_df.to_arrow())
assert type(sgpd_arrow) == type(gpd_arrow) == pa.Table

# Print each table and the GeoDataFrame that geopandas reads back from it
print("SEDONA\n", sgpd_arrow, "\n")
gpd_df_from_sgpd_arrow = gpd.GeoDataFrame.from_arrow(sgpd_arrow)
print(gpd_df_from_sgpd_arrow, "\n")
print("GEOPANDAS\n", gpd_arrow, "\n")
gpd_df_from_gpd_arrow = gpd.GeoDataFrame.from_arrow(gpd_arrow)
print(gpd_df_from_gpd_arrow)

Output:
SEDONA
 pyarrow.Table
__index_level_0__: int64
geometry: extension<geoarrow.wkb<WkbType>>
----
__index_level_0__: [[1],[2]]
geometry: [[0101000000000000000000F03F000000000000F03F],[010100000000000000000000400000000000000040]] 

   __index_level_0__     geometry
0                  1  POINT (1 1)
1                  2  POINT (2 2) 

GEOPANDAS
 pyarrow.Table
geometry: extension<geoarrow.wkb<WkbType>>
__index_level_0__: int64
----
geometry: [[0101000000000000000000F03F000000000000F03F,010100000000000000000000400000000000000040]]
__index_level_0__: [[1,2]] 

      geometry
1  POINT (1 1)
2  POINT (2 2)
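
Until this is fixed, a possible stopgap on the caller's side is to promote the column back to the index after reading the table. The sketch below uses a plain pandas DataFrame with string stand-ins for the geometries; `restore_index` is a hypothetical helper, not part of Sedona or GeoPandas.

```python
import pandas as pd


def restore_index(df: pd.DataFrame, index_col: str = "__index_level_0__") -> pd.DataFrame:
    """Hypothetical helper: promote a flattened index column back to the index."""
    if index_col not in df.columns:
        return df
    out = df.set_index(index_col)
    # Drop the placeholder name so the index is unnamed, as in the GeoPandas output
    out.index.name = None
    return out


# Stand-in for the frame read back from the Sedona table above
flattened = pd.DataFrame(
    {"__index_level_0__": [1, 2], "geometry": ["POINT (1 1)", "POINT (2 2)"]}
)
fixed = restore_index(flattened)
print(fixed)
```

This only patches the symptom on the resulting frame; the table returned by `dataframe_to_arrow` is unchanged.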

petern48, Jul 22 '25 18:07