aws-sdk-pandas icon indicating copy to clipboard operation
aws-sdk-pandas copied to clipboard

Add support for geospatial queries in Athena

Open ghinch opened this issue 3 years ago • 1 comments

AWS Athena provides support for querying geospatial data. The response can contain specific geometric datatypes, which are not presently supported by AWS Wrangler/Pandas.

Geopandas is a geospatial-aware version of Pandas. In theory it should be possible use Geopandas to handle geospatial queries in AWS Wrangler.

This would be extremely helpful when working with large amounts of geospatial data via Athena.

ghinch avatar Aug 30 '22 15:08 ghinch

Thanks, looks like geopandas would handle the heavy lifting by extending OOB pandas data types. I'll have a look but on the first glance looks like we'll just need to update datatype mappings

kukushking avatar Sep 06 '22 12:09 kukushking

Link https://github.com/geopandas/geopandas/issues/680

kukushking avatar Dec 06 '22 13:12 kukushking

I've had some time to look into how this could work.

On a basic level, the only thing necessary in this library is an addition to the data types, as you said. So in awswrangler/_data_types.py, in the athena2pandas function add something like:

if dtype == "geometry":
    return "geometry"

The rest of it could be done simply in the implementation. Geopandas does not have an implementation of _from_sequence_of_strings, so would need to add this somehow:

from geopandas.array import GeometryArray
from shapely.wkt import loads as shapely_loads

def geopandas_from_sequence_of_strings(cls, strings, dtype, copy: bool = False):
    return cls._from_sequence([shapely_loads(s) for s in strings], dtype=dtype, copy=copy )

GeometryArray._from_sequence_of_strings = classmethod(geopandas_from_sequence_of_strings)

I'm not certain if this is an addition which I should try and get added to Geopandas itself; seems like an implementation somewhat specific to this use-case.

Lastly with the above in place, you can just use the normal AWSWrangler API to query Athena, and load the result into a GeoDataFrame:

import awswrangler as wr
import geopandas as gpd

my_query = "SELECT * FROM..." # some GIS query
my_database = "mydb"

df = wr.athena.read_sql_query(sql=my_query,
                              database=my_database,
                              ctas_approach=False) # Think it only works with this approach

gdf = gpd.GeoDataFrame(df, geometry="my_gis_column_name")

So other than the type addition at the top, I'm not sure now if all of this needs to be in AWSWrangler.

Thoughts?

ghinch avatar Dec 21 '22 13:12 ghinch

I've gotten this to work by going from binary shapefile to geometry to WKT representation in athena, and then using shapely to parse the WKT back to a geometry. It's weird, but it doesn't perform too badly.

Selecting binary geometry: ST_asText(geometry_from_hadoop_shape(geometry)) as geometry

Querying and getting a dataframe with wr.athena.read_sql_query()

Calling shapely.from_wkt with pandas: df["geometry"].apply(lambda x: from_wkt(x))

Loading into geopandas: geopandas.GeoDataFrame(df)

e.g.

from shapely import from_wkt
import awswrangler as wr
import geopandas as gpd

q = 'select ST_asText(geometry_from_hadoop_shape(geometry)) as geometry from table'
df = wr.athena.read_sql_query(q, database='datalake')
df['geometry'] = df["geometry"].apply(lambda x: from_wkt(x))
gdf = gpd.GeoDataFrame(df)

Hope this helps someone as a workaround

colelloa avatar Aug 02 '23 04:08 colelloa

HI @colelloa thanks, there is a draft PR open for a while with a similar approach: https://github.com/aws/aws-sdk-pandas/pull/2346

kukushking avatar Aug 02 '23 08:08 kukushking

Merged.

kukushking avatar Sep 06 '23 09:09 kukushking