aws-sdk-pandas
aws-sdk-pandas copied to clipboard
Add support for geospatial queries in Athena
AWS Athena provides support for querying geospatial data. The response can contain specific geometric datatypes, which are not presently supported by AWS Wrangler/Pandas.
Geopandas is a geospatial-aware version of Pandas. In theory it should be possible use Geopandas to handle geospatial queries in AWS Wrangler.
This would be extremely helpful when working with large amounts of geospatial data via Athena.
Thanks, looks like geopandas would handle the heavy lifting by extending OOB pandas data types. I'll have a look but on the first glance looks like we'll just need to update datatype mappings
Link https://github.com/geopandas/geopandas/issues/680
I've had some time to look into how this could work.
On a basic level, the only thing necessary in this library is an addition to the data types, as you said. So in awswrangler/_data_types.py, in the athena2pandas function add something like:
if dtype == "geometry":
return "geometry"
The rest of it could be done simply in the implementation. Geopandas does not have an implementation of _from_sequence_of_strings, so would need to add this somehow:
from geopandas.array import GeometryArray
from shapely.wkt import loads as shapely_loads
def geopandas_from_sequence_of_strings(cls, strings, dtype, copy: bool = False):
return cls._from_sequence([shapely_loads(s) for s in strings], dtype=dtype, copy=copy )
GeometryArray._from_sequence_of_strings = classmethod(geopandas_from_sequence_of_strings)
I'm not certain if this is an addition which I should try and get added to Geopandas itself; seems like an implementation somewhat specific to this use-case.
Lastly with the above in place, you can just use the normal AWSWrangler API to query Athena, and load the result into a GeoDataFrame:
import awswrangler as wr
import geopandas as gpd
my_query = "SELECT * FROM..." # some GIS query
my_database = "mydb"
df = wr.athena.read_sql_query(sql=my_query,
database=my_database,
ctas_approach=False) # Think it only works with this approach
gdf = gpd.GeoDataFrame(df, geometry="my_gis_column_name")
So other than the type addition at the top, I'm not sure now if all of this needs to be in AWSWrangler.
Thoughts?
I've gotten this to work by going from binary shapefile to geometry to WKT representation in athena, and then using shapely to parse the WKT back to a geometry. It's weird, but it doesn't perform too badly.
Selecting binary geometry:
ST_asText(geometry_from_hadoop_shape(geometry)) as geometry
Querying and getting a dataframe with wr.athena.read_sql_query()
Calling shapely.from_wkt with pandas: df["geometry"].apply(lambda x: from_wkt(x))
Loading into geopandas: geopandas.GeoDataFrame(df)
e.g.
from shapely import from_wkt
import awswrangler as wr
import geopandas as gpd
q = 'select ST_asText(geometry_from_hadoop_shape(geometry)) as geometry from table'
df = wr.athena.read_sql_query(q, database='datalake')
df['geometry'] = df["geometry"].apply(lambda x: from_wkt(x))
gdf = gpd.GeoDataFrame(df)
Hope this helps someone as a workaround
HI @colelloa thanks, there is a draft PR open for a while with a similar approach: https://github.com/aws/aws-sdk-pandas/pull/2346
Merged.