python-bigquery-pandas
Geometry column
https://geopandas.org/docs/reference/api/geopandas.GeoDataFrame.html lists to_gbq, but I think it just inherits pandas-gbq; are there any plans to support a geometry column here, or should this be handled in geopandas?
Right now, I think the geometry column gets converted to a string in the schema.
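To illustrate, a minimal sketch of what I mean (dataset.tablename and my-project are placeholders):
import geopandas as gpd
import pandas_gbq
from shapely.geometry import Point

gdf = gpd.GeoDataFrame({"name": ["origin"], "geometry": [Point(0, 0)]})

# The geometry column lands as STRING, not GEOGRAPHY, in the generated schema
pandas_gbq.to_gbq(gdf, "dataset.tablename", project_id="my-project")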
Now that we have some geometry support in the google-cloud-bigquery library, I think this is a reasonable request. We may just need to pass geography_as_object=True if available (i.e., the necessary library versions are installed). https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.table.RowIterator.html#google.cloud.bigquery.table.RowIterator.to_dataframe
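For reference, a rough sketch of what that flag does on the read side (assuming a google-cloud-bigquery version new enough to support it, with shapely installed; the query is a placeholder):
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query("SELECT geo FROM dataset.tablename").result()

# With geography_as_object=True, GEOGRAPHY columns come back as
# shapely geometry objects rather than WKT strings.
df = rows.to_dataframe(geography_as_object=True)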
I think I just hit a regression because of this. Loading a shapefile into BigQuery with geopandas worked back on August 5th, 2021, but now fails with:
...
File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas/core/frame.py", line 2054, in to_gbq
gbq.to_gbq(
File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas/io/gbq.py", line 212, in to_gbq
pandas_gbq.to_gbq(
File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas_gbq/gbq.py", line 1148, in to_gbq
connector.load_data(
File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas_gbq/gbq.py", line 565, in load_data
chunks = load.load_chunks(
File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas_gbq/load.py", line 237, in load_chunks
load_parquet(
File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas_gbq/load.py", line 129, in load_parquet
client.load_table_from_dataframe(
File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/google/cloud/bigquery/client.py", line 2671, in load_table_from_dataframe
_pandas_helpers.dataframe_to_parquet(
File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 591, in dataframe_to_parquet
arrow_table = dataframe_to_arrow(dataframe, bq_schema)
File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 534, in dataframe_to_arrow
bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 292, in bq_to_arrow_array
return pyarrow.Array.from_pandas(series, type=arrow_type)
File "pyarrow/array.pxi", line 913, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 311, in pyarrow.lib.array
File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Input object was not a NumPy array
Some of the conda-forge packages in use:
...
geopandas 0.10.2 pyhd8ed1ab_1 conda-forge
geopandas-base 0.10.2 pyha770c72_1 conda-forge
google-cloud-bigquery 2.34.1 pyh6c4a22f_0 conda-forge
google-cloud-bigquery-core 2.34.1 pyh6c4a22f_0 conda-forge
google-cloud-bigquery-storage 2.11.0 pyh6c4a22f_0 conda-forge
google-cloud-bigquery-storage-core 2.11.0 pyh6c4a22f_0 conda-forge
pandas-gbq 0.17.1 pyh6c4a22f_0 conda-forge
pyarrow 6.0.1 py39h4d6536f_5_cpu conda-forge
shapely 1.8.0 py39hbfbc381_5 conda-forge
...
Looks like geometry is getting mapped to a string here, and so all the juicy GEOGRAPHY-detecting code in python-bigquery gets ignored.
Is fixing this a matter of getting this mapping to map geopandas types to GEOGRAPHY? It looks like python-bigquery is doing something similar to detect pandas types here, but it is using dtype.name instead of dtype.kind as a key; that might be a more robust approach?
I might be able to put together a PR if I can get guidance or confirmation on this from someone in the know.
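To make the dtype.name vs. dtype.kind distinction concrete, a quick illustration (assuming geopandas and shapely are installed):
import geopandas as gpd
from shapely.geometry import Point

gdf = gpd.GeoDataFrame({"geometry": [Point(0, 0)]})

# kind collapses the geometry extension dtype into a generic object...
print(gdf["geometry"].dtype.kind)  # 'O'
# ...while name keeps enough information to map it to GEOGRAPHY
print(gdf["geometry"].dtype.name)  # 'geometry'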
I got good help elsewhere and wanted to follow up with a solution/workaround.
My issue can be resolved by converting geopandas' geometry column to a WKT column in a vanilla pandas DataFrame and then passing the vanilla DataFrame to BQ.
# Rebuild the GeoDataFrame as a plain pandas DataFrame, converting the
# geometry column to WKT strings so BigQuery accepts the upload.
working_shapefile = pd.DataFrame({
    col: (shapefile[col] if col != 'geometry' else [g.wkt for g in shapefile[col].values])
    for col in shapefile.columns.values
})
Sorry for the spam. Hopefully, this will help someone in the future.
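Worth noting: geopandas 0.10+ has an equivalent built-in, GeoDataFrame.to_wkt(), which returns a plain DataFrame with every geometry column encoded as WKT, so the dict comprehension above can be replaced with:
# Same result as the dict comprehension above
working_shapefile = shapefile.to_wkt()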
Here is some extra input which might be helpful in case geometry support is added.
The solution by @brews worked for a few rows in my case, but I needed to manually specify the schema to get BigQuery to parse the geometry column as GEOGRAPHY. This is what I used:
import geopandas as gpd
from google.cloud import bigquery
client = bigquery.Client()
table_id = "dataset.tablename"
df = gpd.read_file("my_file")
# Determine the schema: map numpy dtype kinds to BigQuery types,
# forcing the geometry column to GEOGRAPHY.
type_dict = {
    'b': 'BOOLEAN',
    'i': 'INTEGER',
    'f': 'FLOAT',
    'O': 'STRING',
    'S': 'STRING',
    'U': 'STRING'
}
schema = [
    {'name': col_name, 'type': "GEOGRAPHY" if col_name == "geometry" else type_dict.get(col_type.kind, 'STRING')}
    for (col_name, col_type) in df.dtypes.items()
]
# https://cloud.google.com/bigquery/docs/pandas-gbq-migration#loading_a_pandas_dataframe_to_a_table
job_config = bigquery.LoadJobConfig(schema=schema)
job = client.load_table_from_dataframe(
    df.to_wkt(),  # same output as the WKT workaround above
    table_id,
    job_config=job_config
)
job.result()
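As a side note, the dict entries in schema above are shorthand that google-cloud-bigquery also accepts as explicit SchemaField objects, e.g. (same assumptions as the snippet above):
schema = [
    bigquery.SchemaField(
        col_name,
        "GEOGRAPHY" if col_name == "geometry" else type_dict.get(col_type.kind, 'STRING')
    )
    for (col_name, col_type) in df.dtypes.items()
]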
This worked for a small dataset, but for a larger set I quickly started running into all kinds of errors where BigQuery would not accept the polygons:
Invalid polygon loop: Edge 462 has duplicate vertex with edge 554
Invalid nesting: loop 1 should not contain loop 0
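As an aside, shapely can point at the offending geometries before upload (a sketch using shapely's explain_validity; df and its geometry column as in the snippets above):
from shapely.validation import explain_validity

# Identify which geometries BigQuery is likely to reject
invalid = df[~df.geometry.is_valid]
for idx, geom in invalid.geometry.items():
    print(idx, explain_validity(geom))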
I found a potential solution using GeoJSON instead of WKT: https://stackoverflow.com/questions/62233152/uploading-to-bigquery-gis-invalid-nesting-loop-1-should-not-contain-loop-0
After quite some experimentation, I found a way to create a DataFrame that BigQuery seemed to accept, similar to the solution above:
# Serialize each geometry to a GeoJSON string instead of WKT.
df_json = pd.DataFrame({
    col: (df[col] if col != 'geometry' else df[col].map(lambda x: json.dumps(shapely.geometry.mapping(x))))
    for col in df
})
This led to another error, similar to the one before:
Invalid geography value for column 'geometry', error: Polygon's first loop must be shell. It is nested in loop 2
I then found that BigQuery's ST_GEOGFROMTEXT has an option to fix this type of data, make_valid => TRUE: https://gis.stackexchange.com/a/376870
Now I have a fully working solution. It is relatively simple, but it requires overwriting the table after creation:
import geopandas as gpd
from google.cloud import bigquery

client = bigquery.Client()
table_id = "dataset.tablename"
df = gpd.read_file("my_file")

# Upload the geometry column as WKT strings first...
df.to_wkt().to_gbq(table_id, if_exists="replace")

# ...then rewrite the table in place, parsing the WKT into GEOGRAPHY
# with make_valid => TRUE to repair invalid polygons.
cols = ",".join(
    "st_geogfromtext(geometry, make_valid => TRUE) as geometry" if col == "geometry" else col
    for col in df
)
query = f"CREATE OR REPLACE TABLE {table_id} AS SELECT {cols} FROM {table_id}"
# print(query)
query_job = client.query(query)
query_job.result()  # wait for the rewrite to finish
I thought I'd share this solution in case others have similar issues.
It would be nice if make_valid=True could somehow be set in df.to_gbq or in bigquery.LoadJobConfig, but I have no idea whether the architecture allows this.
EDIT: I learned that shapely also has make_valid. This still didn't work with to_wkt because of the issue mentioned before, but it does work with the GeoJSON approach. This solution worked for all the data I tested as well:
import json
import geopandas as gpd
import pandas as pd
import shapely.geometry
from shapely.validation import make_valid
from google.cloud import bigquery

client = bigquery.Client()
table_id = "dataset.tablename"
df = gpd.read_file("my_file")
# Determine the schema: map numpy dtype kinds to BigQuery types,
# forcing the geometry column to GEOGRAPHY.
type_dict = {
    'b': 'BOOLEAN',
    'i': 'INTEGER',
    'f': 'FLOAT',
    'O': 'STRING',
    'S': 'STRING',
    'U': 'STRING'
}
schema = [
    {'name': col_name, 'type': "GEOGRAPHY" if col_name == "geometry" else type_dict.get(col_type.kind, 'STRING')}
    for (col_name, col_type) in df.dtypes.items()
]
# Repair invalid geometries with shapely's make_valid, then serialize
# each to a GeoJSON string.
df_json = pd.DataFrame({
    col: (df[col] if col != 'geometry' else df[col].map(lambda x: json.dumps(shapely.geometry.mapping(make_valid(x)))))
    for col in df
})
job_config = bigquery.LoadJobConfig(schema=schema)
job = client.load_table_from_dataframe(
    df_json,
    table_id,
    job_config=job_config
)
job.result()
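If you adapt this, a quick sanity check after job.result() confirms the column actually landed as GEOGRAPHY (client.get_table is standard google-cloud-bigquery API):
# Verify the resulting BigQuery column types
table = client.get_table(table_id)
for field in table.schema:
    print(field.name, field.field_type)  # 'geometry' should be GEOGRAPHY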