Geometry column

https://geopandas.org/docs/reference/api/geopandas.GeoDataFrame.html lists to_gbq, but I think it just inherits the pandas-gbq implementation. Are there any plans to support a geometry column here, or should this live in geopandas?

Right now I think the geometry column gets converted to a string in the schema.
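
For example (a minimal sketch; the table name is a placeholder):

import geopandas as gpd
from shapely.geometry import Point

gdf = gpd.GeoDataFrame({"id": [1]}, geometry=[Point(0, 0)])

# to_gbq is inherited from pandas.DataFrame, so the geometry column is not
# recognized and ends up as STRING in the resulting table schema.
gdf.to_gbq("dataset.tablename", if_exists="replace")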

andhuang-CLGX opened this issue Sep 29 '21

Now that we have some Geometry support in the google-cloud-bigquery library, I think this is a reasonable request.

We may just need to pass geography_as_object=True when the necessary library versions are installed: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.table.RowIterator.html#google.cloud.bigquery.table.RowIterator.to_dataframe
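
On the read side, that would look roughly like this (a sketch; assumes a recent google-cloud-bigquery with shapely installed, and a placeholder table with a GEOGRAPHY column named geom):

from google.cloud import bigquery

client = bigquery.Client()
rows = client.query("SELECT geom FROM `dataset.tablename`").result()

# With geography_as_object=True, GEOGRAPHY values come back as shapely
# geometry objects instead of WKT strings.
df = rows.to_dataframe(geography_as_object=True)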

tswast commented Sep 30 '21

I think I just hit a regression because of this. Loading a shapefile into BigQuery with geopandas worked back on August 5th, 2021, but now fails with:

...
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas/core/frame.py", line 2054, in to_gbq
    gbq.to_gbq(
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas/io/gbq.py", line 212, in to_gbq
    pandas_gbq.to_gbq(
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas_gbq/gbq.py", line 1148, in to_gbq
    connector.load_data(
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas_gbq/gbq.py", line 565, in load_data
    chunks = load.load_chunks(
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas_gbq/load.py", line 237, in load_chunks
    load_parquet(
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas_gbq/load.py", line 129, in load_parquet
    client.load_table_from_dataframe(
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/google/cloud/bigquery/client.py", line 2671, in load_table_from_dataframe
    _pandas_helpers.dataframe_to_parquet(
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 591, in dataframe_to_parquet
    arrow_table = dataframe_to_arrow(dataframe, bq_schema)
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 534, in dataframe_to_arrow
    bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 292, in bq_to_arrow_array
    return pyarrow.Array.from_pandas(series, type=arrow_type)
  File "pyarrow/array.pxi", line 913, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 311, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Input object was not a NumPy array

Some of the conda-forge packages in use:

...
geopandas                 0.10.2             pyhd8ed1ab_1    conda-forge
geopandas-base            0.10.2             pyha770c72_1    conda-forge
google-cloud-bigquery     2.34.1             pyh6c4a22f_0    conda-forge
google-cloud-bigquery-core 2.34.1             pyh6c4a22f_0    conda-forge
google-cloud-bigquery-storage 2.11.0             pyh6c4a22f_0    conda-forge
google-cloud-bigquery-storage-core 2.11.0             pyh6c4a22f_0    conda-forge
pandas-gbq                0.17.1             pyh6c4a22f_0    conda-forge
pyarrow                   6.0.1           py39h4d6536f_5_cpu    conda-forge
shapely                   1.8.0            py39hbfbc381_5    conda-forge
...

Looks like geometry is getting mapped to a string here, and so all the juicy GEOGRAPHY-detecting code in python-bigquery gets ignored.

Is fixing this a matter of updating this mapping so that geopandas types map to GEOGRAPHY? It looks like python-bigquery does something similar to detect pandas types here, but it keys on dtype.name instead of dtype.kind, which might be the more robust approach.
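
To illustrate the difference:

import geopandas as gpd
from shapely.geometry import Point

gdf = gpd.GeoDataFrame({"id": [1]}, geometry=[Point(0, 0)])

print(gdf["geometry"].dtype.kind)  # 'O', indistinguishable from a generic object column
print(gdf["geometry"].dtype.name)  # 'geometry', which could be mapped to GEOGRAPHY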

I might be able to put together a PR if I can get guidance or confirmation on this from someone in the know.

brews commented Mar 04 '22

I got good help elsewhere and wanted to follow up with a solution/workaround.

My issue can be resolved by converting geopandas' geometry column to a WKT column in a vanilla Pandas DataFrame and then passing the vanilla DataFrame to BQ.

    import pandas as pd

    # `shapefile` is the geopandas.GeoDataFrame loaded earlier; converting the
    # geometry column to WKT strings leaves a plain pandas DataFrame.
    working_shapefile = pd.DataFrame({
        col: (shapefile[col] if col != 'geometry' else [g.wkt for g in shapefile[col].values])
        for col in shapefile.columns.values
    })
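
The converted DataFrame can then be uploaded as usual (the table name is a placeholder):

    working_shapefile.to_gbq("dataset.tablename", if_exists="replace")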

Sorry for the spam. Hopefully, this will help someone in the future.

brews commented Mar 09 '22

Here is some extra input that might be helpful in case geometry support is added.

The solution by @brews worked for a few rows in my case, but I needed to manually specify the schema to get BigQuery to parse the geometry column as GEOGRAPHY. This is what I used:

import geopandas as gpd
from google.cloud import bigquery

client = bigquery.Client()

table_id = "dataset.tablename"

df = gpd.read_file("my_file")

# Determine the schema: map numpy dtype kinds to BigQuery types,
# forcing the geometry column to GEOGRAPHY.
type_dict = {
    'b': 'BOOLEAN',
    'i': 'INTEGER',
    'f': 'FLOAT',
    'O': 'STRING',
    'S': 'STRING',
    'U': 'STRING',
}
schema = [
    {'name': col_name, 'type': 'GEOGRAPHY' if col_name == 'geometry' else type_dict.get(col_type.kind, 'STRING')}
    for (col_name, col_type) in df.dtypes.items()  # .iteritems() in older pandas
]

# https://cloud.google.com/bigquery/docs/pandas-gbq-migration#loading_a_pandas_dataframe_to_a_table
job_config = bigquery.LoadJobConfig(schema=schema)
job = client.load_table_from_dataframe(
    df.to_wkt(),  # same output as the workaround above
    table_id,
    job_config=job_config,
)
job.result()
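
For a file with, say, an integer id column plus the geometry column, this yields a schema like the following (illustrative):

[{'name': 'id', 'type': 'INTEGER'}, {'name': 'geometry', 'type': 'GEOGRAPHY'}]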

This approach worked for a small dataset, but for a larger set I quickly started running into all kinds of errors where BigQuery would not accept the polygons:

Invalid polygon loop: Edge 462 has duplicate vertex with edge 554
Invalid nesting: loop 1 should not contain loop 0

I found a potential solution using GeoJSON instead of WKT: https://stackoverflow.com/questions/62233152/uploading-to-bigquery-gis-invalid-nesting-loop-1-should-not-contain-loop-0

After quite some experimentation I found a way to create a DataFrame that BigQuery seemed to accept, similar to the solution above:

import json
import pandas as pd
import shapely.geometry

df_json = pd.DataFrame({
    col: (df[col] if col != 'geometry' else df[col].map(lambda x: json.dumps(shapely.geometry.mapping(x))))
    for col in df
})

This led to another error, similar to the one before:

Invalid geography value for column 'geometry', error: Polygon's first loop must be shell. It is nested in loop 2

I then found that BigQuery can fix this type of data when loading from WKT, via st_geogfromtext(..., make_valid => TRUE): https://gis.stackexchange.com/a/376870. Now I have a fully working solution. It is relatively simple, but it requires overwriting the table after creation:

import geopandas as gpd
from google.cloud import bigquery

client = bigquery.Client()

table_id = "dataset.tablename"
df = gpd.read_file("my_file")

df.to_wkt().to_gbq(table_id, if_exists="replace")

cols = ",".join("st_geogfromtext(geometry, make_valid => TRUE) as geometry" if col == "geometry" else col for col in df)

query = f"CREATE OR REPLACE TABLE {table_id} AS SELECT {cols} FROM {table_id}"
# print(query)
query_job = client.query(query)
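
With the print(query) above uncommented, for a table with columns id and geometry the generated statement looks like this (illustrative):

CREATE OR REPLACE TABLE dataset.tablename AS SELECT id,st_geogfromtext(geometry, make_valid => TRUE) as geometry FROM dataset.tablename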

I thought I'd share this solution in case others have similar issues.

It would be nice if make_valid=True could somehow be set in df.to_gbq or in bigquery.LoadJobConfig, but I have no idea whether the architecture allows this.

EDIT: I learned that shapely also has make_valid. This still didn't work with to_wkt because of the issue mentioned before, but it does work with the GeoJSON solution, which handled all the data I tested as well:

import geopandas as gpd
import json
import pandas as pd
import shapely.geometry
from shapely.validation import make_valid
from google.cloud import bigquery

client = bigquery.Client()

table_id = "dataset.tablename"

df = gpd.read_file("my_file")

# Determine the schema as before.
type_dict = {
    'b': 'BOOLEAN',
    'i': 'INTEGER',
    'f': 'FLOAT',
    'O': 'STRING',
    'S': 'STRING',
    'U': 'STRING',
}
schema = [
    {'name': col_name, 'type': 'GEOGRAPHY' if col_name == 'geometry' else type_dict.get(col_type.kind, 'STRING')}
    for (col_name, col_type) in df.dtypes.items()
]

# Repair invalid geometries with make_valid before serializing them as GeoJSON.
df_json = pd.DataFrame({
    col: (df[col] if col != 'geometry' else df[col].map(lambda x: json.dumps(shapely.geometry.mapping(make_valid(x)))))
    for col in df
})

job_config = bigquery.LoadJobConfig(schema=schema)
job = client.load_table_from_dataframe(
    df_json,
    table_id,
    job_config=job_config,
)
job.result()
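
Unlike the WKT route, this loads the GEOGRAPHY column directly, so no post-load CREATE OR REPLACE step is needed.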

Riezebos commented Jun 02 '22