geopandas icon indicating copy to clipboard operation
geopandas copied to clipboard

BUG: .to_postgis() unwantedly convert empty strings as NULL

Open ygversil opened this issue 2 years ago • 2 comments

Problem description

When trying to append a GeoDataFrame into a Postgis table, empty strings are all converted to NULL values. This is not always wanted.

For example, say you have a simple Postgis table with a comments column having type text, with constraint not null (design decisions may be poor, but I don't have permissions to change table definition. In this design, having no comment should be represented as an empty string, not as a null value).

create table my_point_layer (
  id bigint primary key,
  info smallint,
  comments text not null default '',
  geometry geometry(Point, 4326)
);

With this existing table, try to append a row into where comments is an empty string.

from numpy import nan as NaN
from shapely.geometry import Point
from sqlalchemy import create_engine
import geopandas as gpd

engine = create_engine('postgresql:///<your_postgis_connection_info>')

df = gpd.GeoDataFrame({'info': [NaN], 'comments': [''], 'geometry': [Point(0, 0)]}, crs='EPSG:4326')
df.index.rename('id', inplace=True)

df.to_postgis('my_point_layer', engine, if_exists='append', index=True)

Notice that we want to insert a row where info column has np.nan value, and comments column has an empty string ''. This should work since info is nullable while comments is not. But instead we get a traceback.

NotNullViolation                          Traceback (most recent call last)
Cell In [4], line 1
----> 1 df.to_postgis('my_point_layer', engine, if_exists='append', index=True)

File /usr/lib/python3.10/site-packages/geopandas/geodataframe.py:1960, in GeoDataFrame.to_postgis(self, name, con, schema, if_exists, index, index_label, chunksize, dtype)
   1900 def to_postgis(
   1901     self,
   1902     name,
   (...)
   1909     dtype=None,
   1910 ):
   1911     """
   1912     Upload GeoDataFrame into PostGIS database.
   1913 
   (...)
   1958 
   1959     """
-> 1960     geopandas.io.sql._write_postgis(
   1961         self, name, con, schema, if_exists, index, index_label, chunksize, dtype
   1962     )

File /usr/lib/python3.10/site-packages/geopandas/io/sql.py:434, in _write_postgis(gdf, name, con, schema, if_exists, index, index_label, chunksize, dtype)
    430                 raise ValueError(msg)
    432 with _get_conn(con) as connection:
--> 434     gdf.to_sql(
    435         name,
    436         connection,
    437         schema=schema_name,
    438         if_exists=if_exists,
    439         index=index,
    440         index_label=index_label,
    441         chunksize=chunksize,
    442         dtype=dtype,
    443         method=_psql_insert_copy,
    444     )
    446 return

File /usr/lib/python3.10/site-packages/pandas/core/generic.py:2951, in NDFrame.to_sql(self, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)
   2794 """
   2795 Write records stored in a DataFrame to a SQL database.
   2796 
   (...)
   2947 [(1,), (None,), (2,)]
   2948 """  # noqa:E501
   2949 from pandas.io import sql
-> 2951 return sql.to_sql(
   2952     self,
   2953     name,
   2954     con,
   2955     schema=schema,
   2956     if_exists=if_exists,
   2957     index=index,
   2958     index_label=index_label,
   2959     chunksize=chunksize,
   2960     dtype=dtype,
   2961     method=method,
   2962 )

File /usr/lib/python3.10/site-packages/pandas/io/sql.py:698, in to_sql(frame, name, con, schema, if_exists, index, index_label, chunksize, dtype, method, engine, **engine_kwargs)
    693 elif not isinstance(frame, DataFrame):
    694     raise NotImplementedError(
    695         "'frame' argument should be either a Series or a DataFrame"
    696     )
--> 698 return pandas_sql.to_sql(
    699     frame,
    700     name,
    701     if_exists=if_exists,
    702     index=index,
    703     index_label=index_label,
    704     schema=schema,
    705     chunksize=chunksize,
    706     dtype=dtype,
    707     method=method,
    708     engine=engine,
    709     **engine_kwargs,
    710 )

File /usr/lib/python3.10/site-packages/pandas/io/sql.py:1742, in SQLDatabase.to_sql(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype, method, engine, **engine_kwargs)
   1730 sql_engine = get_engine(engine)
   1732 table = self.prep_table(
   1733     frame=frame,
   1734     name=name,
   (...)
   1739     dtype=dtype,
   1740 )
-> 1742 total_inserted = sql_engine.insert_records(
   1743     table=table,
   1744     con=self.connectable,
   1745     frame=frame,
   1746     name=name,
   1747     index=index,
   1748     schema=schema,
   1749     chunksize=chunksize,
   1750     method=method,
   1751     **engine_kwargs,
   1752 )
   1754 self.check_case_sensitive(name=name, schema=schema)
   1755 return total_inserted

File /usr/lib/python3.10/site-packages/pandas/io/sql.py:1325, in SQLAlchemyEngine.insert_records(self, table, con, frame, name, index, schema, chunksize, method, **engine_kwargs)
   1322 from sqlalchemy import exc
   1324 try:
-> 1325     return table.insert(chunksize=chunksize, method=method)
   1326 except exc.SQLAlchemyError as err:
   1327     # GH34431
   1328     # https://stackoverflow.com/a/67358288/6067848
   1329     msg = r"""(\(1054, "Unknown column 'inf(e0)?' in 'field list'"\))(?#
   1330     )|inf can not be used with MySQL"""

File /usr/lib/python3.10/site-packages/pandas/io/sql.py:951, in SQLTable.insert(self, chunksize, method)
    948     break
    950 chunk_iter = zip(*(arr[start_i:end_i] for arr in data_list))
--> 951 num_inserted = exec_insert(conn, keys, chunk_iter)
    952 # GH 46891
    953 if is_integer(num_inserted):

File /usr/lib/python3.10/site-packages/geopandas/io/sql.py:323, in _psql_insert_copy(tbl, conn, keys, data_iter)
    319 with dbapi_conn.cursor() as cur:
    320     sql = 'COPY "{}"."{}" ({}) FROM STDIN WITH CSV'.format(
    321         tbl.table.schema, tbl.table.name, columns
    322     )
--> 323     cur.copy_expert(sql=sql, file=s_buf)

NotNullViolation: ERREUR:  null value in column "comments" violates not-null constraint in relation "my_point_layer"
DETAIL:  Failing row contains (0, null, null, 0101000020E610000000000000000000000000000000000000).
CONTEXT:  COPY my_point_layer, ligne 1 : « 0,,,0101000020E610000000000000000000000000000000000000 »

We can see that np.nan has been correctly converted to null. However, same thing happened to the empty string, which should have stayed an empty string.

Additional thoughts

Not sure at which layer it is best to report this (GeoPandas, Pandas, psycopg2). But I think the problem lies in GeoPandas hardcoded _psql_insert_copy function in which each row of the GeoDataFrame is written to a CSV buffer using Python default's QUOTE_MINIMAL strategy. Thus empty strings are not quoted and are read as null by PostgreSQL copy statement.

Changing this to QUOTE_NONNUMERIC would likely break everything and is not backward compatible. Adding an optional parameter to to_postgis() allowing to select quoting strategy would be weird, as CSV is an implementation detail.

Maybe a good way to go is to add an optional method parameter. If method is None, then to_postgis() would use the _psql_insert_copy function. Else, it should be a callable to be used instead. This mimic Pandas to_sql() behavior.

Output of geopandas.show_versions()

SYSTEM INFO

python : 3.10.7 (main, Sep 6 2022, 21:22:27) [GCC 12.2.0] executable : /usr/bin/python machine : Linux-5.19.13-zen1-1-zen-x86_64-with-glibc2.36

GEOS, GDAL, PROJ INFO

GEOS : 3.11.0 GEOS lib : /usr/lib/libgeos_c.so GDAL : 3.5.1 GDAL data dir: /usr/share/gdal PROJ : 9.0.1 PROJ data dir: /usr/share/proj

PYTHON DEPENDENCIES

geopandas : 0.11.1 numpy : 1.23.3 pandas : 1.4.4 pyproj : 3.3.1 shapely : 1.8.2 fiona : 1.8.21 geoalchemy2: 0.8.4 geopy : None matplotlib : 3.5.2 mapclassify: None pygeos : None pyogrio : None psycopg2 : 2.9.3 (dt dec pq3 ext lo64) pyarrow : None rtree : None

ygversil avatar Oct 10 '22 15:10 ygversil

Hey, thanks for the thorough report!

Maybe a good way to go is to add an optional method parameter. If method is None, then to_postgis() would use the _psql_insert_copy function. Else, it should be a callable to be used instead. This mimic Pandas to_sql() behavior.

I would vote for this solution. Consistency with pandas is always preferable. It solves the issue you describe and retains full backwards compatibility. And does not add a lot of complexity on our side. Would you be willing to draft a PR?

martinfleis avatar Oct 13 '22 19:10 martinfleis

I'm running into this bug and wondering if you all have suggestions for an interim fix?

diehl avatar Sep 16 '23 20:09 diehl