geopandas
BUG: .to_postgis() unwantedly converts empty strings to NULL
Problem description
When appending a GeoDataFrame to a PostGIS table, empty strings are all converted to NULL values. This is not always wanted.
For example, say you have a simple PostGIS table with a `comments` column of type `text` carrying a `not null` constraint (the design decisions may be poor, but I don't have permission to change the table definition; in this design, having no comment should be represented as an empty string, not as a `null` value).
```sql
create table my_point_layer (
    id bigint primary key,
    info smallint,
    comments text not null default '',
    geometry geometry(Point, 4326)
);
```
With this existing table, try to append a row in which `comments` is an empty string.
```python
from numpy import nan as NaN
from shapely.geometry import Point
from sqlalchemy import create_engine
import geopandas as gpd

engine = create_engine('postgresql:///<your_postgis_connection_info>')
df = gpd.GeoDataFrame({'info': [NaN], 'comments': [''], 'geometry': [Point(0, 0)]}, crs='EPSG:4326')
df.index.rename('id', inplace=True)
df.to_postgis('my_point_layer', engine, if_exists='append', index=True)
```
Notice that we want to insert a row in which the `info` column holds `np.nan` and the `comments` column holds an empty string `''`. This should work, since `info` is nullable while `comments` is not. But instead we get a traceback.
NotNullViolation Traceback (most recent call last)
Cell In [4], line 1
----> 1 df.to_postgis('my_point_layer', engine, if_exists='append', index=True)
File /usr/lib/python3.10/site-packages/geopandas/geodataframe.py:1960, in GeoDataFrame.to_postgis(self, name, con, schema, if_exists, index, index_label, chunksize, dtype)
1900 def to_postgis(
1901 self,
1902 name,
(...)
1909 dtype=None,
1910 ):
1911 """
1912 Upload GeoDataFrame into PostGIS database.
1913
(...)
1958
1959 """
-> 1960 geopandas.io.sql._write_postgis(
1961 self, name, con, schema, if_exists, index, index_label, chunksize, dtype
1962 )
File /usr/lib/python3.10/site-packages/geopandas/io/sql.py:434, in _write_postgis(gdf, name, con, schema, if_exists, index, index_label, chunksize, dtype)
430 raise ValueError(msg)
432 with _get_conn(con) as connection:
--> 434 gdf.to_sql(
435 name,
436 connection,
437 schema=schema_name,
438 if_exists=if_exists,
439 index=index,
440 index_label=index_label,
441 chunksize=chunksize,
442 dtype=dtype,
443 method=_psql_insert_copy,
444 )
446 return
File /usr/lib/python3.10/site-packages/pandas/core/generic.py:2951, in NDFrame.to_sql(self, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)
2794 """
2795 Write records stored in a DataFrame to a SQL database.
2796
(...)
2947 [(1,), (None,), (2,)]
2948 """ # noqa:E501
2949 from pandas.io import sql
-> 2951 return sql.to_sql(
2952 self,
2953 name,
2954 con,
2955 schema=schema,
2956 if_exists=if_exists,
2957 index=index,
2958 index_label=index_label,
2959 chunksize=chunksize,
2960 dtype=dtype,
2961 method=method,
2962 )
File /usr/lib/python3.10/site-packages/pandas/io/sql.py:698, in to_sql(frame, name, con, schema, if_exists, index, index_label, chunksize, dtype, method, engine, **engine_kwargs)
693 elif not isinstance(frame, DataFrame):
694 raise NotImplementedError(
695 "'frame' argument should be either a Series or a DataFrame"
696 )
--> 698 return pandas_sql.to_sql(
699 frame,
700 name,
701 if_exists=if_exists,
702 index=index,
703 index_label=index_label,
704 schema=schema,
705 chunksize=chunksize,
706 dtype=dtype,
707 method=method,
708 engine=engine,
709 **engine_kwargs,
710 )
File /usr/lib/python3.10/site-packages/pandas/io/sql.py:1742, in SQLDatabase.to_sql(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype, method, engine, **engine_kwargs)
1730 sql_engine = get_engine(engine)
1732 table = self.prep_table(
1733 frame=frame,
1734 name=name,
(...)
1739 dtype=dtype,
1740 )
-> 1742 total_inserted = sql_engine.insert_records(
1743 table=table,
1744 con=self.connectable,
1745 frame=frame,
1746 name=name,
1747 index=index,
1748 schema=schema,
1749 chunksize=chunksize,
1750 method=method,
1751 **engine_kwargs,
1752 )
1754 self.check_case_sensitive(name=name, schema=schema)
1755 return total_inserted
File /usr/lib/python3.10/site-packages/pandas/io/sql.py:1325, in SQLAlchemyEngine.insert_records(self, table, con, frame, name, index, schema, chunksize, method, **engine_kwargs)
1322 from sqlalchemy import exc
1324 try:
-> 1325 return table.insert(chunksize=chunksize, method=method)
1326 except exc.SQLAlchemyError as err:
1327 # GH34431
1328 # https://stackoverflow.com/a/67358288/6067848
1329 msg = r"""(\(1054, "Unknown column 'inf(e0)?' in 'field list'"\))(?#
1330 )|inf can not be used with MySQL"""
File /usr/lib/python3.10/site-packages/pandas/io/sql.py:951, in SQLTable.insert(self, chunksize, method)
948 break
950 chunk_iter = zip(*(arr[start_i:end_i] for arr in data_list))
--> 951 num_inserted = exec_insert(conn, keys, chunk_iter)
952 # GH 46891
953 if is_integer(num_inserted):
File /usr/lib/python3.10/site-packages/geopandas/io/sql.py:323, in _psql_insert_copy(tbl, conn, keys, data_iter)
319 with dbapi_conn.cursor() as cur:
320 sql = 'COPY "{}"."{}" ({}) FROM STDIN WITH CSV'.format(
321 tbl.table.schema, tbl.table.name, columns
322 )
--> 323 cur.copy_expert(sql=sql, file=s_buf)
NotNullViolation: ERROR: null value in column "comments" violates not-null constraint in relation "my_point_layer"
DETAIL: Failing row contains (0, null, null, 0101000020E610000000000000000000000000000000000000).
CONTEXT: COPY my_point_layer, line 1: "0,,,0101000020E610000000000000000000000000000000000000"
We can see that `np.nan` has been correctly converted to `null`. However, the same thing happened to the empty string, which should have stayed an empty string.
Additional thoughts
I'm not sure at which layer it is best to report this (GeoPandas, pandas, psycopg2), but I think the problem lies in GeoPandas' hardcoded `_psql_insert_copy` function, in which each row of the GeoDataFrame is written to a CSV buffer using Python's default `QUOTE_MINIMAL` strategy. Thus empty strings are not quoted and are read as `null` by PostgreSQL's `COPY` statement.
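The quoting behaviour is easy to reproduce with the standard `csv` module alone (the row values below simply mirror the example above):

```python
import csv
import io

# id, info, comments, geometry -- mirroring the example row above
row = [0, float("nan"), "", "POINT(0 0)"]

# Default QUOTE_MINIMAL: the empty string is written without quotes,
# so PostgreSQL's COPY ... CSV reads it back as NULL.
buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue().rstrip("\r\n"))  # 0,nan,,POINT(0 0)

# QUOTE_NONNUMERIC quotes every non-numeric field, so the empty string
# survives as "" -- but it would also quote None the same way, which is
# one reason simply switching strategies is not a safe fix.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC).writerow(row)
print(buf.getvalue().rstrip("\r\n"))  # 0,nan,"","POINT(0 0)"
```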
Changing this to `QUOTE_NONNUMERIC` would likely break everything and is not backward compatible. Adding an optional parameter to `to_postgis()` for selecting the quoting strategy would be awkward, since the CSV buffer is an implementation detail.
Maybe a good way to go is to add an optional `method` parameter. If `method` is `None`, then `to_postgis()` would use the `_psql_insert_copy` function; otherwise, it should be a callable to be used instead. This mimics pandas' `to_sql()` behavior.
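Such a callable could also sidestep the quoting question entirely at the COPY level: PostgreSQL's `COPY ... FROM` supports a `FORCE_NOT_NULL` option that makes unquoted empty fields in the listed columns read back as `''` instead of NULL. A minimal sketch of just the statement-building part (the `copy_sql` helper and the `not_null_columns` parameter are hypothetical, not existing GeoPandas API):

```python
def copy_sql(schema, table, columns, not_null_columns=()):
    """Build a COPY ... FROM STDIN statement (hypothetical helper).

    FORCE_NOT_NULL tells COPY not to match the listed columns against
    the NULL string, so an unquoted empty field is read as '' there.
    """
    cols = ", ".join('"{}"'.format(c) for c in columns)
    sql = 'COPY "{}"."{}" ({}) FROM STDIN WITH (FORMAT CSV'.format(schema, table, cols)
    if not_null_columns:
        forced = ", ".join('"{}"'.format(c) for c in not_null_columns)
        sql += ", FORCE_NOT_NULL ({})".format(forced)
    return sql + ")"

print(copy_sql("public", "my_point_layer",
               ["id", "info", "comments", "geometry"],
               not_null_columns=["comments"]))
```

The open question with this route would be how a callable learns which columns carry a `not null` constraint; a `method` parameter as proposed above would at least let users encode that knowledge themselves.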
Output of geopandas.show_versions()
SYSTEM INFO
python     : 3.10.7 (main, Sep 6 2022, 21:22:27) [GCC 12.2.0]
executable : /usr/bin/python
machine    : Linux-5.19.13-zen1-1-zen-x86_64-with-glibc2.36

GEOS, GDAL, PROJ INFO
GEOS          : 3.11.0
GEOS lib      : /usr/lib/libgeos_c.so
GDAL          : 3.5.1
GDAL data dir : /usr/share/gdal
PROJ          : 9.0.1
PROJ data dir : /usr/share/proj

PYTHON DEPENDENCIES
geopandas  : 0.11.1
numpy      : 1.23.3
pandas     : 1.4.4
pyproj     : 3.3.1
shapely    : 1.8.2
fiona      : 1.8.21
geoalchemy2: 0.8.4
geopy      : None
matplotlib : 3.5.2
mapclassify: None
pygeos     : None
pyogrio    : None
psycopg2   : 2.9.3 (dt dec pq3 ext lo64)
pyarrow    : None
rtree      : None
Hey, thanks for the thorough report!
> Maybe a good way to go is to add an optional `method` parameter. If `method` is `None`, then `to_postgis()` would use the `_psql_insert_copy` function; otherwise, it should be a callable to be used instead. This mimics pandas' `to_sql()` behavior.
I would vote for this solution. Consistency with pandas is always preferable. It solves the issue you describe and retains full backwards compatibility. And does not add a lot of complexity on our side. Would you be willing to draft a PR?
I'm running into this bug and wondering if you all have suggestions for an interim fix?