PyAthena
PyAthena copied to clipboard
PandasCursor incorrectly converts some strings as nulls
With pyathena 1.10.0 and pandas 1.0.0:
import pyathena
from pyathena.pandas_cursor import PandasCursor
con = pyathena.connect()
query = "select * from (values ('', 'a'), ('N/A', 'a'), ('NULL', 'a'), (NULL, 'a'))"
print(con.cursor().execute(query).fetchall())
print(con.cursor(PandasCursor).execute(query).fetchall())
Expected output:
[('', 'a'), ('N/A', 'a'), ('NULL', 'a'), (None, 'a')]
[('', 'a'), ('N/A', 'a'), ('NULL', 'a'), (nan, 'a')]
Actual output:
[('', 'a'), ('N/A', 'a'), ('NULL', 'a'), (None, 'a')]
[(nan, 'a'), (nan, 'a'), (nan, 'a'), (nan, 'a')]
Underlying reason: The csv file on AWS that PandasCursor uses looks like this
"_col0","_col1"
"","a"
"N/A","a"
"NULL","a"
,"a"
in https://github.com/laughingman7743/PyAthena/blob/master/pyathena/result_set.py#L445 the call to read_csv
should has at least something like na_values=['']
and keep_default_na=False
.
The problem is how to get pandas read_csv
to treat a quoted empty string as the empty string, and an unquoted empty string as a null (in my opinion "quoting=1" (ALL) should do it, but it doesn't).
I checked it in the following branch, but I can't judge NULL and empty string well. 😢 https://github.com/laughingman7743/PyAthena/pull/120 I want to know a good way to handle these well. 😭
I got the opposite error. Some Nulls are converting to empty string