PyAthena PandasCursor incorrectly converts some strings as nulls

PandasCursor incorrectly converts some strings as nulls

Open EdwardJRoss opened this issue 5 years ago • 2 comments

With pyathena 1.10.0 and pandas 1.0.0:

import pyathena
from pyathena.pandas_cursor import PandasCursor

con = pyathena.connect()

query = "select * from (values ('', 'a'), ('N/A', 'a'), ('NULL', 'a'), (NULL, 'a'))"

print(con.cursor().execute(query).fetchall())
print(con.cursor(PandasCursor).execute(query).fetchall())

Expected output:

[('', 'a'), ('N/A', 'a'), ('NULL', 'a'), (None, 'a')]
[('', 'a'), ('N/A', 'a'), ('NULL', 'a'), (nan, 'a')]

Actual output:

[('', 'a'), ('N/A', 'a'), ('NULL', 'a'), (None, 'a')]
[(nan, 'a'), (nan, 'a'), (nan, 'a'), (nan, 'a')]

Underlying reason: The csv file on AWS that PandasCursor uses looks like this

"_col0","_col1"
"","a"
"N/A","a"
"NULL","a"
,"a"

in https://github.com/laughingman7743/PyAthena/blob/master/pyathena/result_set.py#L445 the call to read_csv should has at least something like na_values=[''] and keep_default_na=False.

The problem is how to get pandas read_csv to treat a quoted empty string as the empty string, and an unquoted empty string as a null (in my opinion "quoting=1" (ALL) should do it, but it doesn't).

Feb 02 '20 23:02 EdwardJRoss

I checked it in the following branch, but I can't judge NULL and empty string well. 😢 https://github.com/laughingman7743/PyAthena/pull/120 I want to know a good way to handle these well. 😭

Feb 06 '20 15:02 laughingman7743

I got the opposite error. Some Nulls are converting to empty string

Aug 28 '23 22:08 acpguedes

PyAthena PyAthena copied to clipboard

PandasCursor incorrectly converts some strings as nulls

PyAthena
PyAthena copied to clipboard