PyAthena icon indicating copy to clipboard operation
PyAthena copied to clipboard

PandasCursor converts NULL values in Sttring columns to empty String

Open jurgispods opened this issue 5 years ago • 5 comments

In #117, a fix was made for NULL results not being returned by PandasCursor. This now work as expected:

> ret = conn.cursor(PandasCursor).execute("select * from (values (1), (NULL))").fetchall()
> ret
[(1,), (<NA>,)]
> [pd.isna(x[0]) for x in ret]
[False, True]

However, NULL values for String columns are secretly converted to empty Strings:

> ret = conn.cursor(PandasCursor).execute("select * from (values ('bla'), (NULL))").fetchall()
> ret
[('bla',), ('',)]
> [pd.isna(x[0]) for x in ret]
[False, False]

Is this the expected behaviour? I believe NULL should always be converted to NaN, regardless of na_values or keep_default_na.

jurgispods avatar Oct 26 '20 13:10 jurgispods

ret = conn.cursor(PandasCursor).execute("select * from (values (1), (NULL))").fetchall() ret [('bla',), ('',)] [pd.isna(x[0]) for x in ret] [False, False]

It seems that the query in this example is wrong. In this case, what kind of query are you executing? It may be the same issue as #118.

laughingman7743 avatar Oct 26 '20 13:10 laughingman7743

@laughingman7743 Apologies, there was a copy-paste error in my second query. I've edited the original post.

jurgispods avatar Oct 26 '20 15:10 jurgispods

And regarding #118: I understand that I can use na_values='' to force converting the respective date in the Athena-generated CSV to NaN. But this means losing the possibility to distinguish between NULL and the empty String in the original Athena table.

I guess this cannot be circumvented, as the information is already lost in the Athena-generated CSV?

jurgispods avatar Oct 26 '20 15:10 jurgispods

If you check the CSV file output from Athena, you'll see that empty characters are double-quoted, while null characters are not.

"_col0"
"blah"

""

If you can handle them well, you should be able to identify null and empty characters. But I don't know how to configure the read_csv method's options to better read this CSV.

laughingman7743 avatar Oct 27 '20 12:10 laughingman7743

I see. One approach would be to disable quoting and remove the quotes afterwards. With your example in mind:

import pandas as pd
import csv

df = pd.read_csv('myfile.csv', skip_blank_lines=False, quoting=csv.QUOTE_NONE)
# remove quotes from String columns manually
df.select_dtypes([object]).apply(lambda col: col.str[1:-1])

This is not very elegant, but if CSVs written by Athena are guaranteed to contain quotes Strings, this should always work. What do you think?

jurgispods avatar Oct 27 '20 13:10 jurgispods