PyAthena
Converter class does not convert Athena string data to pandas str type
First of all, thank you for creating this library! It's been immensely helpful and I've used it in multiple contexts over several years and would love to contribute - especially if it helps solve my current problem!
With pyathena==1.10.7 and pandas==1.0.5, I am running the following code with the expectation that the converter class will cast the Athena string data type to the pandas str dtype.
```python
from pyathena import connect
from pyathena.pandas_cursor import PandasCursor
from pyathena.converter import Converter


class CustomPandasTypeConverter(Converter):
    def __init__(self):
        super(CustomPandasTypeConverter, self).__init__(
            mappings=None,
            types={
                'boolean': bool,
                'tinyint': int,
                'smallint': int,
                'integer': int,
                'bigint': int,
                'float': float,
                'real': float,
                'double': float,
                'decimal': float,
                'char': str,
                'varchar': str,
                'array': str,
                'map': str,
                'row': str,
                'varbinary': str,
                'json': str,
                'string': str,
            },
        )

    def convert(self, type_, value):
        # Not used in PandasCursor.
        pass


cur = connect(s3_staging_dir='<staging_directory_url>',
              region_name='<aws_region>',
              cursor_class=PandasCursor,
              converter=CustomPandasTypeConverter(),
              work_group='<workgroup_name>').cursor()
query = 'SELECT * FROM <schema>.<table>'
df = cur.execute(query).as_pandas()
df.dtypes
```
When I inspect the dtypes, Athena ints are converted to pandas ints, decimals are converted to floats, and strings are consistently returned as object dtypes. However, Athena string NULLs are cast as NaN, which requires explicit column-by-column fillna operations. This is particularly inconvenient, since I'm subsequently trying to convert the pandas DataFrame to a Spark DataFrame. Now that I've typed all this out, I'm guessing this is related to #118?
Also, I'm not sure whether this is the right place to ask, but are there any plans to implement a PySparkCursor for PyAthena? If not, can I help by contributing one?
The problem with #118 is that the string type cannot distinguish between empty strings and nulls. Currently, all empty strings are treated as null.
The reason the dtype is object is probably that the values are represented as Python str objects. Isn't the missing value of a str object a NaN? Is there any other way to represent it?
I don't plan to implement a PySparkCursor. Contributions are welcome. However, you are required to provide your own AWS test environment.
> The reason the dtype is object is probably that the values are represented as Python str objects.

That's my understanding too.
> Isn't the missing value of a str object a NaN? Is there any other way to represent it?

"NaN" is a contraction of "Not a Number", so it arguably shouldn't be used to describe the missing value for a string, though in practice it is. Until very recently, pandas int dtypes didn't support null values (only float did), and datetimes have their own NaT ("Not a Time") type. While not ideal, giving the user the choice to explicitly cast an Athena NULL string to an empty string ('') might be a good solution until someone comes up with an "NaS" type?
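To illustrate that choice (a minimal sketch using plain pandas, independent of PyAthena):

```python
import pandas as pd

# A string column comes back as object dtype; missing values appear as NaN/None.
s = pd.Series(['a', None, 'b'])
print(s.dtype)            # object
print(s.isna().sum())     # 1

# Explicitly cast the NULLs to empty strings.
filled = s.fillna('')
print(filled.tolist())    # ['a', '', 'b']
```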
> I don't plan to implement a PySparkCursor. Contributions are welcome. However, you are required to provide your own AWS test environment.

Thanks for the information. I'll look at the source code and see what I can come up with.
https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values

With pandas 1.0.0 or higher, it seems good to handle all missing values with pd.NA.
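For reference, a minimal sketch (assuming pandas >= 1.0 is installed) of the difference between the old object-backed strings and the new nullable "string" dtype, where missing values become pd.NA:

```python
import pandas as pd

# Old behavior: dtype=str is accepted but yields object dtype, missing -> NaN.
s_object = pd.Series(['a', 'b'], dtype=str).reindex([0, 1, 2])
print(s_object.dtype)          # object

# New behavior: the experimental "string" extension dtype, missing -> pd.NA.
s_string = pd.Series(['a', 'b'], dtype='string').reindex([0, 1, 2])
print(s_string.dtype)          # string
print(s_string[2] is pd.NA)    # True
```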
Good find. It seems, though, that:

- PyAthena casts strings as `object` dtype (i.e. NumPy arrays) instead of the new dedicated `StringDtype`
- The `StringDtype` is still experimental and probably shouldn't be a dependency of PyAthena
Any thoughts on a potential solution? Pinging @EdwardJRoss since he seems to understand the codebase better than I do.
@krishanunandy As noted upthread, the canonical way of representing strings in pandas, before the dedicated `StringDtype`, was as objects. `dtype=str` is accepted but results in an `object` dtype. This is just how pandas works and has nothing to do with PyAthena, at least not directly:
```python
>>> pd.Series(['a', 'b'], dtype=str).reindex([0, 1, 2])
0      a
1      b
2    NaN
dtype: object
```
Pandas of course is developing the new `StringDtype`, which is the ultimate solution here. You could change your custom converter to use `"string"` instead of `str` and use the new dtype.
However, given your use case, it is relevant that through version 3.0, Spark has no support for `pd.NA` in any of the new nullable dtypes (including `Int64`, which PyAthena supports). See SPARK-30966 in the Spark Jira for the ticket on this. Realistically you're probably looking at Spark 3.1 (maybe early 2021?) at the earliest.
My suggestion would be to write a wrapper method that calls `.as_pandas()`, takes the result, loops over the cursor description to get the type of each column, and runs `.fillna('')` on any columns with a string type.
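A rough sketch of such a wrapper (the helper name and the set of type names are my own; this assumes the second field of each `cursor.description` tuple holds the Athena type name, which should be verified against your PyAthena version):

```python
import pandas as pd

# Athena types that arrive as object-dtype string columns in pandas
# (assumed set; adjust to taste).
_STRING_TYPES = {'char', 'varchar', 'string', 'json', 'array', 'map', 'row'}


def as_pandas_filled(cursor):
    """Call .as_pandas() and replace missing values in string-typed
    columns with empty strings, using the cursor description to find them."""
    df = cursor.as_pandas()
    for desc in cursor.description:
        name, type_name = desc[0], desc[1]
        if type_name in _STRING_TYPES:
            df[name] = df[name].fillna('')
    return df
```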