PyAthena
Converter class does not convert Athena string data to pandas str type
First of all, thank you for creating this library! It's been immensely helpful and I've used it in multiple contexts over several years and would love to contribute - especially if it helps solve my current problem!
With pyathena==1.10.7 and pandas==1.0.5, I am running the following code with the expectation that the converter class will cast the Athena string data type to the pandas str dtype.
```python
from pyathena import connect
from pyathena.pandas_cursor import PandasCursor
from pyathena.converter import Converter


class CustomPandasTypeConverter(Converter):
    def __init__(self):
        super(CustomPandasTypeConverter, self).__init__(
            mappings=None,
            types={
                'boolean': bool,
                'tinyint': int,
                'smallint': int,
                'integer': int,
                'bigint': int,
                'float': float,
                'real': float,
                'double': float,
                'decimal': float,
                'char': str,
                'varchar': str,
                'array': str,
                'map': str,
                'row': str,
                'varbinary': str,
                'json': str,
                'string': str,
            },
        )

    def convert(self, type_, value):
        # Not used in PandasCursor.
        pass


cur = connect(s3_staging_dir='<staging_directory_url>',
              region_name='<aws_region>',
              cursor_class=PandasCursor,
              converter=CustomPandasTypeConverter(),
              work_group='<workgroup_name>').cursor()
query = 'SELECT * FROM <schema>.<table>'
df = cur.execute(query).as_pandas()
df.dtypes
```
When I inspect the dtypes, Athena ints are converted to pandas ints, decimals are converted to floats, and strings are consistently returned as object dtypes. However, Athena string NULLs are cast as NaN, which requires explicit column-by-column fillna operations. This is particularly inconvenient, since I'm subsequently trying to convert the pandas DataFrame to a Spark DataFrame. Now that I've typed all this out, I'm guessing this is related to #118?
Also, I'm not sure whether this is the right place to ask, but are there any plans to implement a PySparkCursor for PyAthena? If not, can I help by contributing one?
The problem with #118 is that the string type cannot distinguish between empty strings and nulls. Currently, all empty strings are treated as null.
The reason the dtype is object is probably that the values are represented as Python str objects. Isn't the missing value of a str object a NaN? Is there any other way to represent it?
I don't plan to implement a PySparkCursor. Contributions are welcome. However, you are required to provide your own AWS test environment.
> The reason the dtype is object is probably that the values are represented as Python str objects.

That's my understanding too.
> Isn't the missing value of a str object a NaN? Is there any other way to represent it?

"NaN" is a contraction of "Not a Number", so it arguably shouldn't be used to describe the missing value for a string, though in practice it is. Until very recently, pandas int dtypes didn't support null values (only float did), and datetimes have their own NaT ("Not a Time") type. While not ideal, giving the user the choice to explicitly cast an Athena NULL string to an empty string ('') might be a good solution until someone comes up with an "NaS" type?
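To illustrate that choice (a minimal sketch using plain pandas, independent of PyAthena):

```python
import pandas as pd

# A string column comes back as object dtype; missing values appear as NaN/None.
s = pd.Series(['a', None, 'b'])
print(s.dtype)            # object
print(s.isna().sum())     # 1

# Explicitly cast the NULLs to empty strings.
filled = s.fillna('')
print(filled.tolist())    # ['a', '', 'b']
```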
> I don't plan to implement a PySparkCursor. Contributions are welcome. However, you are required to provide your own AWS test environment.

Thanks for the information. I'll look at the source code and see what I can come up with.
https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values

With pandas 1.0.0 or higher, it seems good to handle all missing values with pd.NA.
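For reference, a minimal sketch (assuming pandas >= 1.0 is installed) of the difference between the old object-backed strings and the new nullable "string" dtype, where missing values become pd.NA:

```python
import pandas as pd

# Old behavior: dtype=str is accepted but yields object dtype, missing -> NaN.
s_object = pd.Series(['a', 'b'], dtype=str).reindex([0, 1, 2])
print(s_object.dtype)          # object

# New behavior: the experimental "string" extension dtype, missing -> pd.NA.
s_string = pd.Series(['a', 'b'], dtype='string').reindex([0, 1, 2])
print(s_string.dtype)          # string
print(s_string[2] is pd.NA)    # True
```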
Good find. It seems, though, that:

- PyAthena casts strings as `object` dtype (i.e. NumPy arrays) instead of the new dedicated `StringDtype`
- The `StringDtype` is still experimental and probably shouldn't be a dependency of PyAthena
Any thoughts on a potential solution? Pinging @EdwardJRoss since he seems to understand the codebase better than I do.
@krishanunandy As noted upthread, the canonical way of representing strings in pandas, before the dedicated `StringDtype`, was as objects. `dtype=str` is accepted but results in an `object` dtype. This is just how pandas works and has nothing to do with PyAthena, at least not directly:
```python
>>> pd.Series(['a', 'b'], dtype=str).reindex([0, 1, 2])
0      a
1      b
2    NaN
dtype: object
```
Pandas of course is developing the new `StringDtype`, which is the ultimate solution here. You could change your custom converter to use `"string"` instead of `str` and use the new dtype.
However, given your use case, it is relevant that through version 3.0, Spark has no support for `pd.NA` in any of the new nullable dtypes (including `Int64`, which PyAthena supports). See SPARK-30966 in the Spark Jira for the ticket on this. Realistically you're probably looking at Spark 3.1 (maybe early 2021?) at the earliest.
My suggestion would be to write a wrapper method that calls `.as_pandas()`, takes the result, loops over the cursor description to get the type of each column, and runs `.fillna('')` on any columns with a string type.
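A rough sketch of such a wrapper (the helper name and the set of type names are my own; this assumes the second field of each `cursor.description` tuple holds the Athena type name, which should be verified against your PyAthena version):

```python
import pandas as pd

# Athena types that arrive as object-dtype string columns in pandas
# (assumed set; adjust to taste).
_STRING_TYPES = {'char', 'varchar', 'string', 'json', 'array', 'map', 'row'}


def as_pandas_filled(cursor):
    """Call .as_pandas() and replace missing values in string-typed
    columns with empty strings, using the cursor description to find them."""
    df = cursor.as_pandas()
    for desc in cursor.description:
        name, type_name = desc[0], desc[1]
        if type_name in _STRING_TYPES:
            df[name] = df[name].fillna('')
    return df
```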