pylivy
pylivy copied to clipboard
Dataframe serialization loses schema information
We have found some edge cases when using read
to return a dataframe related to the transformation toJSON
and then json.loads
.
Specifically if all values of a column are null
then the column is dropped from the pandas dataframe. In addition we lose type information when coercing to json on all columns. For example, timestamps would be converted to pandas datatimes but are now returned as strings.
Is there an alternative approach we could explore for serializing the dataframe, such as using apache arrow as the toPandas
function does in pyspark?
Hi Jesse, sorry for the delay in getting back to you!
I've thought about implementing an arrrow-based approach for doing this, but never quite got around to it. I think it would make sense to implement this as "use arrow if it's installed, otherwise use JSON", and then set up extras_require
to declare arrow as an optional dependency.
What do you think? The hardest part is probably figuring out the snippet we need to run on the server side to spit out the arrow-serialised dataframe as a byte string and print it in the output of the Spark session. We may need to do something like base64 encode the byte string to avoid issues with Livy returning it inside the JSON response from the API.
Yeah, that sounds great. I might have some time to help with this as well. From looking at the livy API I was also uncertain how to encode the bytes, but will try to test this at this if I get some free time.
Hi guys , I have hit the similar issue recently where duplicated columns in the spark DF do not survive the serializing process into a pandas df. I think we can have cleaner behavior if you leverage the object_pairs_hook in the json.loads method here https://github.com/acroz/pylivy/blob/6c7bf18720345a557f6301ecc02a9c4f5b6fbf78/livy/session.py#L51
this should at least allow us to recover duplicated columns in the pandas DF
line
'{"firstname":"James","middlename":"","lastname":"Smith","id":"36636","gender":"M","salary":3000,"middlename":"","lastname":"Smith","id":"36636","gender":"M","salary":3000}'
def dict_rename_on_duplicates(ordered_pairs):
"""Rename duplicate keys."""
d = {}
for k, v in ordered_pairs:
if k in d:
k_1 = '{}*'.format(k)
d[k_1] = v
else:
d[k] = v
return d
x = json.loads(line, object_pairs_hook=dict_rename_on_duplicates)
x
{'firstname': 'James', 'middlename': '', 'lastname': 'Smith', 'id': '36636', 'gender': 'M', 'salary': 3000, 'middlename*': '', 'lastname*': 'Smith', 'id*': '36636', 'gender*': 'M', 'salary*': 3000}