
Dataframe serialization loses schema information

Open quartox opened this issue 4 years ago • 3 comments

We have found some edge cases when using read to return a dataframe, caused by the conversion through toJSON followed by json.loads.

Specifically, if all values of a column are null then the column is dropped from the pandas dataframe entirely. In addition, we lose type information on all columns when coercing to JSON. For example, timestamps would previously be converted to pandas datetimes but are now returned as strings.
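The column drop can be reproduced without Spark. toJSON omits null fields from each record, so an all-null column never appears in any row; the hypothetical rows below mimic that output:

```python
import json

import pandas as pd

# Hypothetical output of Spark's toJSON for a dataframe with columns
# (id, ts, score) where every value of "score" is null: toJSON omits
# null fields, so "score" appears in no record at all.
rows = [
    '{"id":1,"ts":"2021-01-12 18:00:00"}',
    '{"id":2,"ts":"2021-01-12 19:00:00"}',
]

df = pd.DataFrame([json.loads(r) for r in rows])

# "score" is gone entirely, and "ts" comes back as a plain string
# (object) column instead of datetime64.
print(df.columns.tolist())  # ['id', 'ts']
print(df["ts"].dtype)       # object
```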

Is there an alternative approach we could explore for serializing the dataframe, such as using Apache Arrow as the toPandas function does in PySpark?

quartox avatar Jan 12 '21 18:01 quartox

Hi Jesse, sorry for the delay in getting back to you!

I've thought about implementing an Arrow-based approach for doing this, but never quite got around to it. I think it would make sense to implement this as "use Arrow if it's installed, otherwise use JSON", and then set up extras_require to declare arrow as an optional dependency.
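The "use Arrow if installed, otherwise JSON" fallback could look something like this (a sketch; the function name and flag are mine, not pylivy's):

```python
# Probe for the optional dependency once at import time. With an
# extras_require entry, users could opt in via e.g. `pip install
# pylivy[arrow]` (hypothetical extra name).
try:
    import pyarrow  # noqa: F401

    HAVE_PYARROW = True
except ImportError:
    HAVE_PYARROW = False


def pick_serialiser() -> str:
    """Prefer Arrow when the optional dependency is present."""
    return "arrow" if HAVE_PYARROW else "json"
```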

What do you think? The hardest part is probably figuring out the snippet we need to run on the server side to spit out the arrow-serialised dataframe as a byte string and print it in the output of the Spark session. We may need to do something like base64 encode the byte string to avoid issues with Livy returning it inside the JSON response from the API.

acroz avatar Jan 27 '21 19:01 acroz

Yeah, that sounds great. I might have some time to help with this as well. From looking at the Livy API I was also uncertain how to encode the bytes, but will try to test this if I get some free time.

quartox avatar Jan 27 '21 22:01 quartox

Hi, I have hit a similar issue recently, where duplicated columns in the Spark DF do not survive serialization into a pandas DF. I think we can get cleaner behaviour by leveraging the object_pairs_hook argument of json.loads here: https://github.com/acroz/pylivy/blob/6c7bf18720345a557f6301ecc02a9c4f5b6fbf78/livy/session.py#L51

This should at least allow us to recover duplicated columns in the pandas DF:

```python
import json

line = (
    '{"firstname":"James","middlename":"","lastname":"Smith","id":"36636",'
    '"gender":"M","salary":3000,"middlename":"","lastname":"Smith",'
    '"id":"36636","gender":"M","salary":3000}'
)


def dict_rename_on_duplicates(ordered_pairs):
    """Rename duplicate keys by appending '*' until they are unique."""
    d = {}
    for k, v in ordered_pairs:
        while k in d:  # also handles keys that occur more than twice
            k += "*"
        d[k] = v
    return d


x = json.loads(line, object_pairs_hook=dict_rename_on_duplicates)
# x == {'firstname': 'James', 'middlename': '', 'lastname': 'Smith',
#       'id': '36636', 'gender': 'M', 'salary': 3000, 'middlename*': '',
#       'lastname*': 'Smith', 'id*': '36636', 'gender*': 'M', 'salary*': 3000}
```

ryanpetm avatar Jan 04 '22 16:01 ryanpetm