param icon indicating copy to clipboard operation
param copied to clipboard

Fix dataframe (de)serialization

Open ektar opened this issue 3 years ago • 15 comments

Resolves #599

ektar avatar Feb 03 '22 04:02 ektar

Thanks for pushing this fix! I have a few concerns unfortunately. First, it seems quite inefficient to serialize to a JSON string and then load it again and do the reverse on deserialization. Also a bit confused as even a very simple df doesn't roundtrip for me using this approach:

Screen Shot 2022-02-03 at 17 38 24

The other thing which is not necessarily a problem is that this includes a schema in the data declaration, e.g. for a simple table this might look like this:

{"schema":{"fields":[{"name":"index","type":"integer"},{"name":0,"type":"number"},{"name":1,"type":"number"},{"name":2,"type":"number"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"0":0.5529858391,"1":0.3236336179,"2":0.8076434981},{"index":1,"0":0.6460423326,"1":0.3008845818,"2":0.4971573288},{"index":2,"0":0.0956722025,"1":0.4527357795,"2":0.7199068404},{"index":3,"0":0.4972505762,"1":0.5035357012,"2":0.4823526223},{"index":4,"0":0.5667421155,"1":0.4730645023,"2":0.5405153226},{"index":5,"0":0.212481649,"1":0.3453984203,"2":0.9464833136},{"index":6,"0":0.287443691,"1":0.1836393869,"2":0.6931539396},{"index":7,"0":0.4739893758,"1":0.3493285907,"2":0.799432673},{"index":8,"0":0.7580364284,"1":0.354098563,"2":0.7434683717},{"index":9,"0":0.3159887857,"1":0.8479990624,"2":0.5183711565}]}

This is helpful so it can accurately decode the datetime type or other custom types but it's also a bit odd since we try to keep the schema separate from the data.

philippjfr avatar Feb 03 '22 16:02 philippjfr

I will have a closer look at this shortly but looking at the schema Philipp just posted, I have to agree...it is definitely weird to have a schema that includes so many data values like that.

jlstevens avatar Feb 03 '22 16:02 jlstevens

I have to agree...it is definitely weird to have a schema that includes so many data values like that.

The opposite is the case, the data serialization includes the schema, not the other way around.

philippjfr avatar Feb 03 '22 17:02 philippjfr

Thanks for the correction, though either way, these two things shouldn't be mixed together like that!

jlstevens avatar Feb 03 '22 17:02 jlstevens

Good catch!

Investigated, seems issue just with the pandas "table" orient - roundtrips work if df created using records({0: [], 1: []...}), fails if df was created just with a numeric array like you show.

I added test with dataframe created with an array instead of records - test fails as expected, then removed the 'orient' option to just have default... the problem now is that since schema isn't present anymore, pandas has no way to read the date strings and infer they are supposed to be datetype vs string.

Summary thus far - without more significant modifications to give pandas type hints (or make panda's orient=table work for matrices), output dataframe won't match input.

A few options:

  1. Use orient='table' work with pandas devs to fix orient='table' for matrix input. Rationale - functionality currently broken for tables with well defined rows/columns that have dates, which seems a primary use of pandas dataframes. Purely numeric tables seem mostly used for tests, as pandas functions for these offer little advantages over numpy. Schema is generally helpful to provide to pandas during deser to ensure correct dtypes coming back, not rely on inference logic.
  2. Use default orient, specify date format (to preserve time zone), leave it to user or future capability to cast to appropriate types.

orient='table': dates and df's constructed with record format works, df's constructed with matrices fail

def serde_func(df):
    orient='table'
    return pd.read_json(df.to_json(orient=orient), orient=orient)

def non_eq_disp(*dfs):
    for df in dfs:
        display(df)
    for df in dfs:
        display(df.dtypes)

def run_test(df):
    new_df = serde_func(df)
    eq = df.equals(new_df)
    print(f'Equal: {df.equals(new_df)}')
    if not eq:
        non_eq_disp(df, new_df)
        
print('Date test')
df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3],
    }
)
run_test(df)

print('Numeric test')
df = pd.DataFrame([[3, 3], [4, 2]])
run_test(df)

orient='table' output:

Date test
Equal: True
Numeric test
---------------------------------------------------------------------------
IntCastingNaNError                        Traceback (most recent call last)
Input In [74], in <module>
     15 print('Numeric test')
     16 df = pd.DataFrame([[3, 3], [4, 2]])
...
File ~/opt/anaconda3/envs/dev-param/lib/python3.10/site-packages/pandas/core/dtypes/cast.py:1193, in astype_float_to_int_nansafe(values, dtype, copy)
   1189 """
   1190 astype with a check preventing converting NaN to an meaningless integer value.
   1191 """
   1192 if not np.isfinite(values).all():
-> 1193     raise IntCastingNaNError(
   1194         "Cannot convert non-finite values (NA or inf) to integer"
   1195     )
   1196 return values.astype(dtype, copy=copy)

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

default orient - matrices work, dates fail as they're exported as epoch times, imported as ints

def serde_func(df):
    orient=None
    return pd.read_json(df.to_json(orient=orient), orient=orient)

def non_eq_disp(*dfs):
    for df in dfs:
        display(df)
    for df in dfs:
        display(df.dtypes)

def run_test(df):
    new_df = serde_func(df)
    eq = df.equals(new_df)
    print(f'Equal: {df.equals(new_df)}')
    if not eq:
        non_eq_disp(df, new_df)
        
print('Date test')
df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3],
    }
)
run_test(df)

print('Numeric test')
df = pd.DataFrame([[3, 3], [4, 2]])
run_test(df)

default orient output image

default orient, specify date output format - matrices work, dates interpreted as strings

def serde_func(df):
    orient=None
    return pd.read_json(df.to_json(orient=orient, date_format='iso'), orient=orient)

def non_eq_disp(*dfs):
    for df in dfs:
        display(df)
    for df in dfs:
        display(df.dtypes)

def run_test(df):
    new_df = serde_func(df)
    eq = df.equals(new_df)
    print(f'Equal: {df.equals(new_df)}')
    if not eq:
        non_eq_disp(df, new_df)
        
print('Date test')
df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3],
    }
)
run_test(df)

print('Numeric test')
df = pd.DataFrame([[3, 3], [4, 2]])
run_test(df)

specified date format output image

ektar avatar Feb 03 '22 18:02 ektar

One last note on schema - Schema provides hints to pandas to interpret columns correctly (e.g. dates), also captures table structure in a way the others can't. E.g. indexing:

df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3]
    },
    index=((1, 1), (1, 2), (2, 1))
)
display(df)
json_str=df.to_json()
print(json_str)
display(pd.read_json(df.to_json()))
print()
json_str=df.to_json(orient='table')
print(json_str)
display(pd.read_json(json_str, orient='table'))

output: image

ektar avatar Feb 03 '22 19:02 ektar

As much as I like schemas, orient=table just seems not ready yet, fine to implement just casting dates to iso str and user can parse later - at least will stop current crashing behavior.

If that sounds right, comment and question:

  • Earlier there was a question why do serialize->json load then reverse in serialization... this is just to fit with the overall framework of how json serialization seems to be working in param - all serialization funcs are expected to return a dict, but panda's to_json returns a string. And purpose of using panda's to_json is so that the logic doesn't need to get implemented here. I needed to do reverse in deser before due to table orient with schema. I thought I could just do dataframe load from dict on the return, but it turns out the index gets cast to string instead of ints (tests failed), so needed to do the full reverse round trip to get types correct. This seemed simplest change - thoughts?
  • Current test of df.equals will fail - original dataframe will have datetimes, new dataframe will have string column. Preferences for how to change test? I just updated to do the date cast, and match the timezone of the source.

ektar avatar Feb 03 '22 19:02 ektar

all serialization funcs are expected to return a dict, but panda's to_json returns a string.

It is possible that I would need to run the code to understand the real issue, but this statement isn't quite correct. The idea of the serialize and deserialize methods is to use an intermediate representation of some sort where the output of serialize is JSON serializable. As strings (including JSON strings!) are JSON serializable, I don't think you need to use json.loads or json.dumps for this to work. Essentially, you would be embedding pandas JSON output in the param JSON output which is a little weird, but should work.

If it weren't for potential performance concerns, I would indeed prefer the way you have done it already though!

jlstevens avatar Feb 04 '22 11:02 jlstevens

all serialization funcs are expected to return a dict, but panda's to_json returns a string.

It is possible that I would need to run the code to understand the real issue, but this statement isn't quite correct. The idea of the serialize and deserialize methods is to use an intermediate representation of some sort where the output of serialize is JSON serializable. As strings (including JSON strings!) are JSON serializable, I don't think you need to use json.loads or json.dumps for this to work. Essentially, you would be embedding pandas JSON output in the param JSON output which is a little weird, but should work.

If it weren't for potential performance concerns, I would indeed prefer the way you have done it already though!

I agree it looks weird - I was copying original implementation to have same return type. I think intent is that the final json be standard, not have long strings that are intended to be further processed. This is used by upstream jsonserializer - serializes all components then calls dumps.

ektar avatar Feb 04 '22 13:02 ektar

@jlstevens @philippjfr - Any changes needed? Happy to implement

ektar avatar Feb 10 '22 06:02 ektar

@ektar you mentioned limitations with orient='table' and that it was not yet ready. I've just made a quick attempt at using it and it looks like it's working fine.

import pandas as pd

df = pd.util.testing.makeMixedDataFrame()
out = pd.read_json(df.to_json(orient='table'), orient='table')
df.equals(out)  # True

Output of `df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
A    5 non-null float64
B    5 non-null float64
C    5 non-null object
D    5 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 288.0+ bytes

maximlt avatar May 31 '22 13:05 maximlt

I've thought about this a bit more and I'm fine with the pd.read_json(df.to_json(orient='table'), orient='table') approach which I now think is better than the current approach. Here are my observations now I've managed to clarify my thoughts about this:

  • 'schema' in the pandas JSON should be called dtypes imho - it is not official JSON schema as 'datetime' is not a valid JSON schema type. This 'schema' here is really a specification of dtypes so pandas can deserialize correctly. This is a schema that only exists when a concrete dataframe exists.
  • The Parameter schema generated by param is JSON Schema and it exists at the parameter level, even when there is no concrete dataframe present (e.g.allow_None=True with a value of None).

It is a little confusing in terms of nomenclature, but I think these two things are orthogonal and both have reasons to exist. I have a quibble with the naming used by pandas but what really matters to me is that there is a convenient way to roundtrip the data.

To me, this means that the pd.read_json(df.to_json(orient='table'), orient='table') approach looks good to me in principle. @ektar could you confirm what @maximlt asserts above about the dataframes being equal after round tripping? And if so, what are the remaining problems you see with this approach?

jlstevens avatar May 31 '22 13:05 jlstevens

@ektar, do you intend on completing this?

droumis avatar Jan 16 '23 16:01 droumis

Investigated, seems issue just with the pandas "table" orient - roundtrips work if df created using records({0: [], 1: []...}), fails if df was created just with a numeric array like you show.

For reference, this is the issue with trying to round-trip with Pandas and table='orient' https://github.com/pandas-dev/pandas/issues/46392

maximlt avatar Apr 12 '23 09:04 maximlt

In https://github.com/holoviz/param/pull/600/commits/7e5098c1f1dec7c8f79d27b41cbb1f5f72dd916c I've slightly updated the tests to remove code that was handling timezone conversion. This was failing when I ran it locally (newer Pandas version?), removing that code got the test to pass.

maximlt avatar Apr 12 '23 09:04 maximlt