load_dataframe_from_json fails on MultiIndex

Open janosh opened this issue 3 years ago • 0 comments

I just noticed there's a problem with load_dataframe_from_json when trying to load multi-index dataframes.

from matminer.utils.io import load_dataframe_from_json, store_dataframe_as_json
import numpy as np
import pandas as pd


arr = np.arange(20).reshape(5, 4)

df = pd.DataFrame(arr, columns=list("abcd"))


store_dataframe_as_json(df, "df.json")
df = load_dataframe_from_json("df.json")
# all good here


df = pd.DataFrame(arr, columns=list("abcd")).set_index(["a", "b"])


store_dataframe_as_json(df, "df.json")
df = load_dataframe_from_json("df.json")
# raises ValueError: Shape of passed values is (5, 2), indices imply (2, 2)

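That's because pandas doesn't read a list of lists as one index tuple per row. It treats each inner list as the values of one index level, so five two-element lists imply an index of length 2. To get a row-wise multi-index, you have to create a MultiIndex object first and pass that in: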

idx = [[i, i + 1] for i in range(5)]
pd.DataFrame(arr, columns=list("abcd"), index=idx)
# raises ValueError: Shape of passed values is (5, 4), indices imply (2, 4)


idx = pd.MultiIndex.from_tuples(((i, i + 1) for i in range(5)))
pd.DataFrame(arr, columns=list("abcd"), index=idx)
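
To make the difference concrete, here is a small sketch comparing the two readings of the same list of lists. It assumes that pandas handles a list of lists passed as `index` the same way `pd.MultiIndex.from_arrays` does, which is consistent with the "indices imply (2, 4)" message above. The names `per_level` and `per_row` are just illustrative.

```python
import pandas as pd

idx = [[i, i + 1] for i in range(5)]

# Each inner list read as one index level: 5 levels, only 2 rows
per_level = pd.MultiIndex.from_arrays(idx)

# Each inner list read as one row label: 2 levels, 5 rows
per_row = pd.MultiIndex.from_tuples([tuple(pair) for pair in idx])

print(per_level.nlevels, len(per_level))  # 5 2
print(per_row.nlevels, len(per_row))      # 2 5
```

Only the second reading matches the 5 rows in `arr`.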

So one possible fix would be

    if isinstance(dataframe_data, dict):
        if set(dataframe_data.keys()) == {"data", "columns", "index"}:
+           if dataframe_data['index'] and isinstance(dataframe_data['index'][0], list):
+               dataframe_data['index'] = pandas.MultiIndex.from_tuples(dataframe_data['index'])
            return pandas.DataFrame(**dataframe_data)
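
For anyone who wants to try the idea without matminer installed, here is a minimal standalone sketch. It assumes the loaded JSON is a dict with "data", "columns" and "index" keys, as the check above suggests. `frame_from_split_dict` and `payload` are hypothetical names for illustration, not part of matminer, and the payload is written by hand to mimic what the multi-index example would look like after a JSON round trip.

```python
import pandas as pd


def frame_from_split_dict(payload):
    """Hypothetical helper: build a DataFrame from a {"data", "columns", "index"} dict.

    If the index entries are lists (what JSON turns tuples into), they are
    converted to a MultiIndex before being handed to the DataFrame constructor.
    """
    index = payload["index"]
    if len(index) > 0 and isinstance(index[0], list):
        # Convert each row label explicitly to a tuple, then build the MultiIndex
        multi = pd.MultiIndex.from_tuples([tuple(row) for row in index])
        # Copy the dict so the caller's payload is left untouched
        payload = {**payload, "index": multi}
    return pd.DataFrame(**payload)


# Hand-written stand-in for the JSON contents of the failing example above
payload = {
    "columns": ["c", "d"],
    "index": [[0, 1], [4, 5], [8, 9], [12, 13], [16, 17]],
    "data": [[2, 3], [6, 7], [10, 11], [14, 15], [18, 19]],
}

restored = frame_from_split_dict(payload)
print(restored)
```

Note that the level names ("a", "b") are not part of this payload, so the restored index is unnamed. Whether the stored JSON keeps them is something I haven't checked.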

janosh, Jul 22 '21 14:07