arkouda
`read_hdf` for dict has different column order for > 10 columns
I'm not sure if this matters, but `read_hdf` for a dict returns the columns in a different order when there are more than 10 columns. I found this while working on #2602 because `pd.testing.assert_frame_equal` failed:
>>> df_dict = {
...: "c_1": ak.arange(3),
...: "c_2": ak.SegArray(ak.array([0, 9, 14]), ak.arange(-10, 10)),
...: "c_3": ak.arange(3, 6, dtype=ak.uint64),
...: "c_4": ak.SegArray(ak.array([0, 5, 10]), ak.arange(2**63, 2**63 + 15, dtype=ak.uint64)),
...: "c_5": ak.array([False, True, False]),
...: "c_6": ak.SegArray(ak.array([0, 5, 10]), ak.randint(0, 1, 15, dtype=ak.bool)),
...: "c_7": ak.array(np.random.uniform(-50, 50, 3)),
...: "c_8": ak.SegArray(ak.array([0, 9, 14]), ak.array(np.random.uniform(0, 100, 20))),
...: "c_9": ak.array(["abc", "123", "xyz"]),
...: "c_10": ak.SegArray(
...: ak.array([0, 2, 5]), ak.array(["a", "b", "c", "d", "e", "f", "g", "h", "i"])
...: ),
...: "c_11": ak.SegArray(
...: ak.array([0, 2, 2]), ak.array(["a", "b", "c", "d", "e", "f", "g", "h", "i"])
...: ),
...: "c_12": ak.SegArray(
...: ak.array([0, 0, 2]), ak.array(["a", "b", "c", "d", "e", "f", "g", "h", "i"])
...: ),
...: "c_13": ak.SegArray(
...: ak.array([0, 5, 8]), ak.array(["a", "b", "c", "d", "e", "f", "g", "h", "i"])
...: ),
...: "c_14": ak.SegArray(
...: ak.array([0, 5, 8]), ak.array(["abc", "123", "xyz", "l", "m", "n", "o", "p", "arkouda"])
...: ),
...: }
>>> akdf = ak.DataFrame(df_dict)
>>> akdf.to_hdf("multi_col_hdf")
>>> rd_data = ak.read_hdf("multi_col_hdf*")
>>> rd_df = ak.DataFrame(rd_data)
>>> print(akdf)
DataFrame(['c_1', 'c_2', 'c_3', 'c_4', 'c_5', 'c_6', 'c_7', 'c_8', 'c_9', 'c_10', 'c_11', 'c_12', 'c_13', 'c_14'], 3 rows, 87 B)
>>> print(rd_df)
DataFrame(['c_1', 'c_10', 'c_11', 'c_12', 'c_13', 'c_14', 'c_2', 'c_3', 'c_4', 'c_5', 'c_6', 'c_7', 'c_8', 'c_9'], 3 rows, 87 B)
>>> pd.testing.assert_frame_equal(akdf.to_pandas(), rd_df.to_pandas())
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[13], line 1
----> 1 pd.testing.assert_frame_equal(akdf.to_pandas(), rd_df.to_pandas())
[... skipping hidden 2 frame]
File /opt/homebrew/Caskroom/miniforge/base/envs/arkouda-dev/lib/python3.10/site-packages/pandas/_libs/testing.pyx:52, in pandas._libs.testing.assert_almost_equal()
File /opt/homebrew/Caskroom/miniforge/base/envs/arkouda-dev/lib/python3.10/site-packages/pandas/_libs/testing.pyx:167, in pandas._libs.testing.assert_almost_equal()
File /opt/homebrew/Caskroom/miniforge/base/envs/arkouda-dev/lib/python3.10/site-packages/pandas/_testing/asserters.py:679, in raise_assert_detail(obj, message, left, right, diff, index_values)
676 if diff is not None:
677 msg += f"\n[diff]: {diff}"
--> 679 raise AssertionError(msg)
AssertionError: DataFrame.columns are different
DataFrame.columns values are different (92.85714 %)
[left]: Index(['c_1', 'c_2', 'c_3', 'c_4', 'c_5', 'c_6', 'c_7', 'c_8', 'c_9', 'c_10',
'c_11', 'c_12', 'c_13', 'c_14'],
dtype='object')
[right]: Index(['c_1', 'c_10', 'c_11', 'c_12', 'c_13', 'c_14', 'c_2', 'c_3', 'c_4',
'c_5', 'c_6', 'c_7', 'c_8', 'c_9'],
dtype='object')
It's also worth noting that this is easy to work around if the column order is something we don't care about: I just have to do `rd_df = rd_df[akdf.columns]`.
This is extremely low priority, but it would be worth fixing at some point when other tasking has slowed down.
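
For reference, the read-back order shown above is exactly what a plain string sort of the column names produces (`c_10` sorts before `c_2`). Whether a sort is really what happens inside `read_hdf` is an assumption on my part, not something I have confirmed. The sketch below uses only plain Python, with a dict of placeholders standing in for the result of `ak.read_hdf`, and `reorder` is a hypothetical helper that mirrors the `rd_df[akdf.columns]` workaround:

```python
# Column names in the order the DataFrame was originally built.
original = [f"c_{i}" for i in range(1, 15)]

# Order reported after the to_hdf / read_hdf round trip (copied from the output above).
read_back = ["c_1", "c_10", "c_11", "c_12", "c_13", "c_14",
             "c_2", "c_3", "c_4", "c_5", "c_6", "c_7", "c_8", "c_9"]

# A plain lexicographic sort of the names reproduces the read-back order.
matches_sorted = sorted(original) == read_back


def reorder(columns, wanted):
    """Return a new dict whose keys follow `wanted` (hypothetical helper)."""
    return {name: columns[name] for name in wanted}


# Stand-in for the dict returned by ak.read_hdf; the values are placeholders.
rd_data = {name: None for name in read_back}
restored = reorder(rd_data, original)

print(matches_sorted, list(restored) == original)
```

With nine or fewer columns named this way, the sorted order and the insertion order coincide, which would explain why the difference only shows up once `c_10` exists.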