`read_hdf` for dict has different column order for > 10 columns

Open stress-tess opened this issue 1 year ago • 2 comments

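I'm not sure if this matters, but `read_hdf` returns the columns in a different order when there are more than 10 of them (the example below has 14 columns and 3 rows): they come back sorted lexicographically by name, so `c_10` lands right after `c_1`. I found this while working on #2602 because `pd.testing.assert_frame_equal` failed.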

>>> df_dict = {
   ...:         "c_1": ak.arange(3),
   ...:         "c_2": ak.SegArray(ak.array([0, 9, 14]), ak.arange(-10, 10)),
   ...:         "c_3": ak.arange(3, 6, dtype=ak.uint64),
   ...:         "c_4": ak.SegArray(ak.array([0, 5, 10]), ak.arange(2**63, 2**63 + 15, dtype=ak.uint64)),
   ...:         "c_5": ak.array([False, True, False]),
   ...:         "c_6": ak.SegArray(ak.array([0, 5, 10]), ak.randint(0, 1, 15, dtype=ak.bool)),
   ...:         "c_7": ak.array(np.random.uniform(-50, 50, 3)),
   ...:         "c_8": ak.SegArray(ak.array([0, 9, 14]), ak.array(np.random.uniform(0, 100, 20))),
   ...:         "c_9": ak.array(["abc", "123", "xyz"]),
   ...:         "c_10": ak.SegArray(
   ...:             ak.array([0, 2, 5]), ak.array(["a", "b", "c", "d", "e", "f", "g", "h", "i"])
   ...:         ),
   ...:         "c_11": ak.SegArray(
   ...:             ak.array([0, 2, 2]), ak.array(["a", "b", "c", "d", "e", "f", "g", "h", "i"])
   ...:         ),
   ...:         "c_12": ak.SegArray(
   ...:             ak.array([0, 0, 2]), ak.array(["a", "b", "c", "d", "e", "f", "g", "h", "i"])
   ...:         ),
   ...:         "c_13": ak.SegArray(
   ...:             ak.array([0, 5, 8]), ak.array(["a", "b", "c", "d", "e", "f", "g", "h", "i"])
   ...:         ),
   ...:         "c_14": ak.SegArray(
   ...:             ak.array([0, 5, 8]), ak.array(["abc", "123", "xyz", "l", "m", "n", "o", "p", "arkouda"])
   ...:         ),
   ...:     }

>>> akdf = ak.DataFrame(df_dict)

>>> akdf.to_hdf("multi_col_hdf")

>>> rd_data = ak.read_hdf("multi_col_hdf*")

>>> rd_df = ak.DataFrame(rd_data)

>>> print(akdf)
DataFrame(['c_1', 'c_2', 'c_3', 'c_4', 'c_5', 'c_6', 'c_7', 'c_8', 'c_9', 'c_10', 'c_11', 'c_12', 'c_13', 'c_14'], 3 rows, 87 B)

>>> print(rd_df)
DataFrame(['c_1', 'c_10', 'c_11', 'c_12', 'c_13', 'c_14', 'c_2', 'c_3', 'c_4', 'c_5', 'c_6', 'c_7', 'c_8', 'c_9'], 3 rows, 87 B)

>>> pd.testing.assert_frame_equal(akdf.to_pandas(), rd_df.to_pandas())
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[13], line 1
----> 1 pd.testing.assert_frame_equal(akdf.to_pandas(), rd_df.to_pandas())

    [... skipping hidden 2 frame]

File /opt/homebrew/Caskroom/miniforge/base/envs/arkouda-dev/lib/python3.10/site-packages/pandas/_libs/testing.pyx:52, in pandas._libs.testing.assert_almost_equal()

File /opt/homebrew/Caskroom/miniforge/base/envs/arkouda-dev/lib/python3.10/site-packages/pandas/_libs/testing.pyx:167, in pandas._libs.testing.assert_almost_equal()

File /opt/homebrew/Caskroom/miniforge/base/envs/arkouda-dev/lib/python3.10/site-packages/pandas/_testing/asserters.py:679, in raise_assert_detail(obj, message, left, right, diff, index_values)
    676 if diff is not None:
    677     msg += f"\n[diff]: {diff}"
--> 679 raise AssertionError(msg)

AssertionError: DataFrame.columns are different

DataFrame.columns values are different (92.85714 %)
[left]:  Index(['c_1', 'c_2', 'c_3', 'c_4', 'c_5', 'c_6', 'c_7', 'c_8', 'c_9', 'c_10',
       'c_11', 'c_12', 'c_13', 'c_14'],
      dtype='object')
[right]: Index(['c_1', 'c_10', 'c_11', 'c_12', 'c_13', 'c_14', 'c_2', 'c_3', 'c_4',
       'c_5', 'c_6', 'c_7', 'c_8', 'c_9'],
      dtype='object')
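
The order on the right-hand side looks like a plain string sort of the column names. A minimal sketch to check that, using only the names printed above; it does not touch Arkouda or HDF5, so it says nothing about where in the read path the sorting happens:

```python
# Column names in the order they were written (c_1 ... c_14).
written = [f"c_{i}" for i in range(1, 15)]

# Order reported for rd_df in the output above.
observed = ["c_1", "c_10", "c_11", "c_12", "c_13", "c_14",
            "c_2", "c_3", "c_4", "c_5", "c_6", "c_7", "c_8", "c_9"]

# Plain string sorting compares character by character, so "c_10" < "c_2".
lexicographic = sorted(written)
matches_observed = lexicographic == observed

# Count the positions where the written and observed orders disagree.
mismatches = sum(a != b for a, b in zip(written, observed))
mismatch_pct = 100 * mismatches / len(written)

print(matches_observed)
print(mismatches, round(mismatch_pct, 5))
```

Only `c_1` stays in place; the other 13 of 14 positions differ, which is the 92.85714 % reported in the assertion message.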

stress-tess, Jul 24 '23 23:07

It's also worth noting that this is pretty easy to work around if this is something we don't care about: I just have to do `rd_df = rd_df[akdf.columns]`.
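
For the dict that comes back before it is wrapped in a DataFrame, the same idea can be sketched in plain Python. The helper name `restore_column_order` is hypothetical, and ordinary dicts of lists stand in for the Arkouda objects (an assumption, so no server is needed to run it):

```python
def restore_column_order(read_back, original_columns):
    """Return a new dict with the keys of read_back in original_columns order.

    Hypothetical helper: read_back stands in for the dict returned by a read,
    original_columns for the column list saved before writing.
    """
    missing = [name for name in original_columns if name not in read_back]
    if missing:
        raise KeyError(f"columns missing from read data: {missing}")
    # Dicts preserve insertion order, so rebuilding in the saved order is enough.
    return {name: read_back[name] for name in original_columns}


# Stand-in data: 14 columns written in the order c_1 ... c_14.
written = {f"c_{i}": [i, i + 1, i + 2] for i in range(1, 15)}

# Simulate the read returning the columns sorted by name.
read_back = {name: written[name] for name in sorted(written)}

restored = restore_column_order(read_back, list(written))
print(list(restored) == list(written))
```

This relies on the original column list still being available at read time; it does not recover the order from the file itself.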

stress-tess, Jul 24 '23 23:07

This is an extremely low priority, but would be worth fixing at some point when other tasking has slowed down.

Ethan-DeBandi99, Jul 25 '23 17:07