
Pandas DataFrames with type == 'object' cannot be saved/restored

Open jfoster17 opened this issue 2 years ago • 0 comments

Describe the bug Pandas DataFrames created within glue and added to the data_collection manager may have columns of type 'object', which means they cannot be saved/restored by glue (glue.core.state._load_numpy calls np.load() without allow_pickle=True). This is generally not a problem when reading files using the Pandas data_factory (which converts columns), but it does cause problems, for instance, for datasets retrieved from external sources within a glue session.

To Reproduce Steps to reproduce the behavior such as:

  1. Create a Pandas DataFrame within glue and add it to the data_collection. For instance, one might use the process described in the documentation:
         from pandas import DataFrame
         # dc is the data_collection of the running glue session
         df1 = DataFrame()
         df1['a'] = [1.2, 3.4, 2.9]
         df1['g'] = ['r', 'q', 's']  # string column; this is the 'object'-typed column
         dc['dataframe'] = df1
  2. Save Session (this new Data object will be stored as a numpy array within the session file since it did not come from an external file)
  3. Restore Session
  4. Get the following error:

ValueError: Object arrays cannot be loaded when allow_pickle=False
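
The same error can be triggered with NumPy alone, which may help isolate the problem from glue. This is a minimal sketch, not glue's actual save/restore code: the in-memory buffer stands in for the session file, and the object array stands in for the 'g' column above.

```python
import io

import numpy as np

# Stand-in for the 'g' column: an array of Python strings with dtype=object.
g_values = np.array(['r', 'q', 's'], dtype=object)

# Stand-in for the session file (assumption: an in-memory buffer is
# used here only to keep the example self-contained).
buffer = io.BytesIO()
np.save(buffer, g_values)  # np.save allows pickling by default, so this succeeds
buffer.seek(0)

try:
    np.load(buffer)  # allow_pickle defaults to False, so loading fails
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Saving succeeds and only loading fails, which matches the behavior described in the steps above: the session is written without complaint and breaks on restore.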

Expected behavior Pandas objects created within glue should not break session files.

We could simply pass allow_pickle=True to np.load(), but perhaps this has undesired side effects?
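
One known side effect is that unpickling can execute arbitrary code, so np.load(..., allow_pickle=True) is unsafe for session files from untrusted sources. A possible alternative, sketched below, is to convert string-only object arrays to a fixed-width unicode dtype before saving, so no pickling is needed. The helper name to_pickle_free is hypothetical and is not part of glue or NumPy; the sketch only handles object arrays whose elements are all str and leaves anything else unchanged.

```python
import io

import numpy as np


def to_pickle_free(values):
    """Hypothetical helper: return a unicode array if every element is a str."""
    arr = np.asarray(values)
    if arr.dtype == object and all(isinstance(v, str) for v in arr.ravel()):
        # astype(str) yields a fixed-width unicode dtype such as '<U1'
        return arr.astype(str)
    return arr  # mixed or non-string object arrays are left as they are


g_values = np.array(['r', 'q', 's'], dtype=object)
converted = to_pickle_free(g_values)

buffer = io.BytesIO()
np.save(buffer, converted, allow_pickle=False)  # would raise if pickling were needed
buffer.seek(0)
restored = np.load(buffer)  # loads with the default allow_pickle=False

print(restored.dtype.kind, restored.tolist())
```

Columns that mix strings with other Python objects would still need either pickling or a different encoding, so this would not cover every 'object' column.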

Details:

  • Operating System: MacOS 12.6
  • Python version: 3.9
  • Glue version: 1.6
  • How you installed glue: conda

Additional context Sample session file attached: pandas_dataframe_session.glu.gz

jfoster17, Oct 11 '22 20:10