Table.to_pandas appears to maintain references to hail objects
What happened?
Hello, I have some GWAS results that I have converted into a pandas dataframe. I typically pickle my dataframes for speed and to easily preserve data types. Within All of Us we have separate analysis environments depending on whether we're using hail; the environment without hail is much cheaper for simple analyses and does not have pyspark installed. As you can see in the error below, when I try to re-read the pickled dataframe I get an error that pyspark cannot be found from within a hail module. If I write the dataframe to CSV, read it back in, and then pickle it, the error goes away. This suggests to me that the dataframe created by hail maintains references to hail objects and that pandas attempts to recreate these objects when unpickling. I suspect this is not intentional.
# Hail environment
vat_simplified_file = os.path.join(bucket, 'vat.ht')
gwas = hl.read_table(gwas_results_file_no_sex_chr)
vat = hl.read_table(vat_simplified_file)
gwas = gwas.filter(gwas.p_value <= 1e-4)
combined = gwas.join(vat, how='left')
combined_pandas = combined.to_pandas()
gwas_pandas_file = os.path.join(bucket, 'gwas_results.pkl')
combined_pandas.to_pickle(gwas_pandas_file)
# Non hail environment without pyspark
combined_pandas = pd.read_pickle(gwas_pandas_file)
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/pandas/io/pickle.py in read_pickle(filepath_or_buffer, compression, storage_options)
216 # expected "IO[bytes]"
--> 217 return pickle.load(handles.handle) # type: ignore[arg-type]
218 except excs_to_catch:
/opt/conda/lib/python3.7/site-packages/hail/__init__.py in <module>
32 # E402 module level import not at top of file
---> 33 from .table import Table, GroupedTable, asc, desc # noqa: E402
34 from .matrixtable import MatrixTable, GroupedMatrixTable # noqa: E402
/opt/conda/lib/python3.7/site-packages/hail/table.py in <module>
4 import numpy as np
----> 5 import pyspark
6 from typing import Optional, Dict, Callable, Sequence
ModuleNotFoundError: No module named 'pyspark'
During handling of the above exception, another exception occurred:
ModuleNotFoundError Traceback (most recent call last)
/tmp/ipykernel_233/4275665471.py in <module>
----> 1 combined_pandas = pd.read_pickle(gwas_pandas_file)
/opt/conda/lib/python3.7/site-packages/pandas/io/pickle.py in read_pickle(filepath_or_buffer, compression, storage_options)
220 # "No module named 'pandas.core.sparse.series'"
221 # "Can't get attribute '__nat_unpickle' on <module 'pandas._libs.tslib"
--> 222 return pc.load(handles.handle, encoding=None)
223 except UnicodeDecodeError:
224 # e.g. can occur for files written in py27; see GH#28645 and GH#31988
/opt/conda/lib/python3.7/site-packages/pandas/compat/pickle_compat.py in load(fh, encoding, is_verbose)
272 up.is_verbose = is_verbose
273
--> 274 return up.load()
275 except (ValueError, TypeError):
276 raise
/opt/conda/lib/python3.7/pickle.py in load(self)
1086 raise EOFError
1087 assert isinstance(key, bytes_types)
-> 1088 dispatch[key[0]](self)
1089 except _Stop as stopinst:
1090 return stopinst.value
/opt/conda/lib/python3.7/pickle.py in load_stack_global(self)
1383 if type(name) is not str or type(module) is not str:
1384 raise UnpicklingError("STACK_GLOBAL requires str")
-> 1385 self.append(self.find_class(module, name))
1386 dispatch[STACK_GLOBAL[0]] = load_stack_global
1387
/opt/conda/lib/python3.7/site-packages/pandas/compat/pickle_compat.py in find_class(self, module, name)
204 key = (module, name)
205 module, name = _class_locations_map.get(key, key)
--> 206 return super().find_class(module, name)
207
208
/opt/conda/lib/python3.7/pickle.py in find_class(self, module, name)
1424 elif module in _compat_pickle.IMPORT_MAPPING:
1425 module = _compat_pickle.IMPORT_MAPPING[module]
-> 1426 __import__(module, level=0)
1427 if self.proto >= 4:
1428 return _getattribute(sys.modules[module], name)[0]
/opt/conda/lib/python3.7/site-packages/hail/__init__.py in <module>
31 # F401 '.expr.*' imported but unused
32 # E402 module level import not at top of file
---> 33 from .table import Table, GroupedTable, asc, desc # noqa: E402
34 from .matrixtable import MatrixTable, GroupedMatrixTable # noqa: E402
35 from .expr import * # noqa: F401,F403,E402
/opt/conda/lib/python3.7/site-packages/hail/table.py in <module>
3 import pandas
4 import numpy as np
----> 5 import pyspark
6 from typing import Optional, Dict, Callable, Sequence
7
ModuleNotFoundError: No module named 'pyspark'
Thanks, Andrew
Version
0.2.107-2387bb00ceee
Relevant log output
No response
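The CSV round-trip workaround described above can be sketched without hail installed. FakeLocus here is a hypothetical stand-in that merely plays the role of a hail object whose class lives outside pandas:

```python
import io
import pickle
import pandas as pd

class FakeLocus:
    """Hypothetical stand-in for a hail object whose class lives outside pandas."""
    def __init__(self, contig, position):
        self.contig, self.position = contig, position
    def __str__(self):
        return f"{self.contig}:{self.position}"

df = pd.DataFrame({"locus": [FakeLocus("chr1", 123)], "p_value": [1e-5]})

# to_csv flattens every cell to its str() form, so the re-read dataframe
# holds plain strings and its pickle references no third-party classes.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
clean = pd.read_csv(buf)
payload = pickle.dumps(clean)
```

Unpickling `payload` then needs only pandas, at the cost of losing the original types (the locus column comes back as a string).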
Adding a simple reproducible example.
ht = hl.Table.from_pandas(pd.DataFrame({"variant":['chr1:123:C:T']}))
ht = ht.key_by(**hl.parse_variant(ht.variant))
pd_table = ht.to_pandas()
pd_table.to_pickle(os.path.join(bucket, 'test.pkl'))
The two examples below do not cause the same error.
ht = hl.Table.from_pandas(pd.DataFrame({"foo":['bar']}))
ht = hl.Table.from_pandas(pd.DataFrame({"foo":[1, 2, 3]}))
Hope this helps.
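One way to check, without unpickling, whether a given .pkl file will try to import hail is to scan its opcode stream with the standard-library pickletools module. This helper is illustrative, not part of hail or pandas, and its handling of STACK_GLOBAL is a rough heuristic (it assumes the module name is among the two most recently pushed strings, which holds for straightforwardly written pickles but not for every memo-heavy stream):

```python
import pickle
import pickletools

def modules_referenced(payload):
    """Return the set of modules a pickle will import at load time, by
    scanning its GLOBAL / STACK_GLOBAL opcodes."""
    modules = set()
    strings = []
    for opcode, arg, _pos in pickletools.genops(payload):
        if opcode.name in ("SHORT_BINUNICODE", "BINUNICODE", "UNICODE"):
            strings.append(arg)
        elif opcode.name == "STACK_GLOBAL" and len(strings) >= 2:
            modules.add(strings[-2])
        elif opcode.name == "GLOBAL":
            modules.add(arg.split(" ")[0])
    return modules

# A plain list pickles without touching any module...
print(modules_referenced(pickle.dumps([1, 2, 3])))  # → set()
# ...but an OrderedDict pickle must import collections to rebuild it,
# just as a pickle of hail-typed cells must import hail (and hence pyspark).
import collections
print(modules_referenced(pickle.dumps(collections.OrderedDict(a=1))))
```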
This suggests to me that the dataframe created by hail maintains reference to hail objects and pandas is attempting to recreate these objects when unpickling. I suspect this is not intentional.
Hi @anh151, you are correct that to_pandas creates dataframes that contain hail objects. In your example, the hail type in question is Locus, but we also have a couple of auxiliary classes, like Interval and Call, that could end up in the pandas table. I see how this can be unintuitive, especially given your point about round-tripping through CSV (which uses the str of the object by default and thus avoids the class lookup on read), but I hesitate to call it unintentional. I'll raise the question with the team as to what the least confusing behavior would be, but I suspect many users use to_pandas results in the same hail session, in which case getting small hail objects in the result might be expected. In the meantime, you can inspect the schema of your resultant table and translate to your desired representation within pandas, i.e. convert Locus entries to dicts.
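A minimal sketch of that translation, assuming a Locus-like object exposing contig and position attributes; FakeLocus is a hypothetical stand-in so the snippet runs without hail:

```python
import pickle
import pandas as pd

class FakeLocus:
    """Hypothetical stand-in for hail's Locus (.contig and .position)."""
    def __init__(self, contig, position):
        self.contig, self.position = contig, position

df = pd.DataFrame({"locus": [FakeLocus("chr1", 123)], "p_value": [1e-5]})

# Translate the hail-typed column into plain dicts before pickling, so the
# pickle no longer references any class outside pandas/builtins.
df["locus"] = df["locus"].map(lambda l: {"contig": l.contig, "position": l.position})
payload = pickle.dumps(df)
```

The resulting pickle can then be loaded in the cheaper environment, and the dicts can be turned back into hail objects later in a hail session if needed.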
Sounds good. Thanks for the feedback.
If it would require significant effort to remove this behavior from the code base, or if the change is not desired, it may be worth adding a warning/info section describing this behavior to the documentation.