Table.to_pandas appears to maintain references to hail objects
What happened?
Hello, I have some GWAS results that I have converted into a pandas dataframe. I typically pickle my dataframes for speed and to easily preserve data types. Within All of Us we have separate analysis environments depending on whether we're using hail; the environment without hail is much cheaper for simple analyses and does not have pyspark installed. As you can see in the error below, when I try to re-read the pickled dataframe I get an error that pyspark cannot be found from within a hail module. If I write the dataframe to CSV, read it back in, and then pickle it, the error goes away. This suggests to me that the dataframe created by hail maintains references to hail objects and that pandas attempts to recreate these objects when unpickling. I suspect this is not intentional.
# Hail environment
vat_simplified_file = os.path.join(bucket, 'vat.ht')
gwas = hl.read_table(gwas_results_file_no_sex_chr)
vat = hl.read_table(vat_simplified_file)
gwas = gwas.filter(gwas.p_value <= 1e-4)
combined = gwas.join(vat, how='left')
combined_pandas = combined.to_pandas()
gwas_pandas_file = os.path.join(bucket, 'gwas_results.pkl')
combined_pandas.to_pickle(gwas_pandas_file)
# Non hail environment without pyspark
combined_pandas = pd.read_pickle(gwas_pandas_file)
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/pandas/io/pickle.py in read_pickle(filepath_or_buffer, compression, storage_options)
216 # expected "IO[bytes]"
--> 217 return pickle.load(handles.handle) # type: ignore[arg-type]
218 except excs_to_catch:
/opt/conda/lib/python3.7/site-packages/hail/__init__.py in <module>
32 # E402 module level import not at top of file
---> 33 from .table import Table, GroupedTable, asc, desc # noqa: E402
34 from .matrixtable import MatrixTable, GroupedMatrixTable # noqa: E402
/opt/conda/lib/python3.7/site-packages/hail/table.py in <module>
4 import numpy as np
----> 5 import pyspark
6 from typing import Optional, Dict, Callable, Sequence
ModuleNotFoundError: No module named 'pyspark'
During handling of the above exception, another exception occurred:
ModuleNotFoundError Traceback (most recent call last)
/tmp/ipykernel_233/4275665471.py in <module>
----> 1 combined_pandas = pd.read_pickle(gwas_pandas_file)
/opt/conda/lib/python3.7/site-packages/pandas/io/pickle.py in read_pickle(filepath_or_buffer, compression, storage_options)
220 # "No module named 'pandas.core.sparse.series'"
221 # "Can't get attribute '__nat_unpickle' on <module 'pandas._libs.tslib"
--> 222 return pc.load(handles.handle, encoding=None)
223 except UnicodeDecodeError:
224 # e.g. can occur for files written in py27; see GH#28645 and GH#31988
/opt/conda/lib/python3.7/site-packages/pandas/compat/pickle_compat.py in load(fh, encoding, is_verbose)
272 up.is_verbose = is_verbose
273
--> 274 return up.load()
275 except (ValueError, TypeError):
276 raise
/opt/conda/lib/python3.7/pickle.py in load(self)
1086 raise EOFError
1087 assert isinstance(key, bytes_types)
-> 1088 dispatch[key[0]](self)
1089 except _Stop as stopinst:
1090 return stopinst.value
/opt/conda/lib/python3.7/pickle.py in load_stack_global(self)
1383 if type(name) is not str or type(module) is not str:
1384 raise UnpicklingError("STACK_GLOBAL requires str")
-> 1385 self.append(self.find_class(module, name))
1386 dispatch[STACK_GLOBAL[0]] = load_stack_global
1387
/opt/conda/lib/python3.7/site-packages/pandas/compat/pickle_compat.py in find_class(self, module, name)
204 key = (module, name)
205 module, name = _class_locations_map.get(key, key)
--> 206 return super().find_class(module, name)
207
208
/opt/conda/lib/python3.7/pickle.py in find_class(self, module, name)
1424 elif module in _compat_pickle.IMPORT_MAPPING:
1425 module = _compat_pickle.IMPORT_MAPPING[module]
-> 1426 __import__(module, level=0)
1427 if self.proto >= 4:
1428 return _getattribute(sys.modules[module], name)[0]
/opt/conda/lib/python3.7/site-packages/hail/__init__.py in <module>
31 # F401 '.expr.*' imported but unused
32 # E402 module level import not at top of file
---> 33 from .table import Table, GroupedTable, asc, desc # noqa: E402
34 from .matrixtable import MatrixTable, GroupedMatrixTable # noqa: E402
35 from .expr import * # noqa: F401,F403,E402
/opt/conda/lib/python3.7/site-packages/hail/table.py in <module>
3 import pandas
4 import numpy as np
----> 5 import pyspark
6 from typing import Optional, Dict, Callable, Sequence
7
ModuleNotFoundError: No module named 'pyspark'
Thanks, Andrew
Version
0.2.107-2387bb00ceee
Relevant log output
No response
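The CSV round-trip workaround described above can be sketched without hail installed. FakeLocus here is a hypothetical stand-in that merely plays the role of a hail object whose class lives outside pandas:

```python
import io
import pickle
import pandas as pd

class FakeLocus:
    """Hypothetical stand-in for a hail object whose class lives outside pandas."""
    def __init__(self, contig, position):
        self.contig, self.position = contig, position
    def __str__(self):
        return f"{self.contig}:{self.position}"

df = pd.DataFrame({"locus": [FakeLocus("chr1", 123)], "p_value": [1e-5]})

# to_csv flattens every cell to its str() form, so the re-read dataframe
# holds plain strings and its pickle references no third-party classes.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
clean = pd.read_csv(buf)
payload = pickle.dumps(clean)
```

Unpickling `payload` then needs only pandas, at the cost of losing the original types (the locus column comes back as a string).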
Adding a simple reproducible example.
ht = hl.Table.from_pandas(pd.DataFrame({"variant":['chr1:123:C:T']}))
ht = ht.key_by(**hl.parse_variant(ht.variant))
pd_table = ht.to_pandas()
pd_table.to_pickle(os.path.join(bucket, 'test.pkl'))
The two examples below do not cause the same error.
ht = hl.Table.from_pandas(pd.DataFrame({"foo":['bar']}))
ht = hl.Table.from_pandas(pd.DataFrame({"foo":[1, 2, 3]}))
Hope this helps.
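One way to check, without unpickling, whether a given .pkl file will try to import hail is to scan its opcode stream with the standard-library pickletools module. This helper is illustrative, not part of hail or pandas, and its handling of STACK_GLOBAL is a rough heuristic (it assumes the module name is among the two most recently pushed strings, which holds for straightforwardly written pickles but not for every memo-heavy stream):

```python
import pickle
import pickletools

def modules_referenced(payload):
    """Return the set of modules a pickle will import at load time, by
    scanning its GLOBAL / STACK_GLOBAL opcodes."""
    modules = set()
    strings = []
    for opcode, arg, _pos in pickletools.genops(payload):
        if opcode.name in ("SHORT_BINUNICODE", "BINUNICODE", "UNICODE"):
            strings.append(arg)
        elif opcode.name == "STACK_GLOBAL" and len(strings) >= 2:
            modules.add(strings[-2])
        elif opcode.name == "GLOBAL":
            modules.add(arg.split(" ")[0])
    return modules

# A plain list pickles without touching any module...
print(modules_referenced(pickle.dumps([1, 2, 3])))  # → set()
# ...but an OrderedDict pickle must import collections to rebuild it,
# just as a pickle of hail-typed cells must import hail (and hence pyspark).
import collections
print(modules_referenced(pickle.dumps(collections.OrderedDict(a=1))))
```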
This suggests to me that the dataframe created by hail maintains reference to hail objects and pandas is attempting to recreate these objects when unpickling. I suspect this is not intentional.
Hi @anh151, you are correct that to_pandas creates dataframes that contain hail objects. In your example, the hail type in question is Locus, but we also have a couple of auxiliary classes, like Interval and Call, that could end up in the pandas table. I see how this can be unintuitive, especially given your point about round-tripping through CSV (which uses the str of the object by default and thus avoids the class lookup on read), but I hesitate to call it unintentional. I'll raise the question with the team as to what the least confusing behavior would be, but I suspect many users use to_pandas results in the same hail session, in which case getting small hail objects in the result might be expected. In the meantime, you can inspect the schema of your resultant table and translate to your desired representation within pandas, i.e. convert Locus entries to dicts.
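A minimal sketch of that translation, assuming a Locus-like object exposing contig and position attributes; FakeLocus is a hypothetical stand-in so the snippet runs without hail:

```python
import pickle
import pandas as pd

class FakeLocus:
    """Hypothetical stand-in for hail's Locus (.contig and .position)."""
    def __init__(self, contig, position):
        self.contig, self.position = contig, position

df = pd.DataFrame({"locus": [FakeLocus("chr1", 123)], "p_value": [1e-5]})

# Translate the hail-typed column into plain dicts before pickling, so the
# pickle no longer references any class outside pandas/builtins.
df["locus"] = df["locus"].map(lambda l: {"contig": l.contig, "position": l.position})
payload = pickle.dumps(df)
```

The resulting pickle can then be loaded in the cheaper environment, and the dicts can be turned back into hail objects later in a hail session if needed.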
Sounds good. Thanks for the feedback.
If it would require significant effort to remove this behavior from the code base, or if the change is not desired, it may be worth adding a warning/info section describing this behavior to the documentation.