pycytominer
`annotate` function kills kernel when trying to process a 13.1GB file
This issue is related to issue #233.
I created a group of Python files to:
- Convert the two CellProfiler SQLite outputs into parquet and merge single cells using CytoTable `convert`.
- Merge the two parquet files into one using pandas concat, because the two outputs should be the same file but were split when a power outage stopped the CellProfiler run.
- Annotate the new combined parquet file with Pycytominer `annotate`.
- Perform normalization with Pycytominer `normalize`.
- Perform feature selection with Pycytominer `feature_select`.
When the two parquet files were merged, the new parquet file was 13.1GB.
When I ran the scripts as described above, the kernel was killed while running the `annotate` function.
The function attempted to use about 102GB of memory, while I only have about 49GB available.
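For context, a parquet file is usually compressed on disk, so the same data can take considerably more room once loaded into pandas. Below is a minimal sketch of how the in-memory size of a DataFrame could be checked before calling `annotate`. The helper name `dataframe_memory_gb` and the toy data are made up for illustration and are not part of Pycytominer or of my actual scripts.

```python
import numpy as np
import pandas as pd


def dataframe_memory_gb(df: pd.DataFrame) -> float:
    """Hypothetical helper: in-memory size of a DataFrame in GB (10**9 bytes)."""
    # deep=True also counts the contents of object/string columns
    return df.memory_usage(index=True, deep=True).sum() / 1e9


# Toy stand-in for a single-cell profile table (not the real 13.1GB data)
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.random((1000, 50)),
    columns=[f"Cells_Feature_{i}" for i in range(50)],
)
toy["Metadata_Well"] = "A01"

size_gb = dataframe_memory_gb(toy)
print(f"{size_gb:.6f} GB in memory")
```

Running the same check on the real combined table would show how much of the 49GB is already used before any merge happens.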
After I talked with @axiomcura, he believes the issue might be arising in this part of the `annotate` function:
```python
if isinstance(external_metadata, pd.DataFrame):
    external_metadata.columns = [
        "Metadata_{}".format(x) if not x.startswith("Metadata_") else x
        for x in external_metadata.columns
    ]

    annotated = (
        annotated.merge(
            external_metadata,
            left_on=external_join_left,
            right_on=external_join_right,
            how="left",
        )
        .reset_index(drop=True)
        .drop_duplicates()
    )
```
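One possible reason this part is memory-hungry is that `merge`, `reset_index`, and `drop_duplicates` each return a new DataFrame, so several full-size copies can exist at the same time. That is a guess, not a confirmed diagnosis. As a rough sketch of a workaround, the merge could be done in row chunks so that each annotated chunk can be written out before the next one is built. The function name `iter_annotated_chunks`, the `chunk_size` default, and the toy column names below are hypothetical and not part of Pycytominer. Note that this sketch skips `drop_duplicates`, since dropping duplicates per chunk would not remove duplicates that span chunks.

```python
from typing import Iterator

import pandas as pd


def iter_annotated_chunks(
    profiles: pd.DataFrame,
    external_metadata: pd.DataFrame,
    left_on: str,
    right_on: str,
    chunk_size: int = 100_000,
) -> Iterator[pd.DataFrame]:
    """Hypothetical helper: yield left-merged chunks of `profiles` one at a time."""
    for start in range(0, len(profiles), chunk_size):
        chunk = profiles.iloc[start : start + chunk_size]
        # Only this chunk's merge result is materialized at any moment
        yield chunk.merge(
            external_metadata, left_on=left_on, right_on=right_on, how="left"
        )


# Toy data standing in for the profiles and the external metadata
profiles = pd.DataFrame(
    {
        "Metadata_Well": ["A01", "A02", "A01", "B01"],
        "Cells_AreaShape_Area": [1.0, 2.0, 3.0, 4.0],
    }
)
external = pd.DataFrame(
    {
        "Metadata_well_id": ["A01", "A02"],
        "Metadata_treatment": ["DMSO", "drug"],
    }
)

chunks = list(
    iter_annotated_chunks(
        profiles, external, "Metadata_Well", "Metadata_well_id", chunk_size=2
    )
)

# For comparison only: on small data the chunked result matches a single merge
chunked = pd.concat(chunks, ignore_index=True)
full = profiles.merge(
    external, left_on="Metadata_Well", right_on="Metadata_well_id", how="left"
)
```

In a real run the chunks would not be collected into a list as done here for the comparison. Each one would be saved to disk as it is produced, which is what keeps peak memory down.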
Machine info
OS: Pop!_OS 22.04 LTS
CPU: AMD Ryzen 7 3700X 8-Core Processor
Memory: 64 GB of RAM
@gwaybio @d33bs