
`annotate` function kills kernel when trying to process a 13.1GB file

Open jenna-tomkinson opened this issue 1 year ago • 2 comments

This issue is related to issue #233.

I created a group of Python files to:

  1. Convert the two CellProfiler SQLite outputs into parquet and merge single cells using CytoTable convert.
  2. Merge the two parquet files into one using pandas concat. The two outputs should have been a single file, but a power outage interrupted the CellProfiler run and split it in two.
  3. Annotate the new combined parquet file with Pycytominer annotate.
  4. Perform normalization with Pycytominer normalize.
  5. Perform feature selection with Pycytominer feature_select.
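
For reference, step 2 can be sketched roughly as below. This is a minimal illustration and not the actual script: the two small frames stand in for the two parquet files, and the column names are made up for the example.

```python
import pandas as pd

# Stand-ins for the two converted outputs. With real files these would be
# loaded with pd.read_parquet(...) instead (paths not shown here).
first_run = pd.DataFrame(
    {"Metadata_Well": ["A01", "A02"], "Cells_AreaShape_Area": [120.0, 98.5]}
)
second_run = pd.DataFrame(
    {"Metadata_Well": ["A03", "A04"], "Cells_AreaShape_Area": [101.2, 143.7]}
)

# Stack the rows of both runs. ignore_index=True rebuilds a clean 0..n-1
# index so the second run's row labels do not collide with the first.
combined = pd.concat([first_run, second_run], ignore_index=True)

print(combined.shape)
```

Note that pd.concat holds both inputs and the combined result in memory at the same time, so this step alone already needs more RAM than either input frame.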

After the two parquet files were merged, the new combined parquet file is 13.1GB on disk.

When I ran the scripts as described above, the kernel was killed during the call to the annotate function.

Based on the memory usage I observed, the function attempted to use about 102GB, while I only have about 49GB.
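
As a side note, the on-disk size of a parquet file is usually smaller than the size of the same data once loaded into pandas, since parquet is compressed and string columns are expensive in memory. A rough way to check the in-memory size is sketched below. The helper name memory_gb and the example columns are made up for illustration.

```python
import pandas as pd


def memory_gb(df: pd.DataFrame) -> float:
    """Return the approximate in-memory size of df in gigabytes (1e9 bytes)."""
    # deep=True also counts the Python string objects held in object columns,
    # which the default shallow estimate leaves out.
    return df.memory_usage(deep=True).sum() / 1e9


# Tiny example frame; a real single-cell table would be far larger.
df = pd.DataFrame(
    {"Metadata_Well": ["A01"] * 1000, "Cells_AreaShape_Area": [1.5] * 1000}
)

print(f"{memory_gb(df):.6f} GB in memory")
```

Running this on the frame right before annotate would show how much of the 49GB is already taken before the merge starts.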

I talked with @axiomcura, and he believes the issue might be arising in this part of the annotate function:

    if isinstance(external_metadata, pd.DataFrame):
        external_metadata.columns = [
            "Metadata_{}".format(x) if not x.startswith("Metadata_") else x
            for x in external_metadata.columns
        ]

        annotated = (
            annotated.merge(
                external_metadata,
                left_on=external_join_left,
                right_on=external_join_right,
                how="left",
            )
            .reset_index(drop=True)
            .drop_duplicates()
        )
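
If the memory spike does come from the merge followed by drop_duplicates() on the full single-cell frame, one possible workaround is to de-duplicate the (usually much smaller) external metadata on its join keys before merging. The sketch below is not Pycytominer's implementation: merge_external_metadata is a hypothetical helper, the example data is made up, and it assumes right_on refers to the column names after the Metadata_ prefix has been applied.

```python
import pandas as pd


def merge_external_metadata(annotated, external_metadata, left_on, right_on):
    """Left-join external metadata without de-duplicating the large frame."""
    # Add the Metadata_ prefix on a renamed copy, leaving the caller's
    # frame untouched.
    renamed = external_metadata.rename(
        columns=lambda c: c if str(c).startswith("Metadata_") else f"Metadata_{c}"
    )
    # Keep one row per join key in the small table so the merge cannot
    # multiply rows of the large table.
    renamed = renamed.drop_duplicates(subset=right_on)
    # validate="many_to_one" makes pandas raise if the right-hand keys are
    # not unique, instead of silently duplicating rows.
    return annotated.merge(
        renamed,
        left_on=left_on,
        right_on=right_on,
        how="left",
        validate="many_to_one",
    )


# Made-up example: three cells, external metadata with a repeated row.
cells = pd.DataFrame(
    {"Metadata_Well": ["A01", "A01", "B01"], "Cells_AreaShape_Area": [1.0, 2.0, 3.0]}
)
external = pd.DataFrame(
    {"Metadata_Well": ["A01", "A01", "B01"], "treatment": ["x", "x", "y"]}
)

result = merge_external_metadata(cells, external, "Metadata_Well", "Metadata_Well")
print(result)
```

This is not a drop-in replacement: the original drop_duplicates() also removes rows that were already duplicated in the annotated frame, and when two metadata rows share a key but differ elsewhere, this sketch keeps only the first.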

Machine info

    OS: Pop!_OS 22.04 LTS
    CPU: AMD Ryzen 7 3700X 8-Core Processor
    Memory: 64 GB of RAM

@gwaybio @d33bs

jenna-tomkinson, Apr 26 '23 20:04