
Bug: `SingleCells` class will not extract image features

Open jenna-tomkinson opened this issue 1 year ago • 1 comments

I have been using the SingleCells class from pycytominer's cells.py to extract image and object features measured with CellProfiler. I noticed that the image features are not written to the output CSV file when I set these parameters:

sc = cells.SingleCells(
    sql_file=single_cell_file,
    compartments=["Per_Cells", "Per_Cytoplasm", "Per_Nuclei"],
    compartment_linking_cols=linking_cols,
    image_table_name="Per_Image",
    add_image_features=True,
    image_feature_categories=["Correlation", "Granularity", "Texture", "Intensity"],
    strata=["Image_Metadata_Well", "Image_Metadata_Plate"],
    merge_cols=["ImageNumber"],
    image_cols="ImageNumber",
    load_image_data=True
)

The output CSV has the same number of columns whether add_image_features is set to True or False.

Looking at the code, the self.add_image_features branch calls a function named extract_image_features, whose docstring says it should return two things:

    Returns
    -------
    image_features_df : pandas.core.frame.DataFrame
        Dataframe with extracted image features.
    image_feature_categories : list of str
        Correctly formatted image feature categories.

First, this function returns only image_features_df; it does not return a list of correctly formatted image feature categories.

Second, and more importantly, this function returns an empty list because of how it is written.

The extract_image_features function first calls check_image_features to verify that every category in the given list is present in image_df. It does this by checking whether the category sits at index 1 of a column name when the name is split on underscores (e.g., with Correlation in my list, it raises no error as long as image_df has columns named Image_Correlation...). Because the documentation does not state the expected format, this check is the only thing telling me that image_feature_categories should look like this:

['Correlation', 'ImageQuality', 'Texture', ...]

Each entry in the list is the part of the image_df column name at index 1, i.e. the part immediately after the Image_ prefix.

But when I use this list, extract_image_features returns an empty list because of this portion of the function:
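To make the described behaviour concrete, here is a minimal sketch of a check that accepts the list above. It is only an approximation based on my reading, not the actual check_image_features code; the helper name passes_category_check, the level argument, and the example column names are assumptions made for illustration.

```python
def passes_category_check(categories, columns, level=1):
    """Return True if every category is the underscore-separated token
    at position `level` of at least one column name (illustrative only)."""
    tokens = set()
    for col in columns:
        parts = col.split("_")
        # Columns such as "ImageNumber" have no token at index 1; skip them
        if len(parts) > level:
            tokens.add(parts[level])
    return all(category in tokens for category in categories)


# Hypothetical column names in the style of a CellProfiler Per_Image table
columns = [
    "ImageNumber",
    "Image_Correlation_Correlation_DNA_ER",
    "Image_Granularity_1_DNA",
    "Image_Texture_Contrast_DNA_3_00",
]

print(passes_category_check(["Correlation", "Texture"], columns))  # True
print(passes_category_check(["Image_Correlation"], columns))       # False
```

Under this reading, only bare category names pass the check; names that already carry the Image_ prefix are rejected.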

    # Extract Image features from image_feature_categories
    image_features = list(
        image_df.columns[
            image_df.columns.str.startswith(tuple(image_feature_categories))
        ]
    )

This code block builds a list of every column in image_df whose name starts with one of the feature categories. The only way to pass check_image_features is to format the list as I showed earlier, so the function then looks for columns that start with those bare category names. But in the SQLite file exported from CellProfiler, all of these columns start with Image_.

What confuses me is that check_image_features expects the category at index 1 of the column name, while extract_image_features uses startswith, which can never match in this situation because the column's prefix (Image) sits at index 0.

Taken together, this means the class will never output image features when the SQLite file has been exported directly from CellProfiler.

The only workaround I found was to edit the function so that it skips check_image_features and uses the same list as above, but with every string prefixed with Image_. However, this fix writes a separate CSV file containing the metadata and image features; it does not add the image features alongside the object features in a single CSV, which I assume is what this class was meant to do.
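The mismatch can be reproduced without pycytominer. The sketch below uses plain Python str.startswith with a tuple, which I am assuming behaves like the pandas columns.str.startswith(tuple(...)) expression quoted above; the column names are made-up examples in the CellProfiler style.

```python
# Hypothetical Per_Image column names, all carrying the "Image_" prefix
columns = [
    "ImageNumber",
    "Image_Correlation_Correlation_DNA_ER",
    "Image_Granularity_1_DNA",
    "Image_Texture_Contrast_DNA_3_00",
]
categories = ["Correlation", "Granularity", "Texture"]

# Bare category names: nothing matches, because every name starts with "Image"
bare_matches = [c for c in columns if c.startswith(tuple(categories))]

# Prefixed category names: the three feature columns match
prefixed = tuple(f"Image_{category}" for category in categories)
prefixed_matches = [c for c in columns if c.startswith(prefixed)]

print(bare_matches)           # []
print(len(prefixed_matches))  # 3
```

So the list format that passes the check selects nothing, and the list format that selects the columns would fail the check.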

What would be the best way to edit this function so that it is more flexible with the CellProfiler SQLite output, given that it works for SQLite files whose CellProfiler features were collected differently?
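One possible direction, offered only as a rough sketch and not as the pycytominer implementation: select columns by comparing the category against the first token after an optional Image prefix, so that both naming schemes are handled by the same rule. The helper name select_image_feature_columns and its prefix parameter are hypothetical.

```python
def select_image_feature_columns(columns, categories, prefix="Image"):
    """Hypothetical helper: return columns whose category token matches,
    whether or not the column name starts with the `prefix` token."""
    wanted = set(categories)
    selected = []
    for col in columns:
        tokens = col.split("_")
        # Drop a leading "Image" token so prefixed and bare names compare alike
        if tokens[0] == prefix and len(tokens) > 1:
            tokens = tokens[1:]
        if tokens[0] in wanted:
            selected.append(col)
    return selected


categories = ["Correlation", "Texture"]

# Columns as exported directly from CellProfiler (prefixed)
prefixed_cols = ["ImageNumber", "Image_Metadata_Well", "Image_Correlation_Correlation_DNA_ER"]
# Columns without the prefix (assumed shape of differently collected files)
bare_cols = ["ImageNumber", "Metadata_Well", "Texture_Contrast_DNA_3_00"]

print(select_image_feature_columns(prefixed_cols, categories))
print(select_image_feature_columns(bare_cols, categories))
```

A rule like this would let the check and the extraction share one definition of "category", rather than one looking at index 1 and the other at the start of the name. Merging the selected image features into the per-object output is a separate step that this sketch does not cover.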

jenna-tomkinson • Feb 28 '23 20:02