pycytominer
Bug: `SingleCells` class will not extract image features
I have been using the `SingleCells` class from pycytominer's `cells.py` file to extract image and object features measured with CellProfiler. I noticed that the image features were not written to the output CSV file when I set these parameters:
sc = cells.SingleCells(
    sql_file=single_cell_file,
    compartments=["Per_Cells", "Per_Cytoplasm", "Per_Nuclei"],
    compartment_linking_cols=linking_cols,
    image_table_name="Per_Image",
    add_image_features='True',
    image_feature_categories=['Correlation', 'Granularity', "Texture", "Intensity"],
    strata=["Image_Metadata_Well", "Image_Metadata_Plate"],
    merge_cols=["ImageNumber"],
    image_cols="ImageNumber",
    load_image_data=True
)
The output CSV has the same number of columns whether the `add_image_features` parameter is set to true or false.
When I go into the code, `self.add_image_features` is calling a function called `extract_image_features`, whose docstring says it should return two things:
Returns
-------
image_features_df : pandas.core.frame.DataFrame
    Dataframe with extracted image features.
image_feature_categories : list of str
    Correctly formatted image feature categories.
First, this function only returns the `image_features_df` and does not return a list of correctly formatted image feature categories.
Second, and most importantly, this function returns an empty list because of how it is written.
The `extract_image_features` function first uses the `check_image_features` function to determine whether the categories in the given list are present in the `image_df`. It does this by checking whether a category in the list appears at the first index of a column name, i.e., directly after the `Image_` prefix (e.g., if I have `Correlation` in my list, it will not raise an error as long as the `image_df` has columns named `Image_Correlation...`). Since the documentation does not state it, this function is the only thing that tells me that `image_feature_categories` should be formatted like this:
['Correlation', 'ImageQuality', 'Texture', ...]
The list contains the first index of the column names in the `image_df`.
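To make the expected format concrete, here is a minimal standalone sketch of the kind of check described above. It is not the actual `check_image_features` code: the function name `categories_pass_check` and the example column names are hypothetical, and I am assuming that column names are split on underscores.

```python
def categories_pass_check(image_feature_categories, image_columns):
    """Return True if every category appears at index 1 of some column name.

    Hypothetical stand-in for the behavior described above, not pycytominer code.
    """
    # e.g. "Image_Correlation_Correlation_DNA_ER".split("_")[1] == "Correlation"
    found = {col.split("_")[1] for col in image_columns if "_" in col}
    return set(image_feature_categories).issubset(found)


# Made-up column names in the style of a CellProfiler SQLite export
image_columns = [
    "ImageNumber",
    "Image_Correlation_Correlation_DNA_ER",
    "Image_Texture_Contrast_DNA_3_00",
    "Image_Granularity_1_DNA",
]

print(categories_pass_check(["Correlation", "Texture"], image_columns))  # True
print(categories_pass_check(["Image_Correlation"], image_columns))       # False
```

Under this assumption, only the bare category names pass the check, while the prefixed form is rejected.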
But when I use this list, `extract_image_features` returns an empty list because of this portion of the function:
# Extract Image features from image_feature_categories
image_features = list(
    image_df.columns[
        image_df.columns.str.startswith(tuple(image_feature_categories))
    ]
)
This code block builds a list of image columns from any column in the `image_df` that starts with one of the feature categories. Because the only way to pass the `check_image_features` function is to format the list as I showed earlier, the code ends up looking for columns that start with those bare category names. But when I look at the SQLite file exported from CellProfiler, all columns start with `Image_`.
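The mismatch is easy to reproduce outside of pycytominer. The sketch below uses plain Python `str.startswith` with a tuple, which does the same kind of prefix matching as the pandas call above. The column names are made-up examples in the style of the CellProfiler export.

```python
image_feature_categories = ["Correlation", "Granularity", "Texture", "Intensity"]

# Made-up column names; every feature column carries the "Image_" prefix
image_columns = [
    "ImageNumber",
    "Image_Correlation_Correlation_DNA_ER",
    "Image_Granularity_1_DNA",
    "Image_Texture_Contrast_DNA_3_00",
    "Image_Intensity_MeanIntensity_DNA",
]

# Same prefix test as in the snippet above, applied to plain strings
image_features = [
    col for col in image_columns
    if col.startswith(tuple(image_feature_categories))
]

print(image_features)  # [] because no column name begins with a bare category
```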
What is very confusing to me is that the `check_image_features` function expects the category to be at the first index of the column name, while `extract_image_features` uses `startswith`, which can never work in this situation because the column prefix occupies the zero index.
Based on all this, this function or class will never output image features if the SQLite file has been exported directly from CellProfiler.
The only way I was able to get around this was by editing the function so that it skips the `check_image_features` function and uses a list like the one above, but with every string prefixed with `Image_`. However, this fix creates a separate CSV file with the metadata and image features rather than adding the image features to the object features in a single CSV, which I assume is what this class was meant to do.
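For reference, the prefixing part of that workaround amounts to something like this. It is a standalone sketch rather than the edited pycytominer function: `add_image_prefix` is a hypothetical helper, and the check for whether the columns already carry `Image_` is an extra assumption of mine so that files without the prefix would keep working.

```python
def add_image_prefix(image_feature_categories, image_columns, prefix="Image_"):
    """Prepend the prefix to each category when the columns use it (hypothetical helper)."""
    columns_are_prefixed = any(col.startswith(prefix) for col in image_columns)
    if not columns_are_prefixed:
        # Leave the categories untouched for files without the prefix
        return list(image_feature_categories)
    return [
        cat if cat.startswith(prefix) else prefix + cat
        for cat in image_feature_categories
    ]


# Made-up column names in the style of a CellProfiler SQLite export
image_columns = [
    "ImageNumber",
    "Image_Correlation_Correlation_DNA_ER",
    "Image_Texture_Contrast_DNA_3_00",
]

prefixed = add_image_prefix(["Correlation", "Texture"], image_columns)
image_features = [col for col in image_columns if col.startswith(tuple(prefixed))]

print(prefixed)        # ['Image_Correlation', 'Image_Texture']
print(image_features)  # the two Image_* feature columns, without ImageNumber
```

This only addresses the column selection. It does not address merging the image features with the object features into one CSV.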
What would be the best way to edit this function so that it is more flexible to the CellProfiler SQLite output, given that it works for SQLite files whose CellProfiler features were collected differently?