
Fix: Slide IDs turned into floats in split CSV when names consist only of numbers

Open ff98li opened this issue 1 year ago • 0 comments

Summary of the Issue

  • Slide IDs consisting solely of digits are inadvertently converted to floats in the split CSV files.
    • The train, val, and test splits have unequal lengths, so NaN values are introduced when save_splits() concatenates them into a single dataframe.
    • Pandas then converts columns of all-numeric slide names that contain NaN to floats, because its default integer dtype cannot represent NaN. (Screenshot 2024-02-26 at 1 27 58 PM)
  • Loading the splits via the following line then raises the ValueError shown in the screenshot: https://github.com/mahmoodlab/CLAM/blob/3f875f77465b410d260f2afcfaea608a9d6ddbca/datasets/dataset_generic.py#L247 (Screenshot 2024-02-26 at 2 07 43 PM)
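
The conversion can be reproduced outside CLAM with a few lines of pandas. This is a minimal sketch, not the repository's code: the slide IDs are made up, and the concatenation only imitates what save_splits() is described as doing above.

```python
import pandas as pd

# Hypothetical all-numeric slide IDs; the three splits have different lengths.
train = pd.Series([101, 102, 103])
val = pd.Series([201])
test = pd.Series([301, 302])

# Concatenate the splits side by side, as described for save_splits().
# Shorter splits are padded with NaN to match the longest one.
df = pd.concat([train, val, test], ignore_index=True, axis=1)
df.columns = ["train", "val", "test"]

# Columns that received NaN padding are upcast from int64 to float64.
print(df.dtypes)

# The float representation is what ends up in the CSV text.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Only the columns that were padded change type; the longest split keeps its integer dtype, so the same CSV can mix `101` and `201.0` style IDs.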

Proposed fix

  • Cast slide IDs to strings in save_splits before they are saved to CSV, to prevent unintended type conversion.
    • Result: Screenshot 2024-02-26 at 2 40 21 PM
  • Continue to read the dataset CSV with dtype=object in Generic_WSI_Classification_Dataset.
    • In get_split_from_df(), cast the dtype of the corresponding split column to match that of self.slide_data['slide_id'].
    • This fix relates to https://github.com/mahmoodlab/CLAM/pull/90
    • Result: Screenshot 2024-02-26 at 3 01 36 PM
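
The two parts of the fix can be sketched together as follows. This is an illustration under stated assumptions rather than the actual patch: `save_splits_sketch` is a hypothetical stand-in for save_splits, `slide_ids` stands in for self.slide_data['slide_id'], and the IDs are invented.

```python
import io
import pandas as pd

def save_splits_sketch(splits, column_keys):
    """Hypothetical stand-in for save_splits: cast IDs to str before concatenating."""
    # Casting happens before concat, so no NaN exists yet to be turned into "nan".
    splits = [s.astype(str) for s in splits]
    df = pd.concat(splits, ignore_index=True, axis=1)
    df.columns = column_keys
    return df.to_csv(index=False)

csv_text = save_splits_sketch(
    [pd.Series([101, 102, 103]), pd.Series([201]), pd.Series([301, 302])],
    ["train", "val", "test"],
)

# Read the split file back without type inference, mirroring dtype=object.
all_splits = pd.read_csv(io.StringIO(csv_text), dtype=object)

# Stand-in for self.slide_data['slide_id'], also loaded as object (strings).
slide_ids = pd.Series(["101", "102", "103", "201", "301", "302"], dtype=object)

# Drop the NaN padding, then match the split column's dtype to slide_ids.
val_ids = all_splits["val"].dropna().astype(slide_ids.dtype)
mask = slide_ids.isin(val_ids)
print(slide_ids[mask].tolist())
```

Casting before the concatenation matters: applying `astype(str)` to an already padded column would turn the NaN padding into the literal string "nan".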

This happened when I was working with the dataset CSV for my own task. I can provide the CSV file to reproduce this bug if need be.

ff98li · Feb 26 '24 20:02