Wsi-Caption questions about the WSIs+caption paired data

Thanks for the amazing work!

I have a question regarding the slides. A patient can have more than one slide. For example, the TCGA-5T-A9QA case has both the TCGA-5T-A9QA-01A-01-TSA and TCGA-5T-A9QA-01Z-00-DX1 slides. How do you pair these slides with the caption during model training?

Sep 16 '24 17:09 pxliang

We use the "DX" slide.

Sep 17 '24 00:09 cpystan

Thank you for your great work! I have a question. A "DX" case can also have more than one slide. For example, "TCGA-D8-A3Z5" has "TCGA-D8-A3Z5-01Z-00-DX1", "TCGA-D8-A3Z5-01Z-00-DX2" and "TCGA-D8-A3Z5-01Z-00-DX3", but there is only one report for "TCGA-D8-A3Z5". Is that report used for all three slides?

Oct 31 '24 13:10 51265904017

Yes, some cases have several DX slides. In that situation, we choose 'DX1'.
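
For readers who want to reproduce this selection, here is a minimal sketch, not the authors' actual code. It assumes slide IDs are bare TCGA barcodes like the ones quoted in this thread (no UUID or `.svs` suffix). The names `SLIDE_RE` and `pick_dx_slide` are hypothetical. It keeps the lowest-numbered DX slide per case, which is DX1 whenever DX1 exists.

```python
import re

# Assumed barcode layout: <case id>-<sample/vial>-<portion>-<slide type><index>
# e.g. TCGA-D8-A3Z5-01Z-00-DX1 -> case "TCGA-D8-A3Z5", type "DX", index 1
SLIDE_RE = re.compile(
    r"^(TCGA-[A-Z0-9]{2}-[A-Z0-9]{4})-\w{3}-\w{2}-([A-Z]+)(\d+)?$"
)

def pick_dx_slide(slide_ids):
    """Return {case_id: slide_id}, keeping one DX slide per case.

    Non-DX slides (e.g. the "TSA" slide mentioned above) are skipped.
    Among several DX slides, the lowest index wins (an assumption that
    matches "choose DX1" when DX1 is present).
    """
    best = {}
    for sid in slide_ids:
        m = SLIDE_RE.match(sid)
        if m is None or m.group(2) != "DX":
            continue
        case, idx = m.group(1), int(m.group(3) or 0)
        if case not in best or idx < best[case][0]:
            best[case] = (idx, sid)
    return {case: sid for case, (_, sid) in best.items()}

if __name__ == "__main__":
    slides = [
        "TCGA-5T-A9QA-01A-01-TSA",
        "TCGA-5T-A9QA-01Z-00-DX1",
        "TCGA-D8-A3Z5-01Z-00-DX2",
        "TCGA-D8-A3Z5-01Z-00-DX1",
        "TCGA-D8-A3Z5-01Z-00-DX3",
    ]
    for case, sid in pick_dx_slide(slides).items():
        print(case, "->", sid)
```

Each case's single report would then be paired with the one slide returned for that case.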

Nov 01 '24 02:11 cpystan

Thank you for your reply. I have another question. Do you use "splits_0.csv" as the dataset split in your experiments? The train and val sets share some cases. For example, train has "TCGA-D8-A73X-01Z-00-DX1" and val has "TCGA-D8-A73X-01Z-00-DX2". If you only choose 'DX1', how do you deal with this? Do you delete "TCGA-D8-A73X-01Z-00-DX2" from val? If you delete all the "DX2", "DX3" and "DX4" slides, only 977 slides remain in the BRCA dataset.

Nov 03 '24 05:11 51265904017

We ignore such duplicated cases in the val and test sets, so the total number of slides is a bit smaller.
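
One possible reading of this, sketched under assumptions: slides in val or test are dropped when their case already appears in train. The column layout of "splits_0.csv" is not described in this thread, so the sketch works on plain lists of slide IDs. The function name `drop_leaked_cases` and the ID "TCGA-XX-0001-01Z-00-DX1" are hypothetical.

```python
def case_of(slide_id):
    """Case ID = first three dash-separated fields of a TCGA slide barcode."""
    return "-".join(slide_id.split("-")[:3])

def drop_leaked_cases(train_ids, eval_ids):
    """Return eval_ids without slides whose case also appears in train_ids."""
    train_cases = {case_of(s) for s in train_ids}
    return [s for s in eval_ids if case_of(s) not in train_cases]

if __name__ == "__main__":
    train = ["TCGA-D8-A73X-01Z-00-DX1"]
    # The second ID is a made-up placeholder, not a real TCGA case.
    val = ["TCGA-D8-A73X-01Z-00-DX2", "TCGA-XX-0001-01Z-00-DX1"]
    print(drop_leaked_cases(train, val))
```

Filtering by case rather than by slide ID is what prevents a DX2 slide in val from sharing its report with a DX1 slide in train.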

Nov 04 '24 03:11 cpystan

Why were the DX-labeled WSI slides kept while the other slides were discarded? This is a "multiple slides, single report" paired dataset, and the other slides also contribute to the diagnostic report.

May 22 '25 05:05 xinfzhang

'DX' denotes diagnostic slides, which are scanned at high resolution (usually 40x magnification) and are therefore suitable for pathologists' diagnosis and analysis.

May 22 '25 11:05 cpystan