
About Pretraining Data Formats

Open WYC-321 opened this issue 3 years ago • 7 comments

I downloaded the pre-training dataset from TCIA, but the downloaded data is in .dcm format, which is inconsistent with the .nii.gz format listed in the json file. Was a format conversion step applied?

WYC-321 avatar Jun 06 '22 15:06 WYC-321

Hi @WYC-321

We did convert the DICOM files to NIfTI. In addition, we filtered out some outlier cases based on the information provided in the metadata. Please see the json files containing the exact train/val splits here.

Thanks

ahatamiz avatar Jun 06 '22 20:06 ahatamiz

Hi @ahatamiz, thank you for your answer. After looking at the dataset I have some more detailed questions:

(1) Are the DICOM files simply converted to NIfTI without any additional processing?

I noticed that the naming rules in the json file differ from the naming rules of the database. For example, in the dataset_TCIAcolon_v2_0.json file the images are named like img_19.nii.gz, but in the TCIA CT Colonography Trial database the directory paths look like CT COLONOGRAPHY\1.3.6.1.4.1.9328.50.4.0019\01-01-2000-1-CT ABD WCONT RECONSTRUCTION-18588. I'm guessing that the 0019 in 1.3.6.1.4.1.9328.50.4.0019 refers to img_19, but there are six subfolders under this directory: 1.000000-NA-18589 (1 DICOM file), 3.000000-NA-18592 (482 DICOM files), 5.000000-NA-19075 (1 DICOM file), 7.000000-NA-19078 (438 DICOM files), 9.000000-NA-19517 (1 DICOM file), and 11.000000-NA-19520 (444 DICOM files). So even with the json file, I still don't know which subfolder img_19.nii.gz refers to (the data in all six subfolders, or in just one?). The other datasets have similar problems. My remaining questions are:

(2) How can I link the files in the original database to the files described by the json?

(3) Some subfolders contain multiple DICOM slices. Are they just concatenated in order and converted to a single NIfTI file?

(4) Given how many details are involved, would it be possible to release a script that converts the raw data to the data described in the json file?
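As a starting point for question (2), the folder layout described above can be inventoried with a short script. This is only a sketch under stated assumptions, not the authors' pre-processing pipeline: the helper names `case_index` and `inventory` are hypothetical, and the idea that the last component of the patient folder name (0019) maps to img_19 is the guess made above, which has not been confirmed in this thread. The script only lists each series folder with its DICOM file count, so that single-file series can be told apart from full volumes.

```python
from pathlib import Path
from typing import Dict, List, Optional, Tuple


def case_index(patient_dir_name: str) -> Optional[int]:
    """Guess the case index from a patient folder name.

    Assumption (unconfirmed): the last dot-separated component,
    e.g. "0019" in "1.3.6.1.4.1.9328.50.4.0019", is the index used
    in the json file name (img_19.nii.gz).
    """
    last = patient_dir_name.rsplit(".", 1)[-1]
    return int(last) if last.isdecimal() else None


def inventory(root: Path) -> Dict[int, List[Tuple[Path, int]]]:
    """Map each guessed case index to its series folders and .dcm counts.

    Series folders are sorted by file count, largest first, so the
    candidate volumes appear before single-file series.
    """
    result: Dict[int, List[Tuple[Path, int]]] = {}
    for patient_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        idx = case_index(patient_dir.name)
        if idx is None:
            continue  # folder name does not end in a numeric component
        counts: Dict[Path, int] = {}
        # Count .dcm files per containing folder, at any depth.
        for dcm in patient_dir.rglob("*.dcm"):
            counts[dcm.parent] = counts.get(dcm.parent, 0) + 1
        result[idx] = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return result
```

Calling `inventory(Path("CT COLONOGRAPHY"))` on the downloaded collection would return, for case 19, the six series folders with their file counts. It does not decide which of the multi-slice series corresponds to img_19.nii.gz; that still needs the authors' answer.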

Finally, thanks again for your excellent work and contributions to open source code.

Best wishes !

WYC-321 avatar Jun 07 '22 06:06 WYC-321

Hi @WYC-321,

I believe the best way to address your questions is to release the pre-processing pipeline. I have raised the issue regarding this with our team members and the code for pre-processing shall be released very soon.

CC: @wyli

Best

ahatamiz avatar Jun 13 '22 14:06 ahatamiz

Thanks a lot to your team.

WYC-321 avatar Jun 28 '22 09:06 WYC-321

@WYC-321 I have the same issue with the code. Did you manage to work it out?

Jamshidhsp avatar Oct 12 '22 08:10 Jamshidhsp

I also downloaded the datasets and tried to follow the split in the json file. However, for HNSCC as well as TCIAcolon, it's hard to convert the downloaded data to the required NIfTI files, because I can't find the correspondence between the two.
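For the conversion step itself, once a series folder has been chosen, a generic DICOM-to-NIfTI conversion can be sketched with SimpleITK. This is not the pre-processing used for the released json splits, which had not been published at the time of this thread; it does not reproduce the outlier filtering mentioned earlier, and it does not resolve which series belongs to which file. The helper names `nifti_name` and `convert_series` are hypothetical, and the img_<index>.nii.gz pattern is taken only from the TCIAcolon example quoted above.

```python
from pathlib import Path


def nifti_name(case_index: int) -> str:
    """File name in the style of the TCIAcolon json, e.g. img_19.nii.gz."""
    return f"img_{case_index}.nii.gz"


def convert_series(series_dir: Path, out_path: Path) -> None:
    """Read one DICOM series folder and write it as a single NIfTI volume.

    Assumption: the folder holds one series of interest; if several
    series IDs are found, only the first is converted.
    """
    import SimpleITK as sitk  # third-party dependency, imported lazily

    reader = sitk.ImageSeriesReader()
    series_ids = reader.GetGDCMSeriesIDs(str(series_dir))
    if not series_ids:
        raise ValueError(f"No DICOM series found in {series_dir}")
    # GDCM returns the slice files of the series in sorted order,
    # so the slices are not concatenated by file name.
    file_names = reader.GetGDCMSeriesFileNames(str(series_dir), series_ids[0])
    reader.SetFileNames(file_names)
    image = reader.Execute()
    # The output format is inferred from the .nii.gz extension.
    sitk.WriteImage(image, str(out_path))
```

A call such as `convert_series(series_dir, Path("out") / nifti_name(19))` would write one volume. Any resampling, reorientation, or intensity handling in the original pipeline is unknown and is not applied here.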

JiaxinZhuang avatar Nov 16 '22 09:11 JiaxinZhuang

@JiaxinZhuang @WYC-321 did you manage to figure it out? I'm also struggling with the naming relationship for the datasets (HNSCC and COLON).

JakobDexl avatar Feb 20 '23 11:02 JakobDexl