
Are there detailed results that I could read?

FabianIsensee opened this issue 3 years ago • 7 comments

Hi there, I would really like to read some additional details about nnU-Net and Models Genesis. So far you seem to have taken first place in Task03, but it is difficult to see whether that is a significant result (given that I am getting some variation when running the same training several times, which will also translate into different test set performances). Overall, your submission on the Decathlon is below ours, indicating that the pretraining may not be beneficial on all tasks. This makes it difficult to really estimate the impact of your pre-training strategy.

Specifically, I would be interested in how much your pretrained models help in semi-supervised learning. Say you take all non-LiTS/Task03 (Task03 is essentially LiTS) datasets with livers in them (BCV Abdomen, KiTS, Pancreas (?), ...) and run Models Genesis on them for pretraining: how well does your pretrained nnU-Net perform when fine-tuned on 10, 20, 50, etc. LiTS cases for the LiTS task? Can you beat the nnU-Net baseline by a significant margin if you use all these additional datasets for pretraining?

Best, Fabian

FabianIsensee avatar Oct 16 '20 06:10 FabianIsensee

Hi Fabian,

Thank you for your comments, and we greatly admire your work on nnU-Net. Also, thank you for patiently answering the many questions from us (Shivam).

With the widespread success of nnU-Net, we hypothesize that initializing these models with good starting points, particularly by learning representations from large-scale medical images via self-supervision, will boost nnU-Net performance, especially for applications with limited annotation.

Regarding your comments: indeed, when training from scratch and fine-tuning Models Genesis N times each, the best score from fine-tuning is not necessarily higher than the best score from training from scratch. However, in the competition, the best model among multiple local runs is selected and submitted. As of now, we have not claimed in our papers that fine-tuning nnU-Net outperforms training nnU-Net from scratch. Instead, our paper demonstrates that fine-tuning Models Genesis leads to more stable performance and a higher average score over multiple runs. Recently, we conducted an experiment on LiTS/Task03 with 3 runs. Here are the results for liver tumor segmentation evaluated on the validation set (a quick check of the reported mean and spread is sketched after the list):

  • Reported in nnU-Net: 63.72%
  • Reproduced nnU-Net: 62.38%, 63.32%, 62.13% (62.61% +- 0.51%)
  • Pre-trained nnU-Net: 64.53%, 63.89%, 64.98% (64.47% +- 0.45%)
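
A minimal sketch of how the mean and spread above are computed, assuming the population standard deviation (ddof=0), which matches the reported ±0.51% and ±0.45%:

```python
import numpy as np

# Tumor Dice scores (%) from the three runs listed above.
scratch = np.array([62.38, 63.32, 62.13])
pretrained = np.array([64.53, 63.89, 64.98])

for name, scores in [("Reproduced nnU-Net", scratch),
                     ("Pre-trained nnU-Net", pretrained)]:
    # ddof=0 (population std) reproduces 62.61% +- 0.51% and 64.47% +- 0.45%.
    print(f"{name}: {scores.mean():.2f}% +- {scores.std(ddof=0):.2f}%")
```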

So far, we pre-trained nnU-Net on the LUNA 2016 dataset only, meaning that the pre-trained model saw none of the images in those target datasets. Given that the nnU-Net configuration differs from task to task, we are not sure if pre-training the architecture individually for each task would be an efficient solution. We think semi-supervised learning, as you suggested, is worth trying.

Thanks again for your comments, and we would greatly appreciate any further comments and suggestions that you may have.

Zongwei & Shivam

MrGiovanni avatar Oct 20 '20 20:10 MrGiovanni

Hi Zongwei & Shivam, thank you very much for your detailed response! It is very much appreciated! And please know that my intention was not to question the usefulness of pretraining - in fact, I think this is one of the most promising directions of research at the moment, and I would like to understand current approaches better. From your results, it appears that there is a positive effect of pretraining on the performance of nnU-Net. Have you tried training on fewer cases (both with and without pretraining) to see if you can widen that gap? Are the numbers you are reporting based on a five-fold cross-validation, or are they generated with a single train-val split?

I can see how the different architectures for each dataset may be prohibitive, but there are ways of working around that. I am not quite sure how you implemented it, but you can try to just transfer the nnUNet_plan_and_preprocess result from LiTS/Task03 to any other dataset, causing them to be preprocessed in the same way (and also to have the same architecture). This will allow you to pretrain on anything and then apply the result to the target dataset.
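
A rough sketch of what such a plans transfer could look like is given below. This is only an illustration, assuming a v1-style folder layout under the nnUNet_preprocessed environment variable and the nnUNetPlansv2.1_plans_3D.pkl filename; both may differ for your nnU-Net version, and the target task name is hypothetical:

```python
import os
import shutil

# Assumed v1-style layout; adjust the plans filename and task names to your setup.
preprocessed_root = os.environ["nnUNet_preprocessed"]
source_task = "Task003_Liver"         # LiTS/Task03: defines preprocessing + architecture
target_task = "Task999_Pretraining"   # hypothetical dataset used for pretraining

plans_file = "nnUNetPlansv2.1_plans_3D.pkl"
src = os.path.join(preprocessed_root, source_task, plans_file)
dst = os.path.join(preprocessed_root, target_task, plans_file)

# Copy the LiTS plans over to the pretraining dataset so that it is preprocessed
# the same way (spacing, normalization) and yields the same network architecture.
os.makedirs(os.path.dirname(dst), exist_ok=True)
shutil.copy(src, dst)

# The target dataset then needs to be (re)preprocessed according to the copied
# plans before pretraining, so that pretraining and fine-tuning share one
# data representation and architecture.
```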

Let me know if you have any questions. Best, Fabian

FabianIsensee avatar Oct 21 '20 06:10 FabianIsensee

You had me worried with your nnU-Net results reproduction :-D I was afraid that I had broken something because you were unable to get the same tumor Dice score that I was initially reporting. So I reran the trainings for 3d_fullres (which is what you seem to be using) and got 64.54 before and 65.53 after postprocessing (5-fold CV with the same splits as used for the number reported in our paper). So everything is fine. The small size of the datasets unfortunately causes some variation between runs.

FabianIsensee avatar Oct 27 '20 11:10 FabianIsensee

Hi FabianIsensee, the nnUNet_plan_and_preprocess result contains a bunch of npy, npz, and pkl files, and I don't know how to process these files for pretraining. Can I integrate the pretraining process into nnUNet_train and enable different loss functions and network architectures in different training stages?

tea321000 avatar Oct 28 '20 03:10 tea321000

Hi, the npy file is simply the unpacked npz file for faster reading (mmap_mode='r'). See the nnU-Net dataloaders for that. The pkl file contains some important information, for example the image geometry as well as where the classes are located in the image. Again, look at the nnU-Net dataloader to see the details. If you intend to use pretraining, then all you need is the npy files. They are already resampled and intensity normalized. Note that the data in npy is always 4D (c, x, y, z) and that the last channel in c is the segmentation (so if the data has only one modality, c=2, because index 0 is the image and index 1 is the segmentation). 2D slices are always taken via a[:, SLICE_IDX, :, :]. The data was transposed to have the in-plane axes in the trailing dimensions. It is difficult to fully explain everything. I can only emphasize that you should look into the nnU-Net dataloaders :-) Best, Fabian
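
As an illustration of the layout described above, reading one preprocessed case might look like the sketch below; the file path and folder names are hypothetical, and the (c, x, y, z) convention with the segmentation in the last channel is taken from the comment above:

```python
import numpy as np

# Hypothetical path to one unpacked preprocessed case; actual folder and case
# names depend on your task, plans identifier, and stage.
case_path = "nnUNet_preprocessed/Task003_Liver/nnUNetData_plans_v2.1_stage1/liver_001.npy"

# mmap_mode='r' reads the array lazily from disk (the reason the npz files are unpacked to npy).
data = np.load(case_path, mmap_mode="r")  # shape: (c, x, y, z)

image = data[:-1]        # all image modalities (for single-modality CT: shape (1, x, y, z))
segmentation = data[-1]  # the last channel is the segmentation

# A 2D slice along the first spatial axis, as described above (a[:, SLICE_IDX, :, :]).
slice_idx = data.shape[1] // 2
slice_2d = data[:, slice_idx, :, :]
```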

FabianIsensee avatar Oct 28 '20 09:10 FabianIsensee

Hi Fabian,

The reproduced results that we shared with you (62.61% +- 0.51%) are evaluated with 5-fold CV, which should correspond to the "0.6372" in Table F.6 of https://arxiv.org/pdf/1904.08128.pdf. Unfortunately, as of now, we are not able to reproduce a Dice score of 64.54% before and 65.53% after postprocessing with the 3d_fullres configuration.

For your reference, here are some of our configurations:

  • HPE Apollo 6500 server
  • One Nvidia Tesla v100 16GB
  • Batch size = 2
  • Patch size = 128x128x128

May we ask you to fine-tune our released pre-trained nnU-Net weights under your environment setup and see if there is a performance gain? Besides, it would be really helpful if you could provide the mean and standard deviation of the validation performance if you have multiple runs of training nnU-Net from scratch. We have noticed that performance fluctuation occurs, especially when training with varying environment configurations. Hence, stabilizing and elevating the overall performance is the primary purpose of developing Models Genesis.

Thanks, Zongwei & Shivam

MrGiovanni avatar Nov 02 '20 21:11 MrGiovanni

Hi, have you used the most recent version of nnU-Net to produce these results? I ran my experiments last week with the most recent master. If you want, I can run the experiments several times so that I can report reproducibility. But to be honest, I don't think this is related to the hardware setup at all! The difference between runs is just the inherent variation of the dataset. Even though LiTS is quite large, it is still very small in comparison to datasets from other domains.

I think pretraining can improve the results overall, but I don't think it can reduce the spread of the results substantially. You would need to run a lot of experiments to be able to make that claim. Other experiment frameworks will probably give much less variation because they only use deterministic implementations in cuDNN and seed all their experiments. I think that is a big mistake - it gives a false sense of security. And one will be tempted to publish results ('look, we outperformed XXX on dataset YYY by a small margin') when in reality one is just overfitting to the random seed. I regard the randomness as a good thing - it tells me what I don't know. Can I find a random seed where a special U-Net is better than nnU-Net? Yes. Does that translate to other seeds? No. So why bother ^^

I am sorry that I cannot really help you. I wish I could. If it helps you, I can run five 5-fold CVs with the 3d_fullres configuration and report the spread. I can also control what hardware it runs on in case you are interested in that information. But know that I never distinguish between hardware setups: it just runs on whatever is available: 2080 Ti, Titan RTX, V100 16GB, V100 32GB, RTX 6000. Best, Fabian

FabianIsensee avatar Nov 03 '20 14:11 FabianIsensee