
Training with patch-wise crop

Open MLRadfys opened this issue 4 years ago • 4 comments

Hi,

I just noticed that if you set the 'prepare_batches' flag while training with patch-wise cropping, the network uses the same patches for each epoch instead of creating a new 'patch-based' training set. I would have thought that new patches are cropped randomly when a new epoch starts.

Is there any reason for that?

(If the prepare_batches flag is not set, it works as expected.)

Thanks in advance,

best regards,

Michael

MLRadfys avatar Jun 08 '20 06:06 MLRadfys

Heyho MichaelLempart,

it works as expected, even though I have to admit that the naming is not the most intuitive. ;)

The intention behind it: I wanted to create an option to switch between two modes: the traditional and the on-the-fly approach.

  1. Traditional approach: Traditionally, you have a data set -> you perform data augmentation -> your data set is now exactly 10x bigger because you augmented each single image into 10 variations. This is also normally performed on file level. Therefore, instead of 50 image files, we now have 500 image files.

Example: Data set size = 50 -> after data augmentation before fitting -> data set size 500 -> 500 iterations

In MIScnn this approach should look something like this:

from miscnn import Data_Augmentation, Preprocessor

# data_io: an already initialized miscnn Data_IO instance
data_aug = Data_Augmentation(cycles=10)  # every image is augmented into 10 variations
pp = Preprocessor(data_io, data_aug=data_aug, batch_size=2, prepare_batches=True)
  2. On-the-fly approach: In recent years, on-the-fly approaches have become more and more popular (and have also shown considerably higher performance!). The idea behind this approach is that you use your data set as a variation database. This means that in an iteration during the fitting process, you pull an image X, then (during the fitting!) perform the data augmentation -> obtain a new and unique image -> train on it -> discard it. After an epoch, you return to image X -> pull it again -> perform data augmentation -> obtain a new and, again, unique image which the model has not seen before -> train on it -> discard it.

This method allows training on new and unique images in each iteration, which strongly reduces the risk of overfitting and therefore results in more powerful models even with limited data set sizes.

Sounds nice, but it has the huge disadvantage that the CPU-intensive data augmentation step now has to be performed during fitting in each iteration. This can become a huge bottleneck during training: you have to ensure that the next on-the-fly batch is created fast enough before the fitting on the current one is finished, otherwise you will throttle your training speed. (A small library-agnostic sketch contrasting both modes follows after the second MIScnn snippet below.)

Example: Data set size = 50 -> no data augmentation before fitting -> data set size 50 -> 50 iterations (new data augmentation on each iteration/image during fitting)

In MIScnn this approach should look something like this:

from miscnn import Data_Augmentation, Preprocessor

# data_io: an already initialized miscnn Data_IO instance
data_aug = Data_Augmentation(cycles=1)  # a single, freshly augmented variation per pull
pp = Preprocessor(data_io, data_aug=data_aug, batch_size=2, prepare_batches=False)
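To make the difference concrete, here is a small library-agnostic sketch of the two strategies in plain Python/NumPy (not MIScnn code; augment is just a placeholder for any augmentation routine):

import numpy as np

def augment(image, rng):
    # Placeholder augmentation: random flip plus a bit of additive noise
    if rng.random() < 0.5:
        image = np.flip(image, axis=0)
    return image + rng.normal(0, 0.05, size=image.shape)

rng = np.random.default_rng(0)
dataset = [np.random.rand(64, 64) for _ in range(50)]

# Traditional: augment once before fitting -> fixed, 10x larger training set
traditional_set = [augment(img, rng) for img in dataset for _ in range(10)]
for epoch in range(3):
    for sample in traditional_set:   # identical samples in every epoch
        pass                         # train_step(sample) would go here

# On-the-fly: augment inside the training loop -> unique samples in every epoch
for epoch in range(3):
    for img in dataset:
        sample = augment(img, rng)   # new variation each time the image is pulled
        # train_step(sample) would go here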

Back to the question: What does this have to do with the patch-wise crop analysis type, and how is it realised in MIScnn?

The traditional/on-the-fly approach SHOULD NOT be associated with the analysis type.

BUT: It is in MIScnn, sadly.

Currently, for the traditional approach (prepare_batches=True), MIScnn computes the ready-for-the-GPU batches beforehand and stores them to disk. Therefore, the patch-wise cropping is also done beforehand and only once.

The main idea, at the time I implemented it, was to compute batches beforehand instead of during training runtime.
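In a conceptual sketch (not the actual MIScnn implementation), the consequence for patch-wise cropping looks roughly like this:

import numpy as np

rng = np.random.default_rng(42)
volume = np.random.rand(128, 128)

def random_patch(vol, patch_shape=(32, 32)):
    # Draw a random crop origin and cut the patch out of the volume
    y = rng.integers(0, vol.shape[0] - patch_shape[0])
    x = rng.integers(0, vol.shape[1] - patch_shape[1])
    return vol[y:y + patch_shape[0], x:x + patch_shape[1]]

# prepare_batches=True: patches are cropped once, written to disk and reused,
# so every epoch trains on exactly the same crops
prepared = [random_patch(volume) for _ in range(4)]
for epoch in range(3):
    batches = prepared                                   # unchanged across epochs

# prepare_batches=False: patches are cropped again in every epoch
for epoch in range(3):
    batches = [random_patch(volume) for _ in range(4)]   # fresh crops each epoch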

Therefore, this behaviour is expected (it grew out of MIScnn's continuous development from scratch), but it is not correct!

I will add this to my to-do list / agenda for reworking prepare_batches into some kind of onthefly_dataaug option. From a categorical point of view, it could also make sense to move the option completely into the Data Augmentation class instead of the Preprocessor (even if most of the code for this feature is located there).

Hope that I was able to help you. And big thanks for pointing this issue out! :)

Cheers, Dominik

muellerdo avatar Jun 08 '20 10:06 muellerdo

Thanks for this detailed explanation,

yes, I noticed that the cropping and augmentation can become a huge bottleneck when training on the fly. Maybe some multiprocessing could increase the performance.

Another thing that would be great (which should probably go in a separate post) would be the ability to re-use an already prepared dataset which has been saved to disk.

As far as I have understood, all batches are prepared over and over again when you restart a training session. Maybe a simple function that reads the batch folder and replaces the sample list with the files found there would be enough?

Best regards,

Michael

MLRadfys avatar Jun 08 '20 10:06 MLRadfys

yes, I noticed that the cropping and augmentation can become a huge bottleneck when training on the fly. Maybe some multiprocessing could increase the performance.

Currently, the SingleThreaded interface from batchgenerators is implemented. But you are right, I added it to my agenda to test out the MultiThreaded interface.
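For reference, a rough sketch of what such a switch could look like on top of batchgenerators (hypothetical wiring; build_augmenter, data_loader and transform are placeholders for the existing pipeline components, and the actual integration into MIScnn may differ):

from batchgenerators.dataloading.single_threaded_augmenter import SingleThreadedAugmenter
from batchgenerators.dataloading.multi_threaded_augmenter import MultiThreadedAugmenter

def build_augmenter(data_loader, transform, processes=4):
    # processes <= 1 keeps the current single-threaded behaviour
    if processes <= 1:
        return SingleThreadedAugmenter(data_loader, transform)
    # Several worker processes prepare upcoming batches in parallel,
    # so the augmentation ideally finishes before the GPU needs them
    return MultiThreadedAugmenter(data_loader, transform,
                                  num_processes=processes,
                                  num_cached_per_queue=2)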

Another thing that would be great (which should probably go in a separate post) would be the ability to re-use an already prepared dataset which has been saved to disk. As far as I have understood, all batches are prepared over and over again when you restart a training session. Maybe a simple function that reads the batch folder and replaces the sample list with the files found there would be enough?

This idea is fantastic. :) I had something like that in mind already when implementing the seed for the temporary files in the batches directory. Therefore, it should be quite easy to implement: add an option to the Data IO class for specifying a seed. If files with the specified seed already exist, reuse them. I will open an issue and add this to my agenda.
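A minimal sketch of that reuse logic (hypothetical helper and file-name pattern; the actual naming of MIScnn's temporary batch files may differ):

import os
from glob import glob

def find_prepared_batches(batch_dir, seed):
    # Look for previously prepared batch files that carry the given seed
    # in their file name (the pattern is an assumption, not MIScnn's real one)
    pattern = os.path.join(batch_dir, "batch_*.{}.*.pickle".format(seed))
    return sorted(glob(pattern))

existing = find_prepared_batches("batches/", seed=1234)
if existing:
    sample_list = existing   # reuse the batches prepared in an earlier run
else:
    pass                     # fall back to preparing the batches again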

Cheers, Dominik

muellerdo avatar Jun 08 '20 10:06 muellerdo

This idea is fantastic. :) I had something like that in mind already when implementing the seed for the temporary files in the batches directory. Therefore, it should be quite easy to implement: add an option to the Data IO class for specifying a seed. If files with the specified seed already exist, reuse them. I will open an issue and add this to my agenda.

Sounds great :-) Let me know if you want me to contribute and help in some way!

MLRadfys avatar Jun 08 '20 11:06 MLRadfys