
Training time optimization

Open ricdodds opened this issue 2 years ago • 11 comments

As an interactive service, ImPartial should allow users to train the model on their datasets efficiently, ideally in under 5 minutes using a single GPU. This training time is measured using 100 epochs and ~4000 sample patches per iteration.

As of today, it takes around 15 minutes with the configuration mentioned above on a 4-GPU machine.

ricdodds avatar Oct 21 '22 20:10 ricdodds

As a first step, let's try to run all the transforms on GPU. After loading, you can use the EnsureType transform and specify the device to get the image pushed to the GPU.
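For illustration, a minimal sketch of that idea, assuming MONAI dictionary transforms and a CUDA device (the actual ImPartial pre-transform chain will differ):

import torch
from monai.transforms import Compose, LoadImaged, EnsureChannelFirstd, EnsureTyped, RandFlipd

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# After loading, EnsureTyped converts the data to a tensor and pushes it to the GPU,
# so the remaining (random) transforms operate on device tensors.
train_pre_transforms = Compose([
    LoadImaged(keys="image"),
    EnsureChannelFirstd(keys="image"),
    EnsureTyped(keys="image", device=device),
    RandFlipd(keys="image", prob=0.5, spatial_axis=0),
])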

4000 samples can be cached, so epochs 2 through 100 can run faster. Are you using a cached dataset?
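As a rough sketch of what a cached dataset could look like here (train_datalist and train_pre_transforms are placeholders, not the actual ImPartial objects):

from monai.data import CacheDataset, DataLoader

# Deterministic pre-transform outputs are computed once and kept in memory,
# so only the random transforms re-run on every epoch.
train_ds = CacheDataset(
    data=train_datalist,
    transform=train_pre_transforms,
    cache_rate=1.0,
    num_workers=4,
)
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)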

SachidanandAlle avatar Oct 24 '22 05:10 SachidanandAlle

Hi @SachidanandAlle, these are great suggestions. One issue is that the first of the train_pre_transforms, BlindSpotPatch, belongs to the original ImPartial pipeline and is implemented using ndarrays. So the function blind_spot_patch()

https://github.com/nadeemlab/ImPartial/blob/1b05f0124c4eaca1bd04112821624a39ffa4ad73/dataprocessing/dataloaders.py#L333

would need to be migrated to run on GPU. I guess we can give that a try. The rest of the transformations are part of MONAI, so they should be good to run on GPU, right?

Looking into CacheDataset now.

ricdodds avatar Oct 25 '22 15:10 ricdodds

Just for reference, this behavior matches the original ImPartial pipeline data loader https://github.com/nadeemlab/ImPartial/blob/e118ea827c021d9c10264b6beb348bf1c7d790ce/dataprocessing/dataloaders.py#L91 where, after generating the blind spot mask, it applies the list of transforms defined here https://github.com/nadeemlab/ImPartial/blob/1b05f0124c4eaca1bd04112821624a39ffa4ad73/impartial/Impartial_classes.py#L244-L251. This has to do with the way ImPartial computes the loss function, and it's not possible to generate this mask when loading the patches, but @gunjan-sh can confirm this.

ricdodds avatar Oct 25 '22 15:10 ricdodds

A major portion of the improvement comes from caching the loaded images; the random transforms can then happen over the images in memory.

Can you share a tiny dataset so I can run the training? I can help you optimize some of the steps; for that, help me run it the way it currently runs with a small dataset.

SachidanandAlle avatar Oct 25 '22 16:10 SachidanandAlle

Looking at the blind_spot_patch code, it's quite easy to convert it to use torch instead of numpy.
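As an illustration only, a torch-based sketch of the blind-spot idea (the shapes, replacement ratio, and neighbour window are assumptions, not the actual blind_spot_patch implementation):

import torch

def blind_spot_patch_torch(image: torch.Tensor, ratio: float = 0.95):
    # image: (C, H, W) tensor; pixels where rand > ratio become blind spots.
    c, h, w = image.shape
    mask = torch.rand(h, w, device=image.device) > ratio
    # Replace each blind-spot pixel with a random neighbour within a 5x5 window.
    dy = torch.randint(-2, 3, (h, w), device=image.device)
    dx = torch.randint(-2, 3, (h, w), device=image.device)
    ys = (torch.arange(h, device=image.device)[:, None] + dy).clamp(0, h - 1)
    xs = (torch.arange(w, device=image.device)[None, :] + dx).clamp(0, w - 1)
    neighbours = image[:, ys, xs]
    out = torch.where(mask, neighbours, image)
    return out, mask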

SachidanandAlle avatar Oct 25 '22 16:10 SachidanandAlle

I was able to configure the BasicTrainingTask to use a CacheDataset. https://github.com/nadeemlab/ImPartial/blob/1675df8100b581d1fa791dc96e075badd2bb2667/impartial/api/lib/configs/impartial.py#L83

Testing this out now

ricdodds avatar Oct 25 '22 18:10 ricdodds

@SachidanandAlle I have added a multichannel Vectra_WC_2CH_tiff dataset for you to try. Please let us know if you face any issues.

gunjan-sh avatar Oct 27 '22 17:10 gunjan-sh

@ricdodds can you share your run configuration? Number of patches per epoch? Batch size?

You mentioned:

This training time is measured using 100 epochs and ~4000 sample patches per iteration

In the end, what is the total number of records you have generated for training and validation (after all the pre-processing and data partitioning)? Is it 4000 samples per iteration? What is the total number of samples per epoch?

If it's 4000 per iteration, and if I assume you have 100 iterations per epoch and 4 GPUs, the total number of records for training is around 4000 x 100 x 4 = 1,600,000 (1.6M) 128x128 single/2-channel samples. That is roughly 50G float values x 4 bytes, i.e., around 200 GB of data. I don't think this is the case...

Even in the case of 4 GPUs (multi-GPU), the total number of samples is logged by monailabel, and then each GPU process prints how many records it is processing.

It will be helpful if you can attach the logs for your entire training run (at least from the beginning through a couple of epochs) for the case where it took around 15 minutes overall.

SachidanandAlle avatar Oct 28 '22 00:10 SachidanandAlle

Also, I notice that you are using UNET with base=64: https://github.com/nadeemlab/ImPartial/blob/main/impartial/api/lib/configs/init.py

For a 128x128 input size, do you really need 64? 16 or 32 is more than enough. Even for 3D deepgrow/deepedit models (196x196x196), based on my previous evaluation, 32 was good enough; I couldn't see much improvement from using 64.

Maybe you should try 16 and, if that's not great, then 32 instead of 64. That will reduce your network size and computation significantly.

And is there any specific reason you are not able to use the UNET/BasicUNET provided by the MONAI networks? https://github.com/Project-MONAI/MONAI/tree/dev/monai/networks/nets
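For example, a 2D BasicUNet from MONAI with a base of 32 instead of 64 could be a rough sketch of a drop-in (in_channels/out_channels here are placeholders, not the actual ImPartial head configuration):

import torch
from monai.networks.nets import BasicUNet

net = BasicUNet(
    spatial_dims=2,
    in_channels=2,                        # e.g. 2-channel Vectra input
    out_channels=4,                       # placeholder; depends on the ImPartial tasks
    features=(32, 32, 64, 128, 256, 32),  # first entry is the base; try 16 or 32
)

x = torch.randn(1, 2, 128, 128)
print(net(x).shape)  # torch.Size([1, 4, 128, 128])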

SachidanandAlle avatar Oct 28 '22 00:10 SachidanandAlle

As I suspected, the validation batch size is missing; the default value is 1.

Please pass the batch size for validation. Currently it's taking a lot of time since it's running one sample at a time.

https://github.com/nadeemlab/ImPartial/blob/main/impartial/api/lib/configs/impartial.py#L76

                "train_batch_size": self.iconfig.BATCH_SIZE,
                "val_batch_size": self.iconfig.BATCH_SIZE,

SachidanandAlle avatar Oct 28 '22 00:10 SachidanandAlle

This should solve some of the above problems... https://github.com/nadeemlab/ImPartial/pull/9

The following are the observations when tested on 1x A100 (40 GB) GPU + 64 GB RAM + Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz with 10 cores and 2 threads per core:

class Config_CH2(ImPartialConfig):
    def __init__(self):
        super().__init__(
            unet_base=64,
            BATCH_SIZE=128,
            n_channels=2,
            npatches_epoch=4096,
            classification_tasks={
                '0': {'classes': 1, 'rec_channels': [0,1], 'ncomponents': [2, 2]}
            }
        )

Current main branch: latency for every epoch is around 22 seconds.
After improvements: latency for every epoch is around 7 seconds.

The current default dataset is SmartCacheDataset.

To change the default behavior to CacheDataset, Option 1: you can set it for all training runs in your init config.

            config={
                "max_epochs": self.iconfig.EPOCHS,
                "train_batch_size": self.iconfig.BATCH_SIZE,
                "val_batch_size": self.iconfig.BATCH_SIZE,
                "dataset": "CacheDataset",
                "dataset_max_region": (10240, 10240),
                "npatches_epoch": self.iconfig.npatches_epoch,
                "dataset_limit": 0,
                "dataset_randomize": True,
                "early_stop_patience": self.iconfig.patience,
                "pretrained": True,
                "name": type(self.iconfig).__name__.lower()
            },

Option 2: change per train request.

    app.train(
        request={
            "model": "impartial",
            "max_epochs": 2,
            "dataset": "CacheDataset",
        },
    )

SachidanandAlle avatar Oct 28 '22 06:10 SachidanandAlle