ImPartial
Training time optimization
As an interactive service, ImPartial should allow users to train the model on their datasets efficiently, ideally in under 5 minutes on a single GPU. This training time is measured using 100 epochs and ~4000 sample patches per iteration.
As of today, it takes around 15 minutes with the configuration mentioned above on a 4-GPU machine.
As a first step, let's try to run all the transforms on the GPU. After loading, you can use the EnsureType transform and specify the device to push the image to the GPU.
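A minimal sketch of that idea, assuming MONAI's dict-based transforms and a hypothetical "image" key (the actual ImPartial transform chain is longer):

import torch
from monai.transforms import Compose, LoadImaged, EnsureChannelFirstd, EnsureTyped

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_pre_transforms = Compose([
    LoadImaged(keys="image"),
    EnsureChannelFirstd(keys="image"),
    # EnsureTyped converts the data to torch.Tensor and moves it to `device`,
    # so every transform after this point operates on GPU tensors.
    EnsureTyped(keys="image", device=device),
    # ... GPU-capable random transforms would follow here ...
])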
The 4000 samples can be cached, so epochs 2 through 100 can run faster. Are you using a CacheDataset?
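For context, a minimal sketch of MONAI's CacheDataset used on its own (outside MONAI Label); the file names and transforms below are placeholders:

from monai.data import CacheDataset, DataLoader
from monai.transforms import Compose, LoadImaged, EnsureChannelFirstd, RandSpatialCropd

data = [{"image": "patch_0001.tiff"}, {"image": "patch_0002.tiff"}]  # placeholder records

transforms = Compose([
    LoadImaged(keys="image"),           # deterministic: cached once before training
    EnsureChannelFirstd(keys="image"),  # deterministic: cached
    RandSpatialCropd(keys="image", roi_size=(128, 128), random_size=False),  # random: re-run every epoch
])

# With cache_rate=1.0, all deterministic results stay in RAM, so epochs 2..N
# only pay for the random transforms and the forward/backward pass.
dataset = CacheDataset(data=data, transform=transforms, cache_rate=1.0, num_workers=4)
loader = DataLoader(dataset, batch_size=128, shuffle=True)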
Hi @SachidanandAlle, these are great suggestions. One issue is that the first of the train_pre_transforms, BlindSpotPatch, belongs to the original ImPartial pipeline and is implemented using NumPy ndarrays. So the function blind_spot_patch()
https://github.com/nadeemlab/ImPartial/blob/1b05f0124c4eaca1bd04112821624a39ffa4ad73/dataprocessing/dataloaders.py#L333
would need to be migrated to run on the GPU. I guess we can give that a try. The rest of the transformations are part of MONAI, so they should be good to run on the GPU, right?
Looking into CacheDataset now.
Just for reference, this behavior matches the original ImPartial pipeline data loader https://github.com/nadeemlab/ImPartial/blob/e118ea827c021d9c10264b6beb348bf1c7d790ce/dataprocessing/dataloaders.py#L91 where, after generating the blind spot mask, it applies the list of transforms defined here: https://github.com/nadeemlab/ImPartial/blob/1b05f0124c4eaca1bd04112821624a39ffa4ad73/impartial/Impartial_classes.py#L244-L251
This has to do with the way ImPartial computes the loss function; it's not possible to generate this mask when loading the patches, but @gunjan-sh can confirm this.
A major portion of the improvement comes from caching the loaded images; the random transforms can then run over the images in memory.
Can you share a tiny dataset so I can run the training? I can help you optimize some of the steps; for that, help me run it the way it currently runs, with a small dataset.
Looking at the blind_spot_patch code, it should be fairly easy to convert it to use torch instead of numpy.
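As an illustrative sketch only (the actual logic of blind_spot_patch() is not reproduced here), the general pattern is to replace NumPy random sampling with torch operations that stay on the input tensor's device:

import torch

def random_blind_spot_mask(image: torch.Tensor, ratio: float = 0.01) -> torch.Tensor:
    """Hypothetical helper: select roughly `ratio` of the pixels as blind spots."""
    # torch.rand_like creates the random field on the same device as `image`,
    # so the mask generation runs on the GPU with no host<->device copies.
    return torch.rand_like(image, dtype=torch.float32) < ratio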
Was able to configure the BasicTrainingTask to use a CacheDataset. https://github.com/nadeemlab/ImPartial/blob/1675df8100b581d1fa791dc96e075badd2bb2667/impartial/api/lib/configs/impartial.py#L83
Testing this out now
@SachidanandAlle I have added a multichannel Vectra_WC_2CH_tiff dataset for you to try. Please let us know if you face any issues.
@ricdodds can you share your run configuration? Number of patches per epoch? Batch size?
You mentioned:
This training time is measured using 100 epochs and ~4000 sample patches per iteration
In the end, what is the total number of records you have generated for training and validation (after all pre-processing and data partitioning)? Is it 4000 samples per iteration? What is the total number of samples per epoch?
If it's 4000 per iteration, and I assume you have 100 iterations per epoch and 4 GPUs, the total number of records for training is around 4000 x 100 x 4 = 1,600,000 (1.6M) 128x128 single/2-channel samples - roughly 50G values x 4 bytes (float), i.e. on the order of 200 GB of data. I don't think this is the case...
Even in the 4-GPU (multi-GPU) case, the total number of samples is logged by monailabel, and then each GPU process prints how many records it is processing.
It will be helpful if you can attach the logs for the entire training run (at least from the beginning through a couple of epochs) where it took around 15 minutes overall.
Also, I notice that you are using UNET with base=64: https://github.com/nadeemlab/ImPartial/blob/main/impartial/api/lib/configs/__init__.py
For a 128x128 input size, do you really need 64? 16 or 32 is more than enough. Even for the 3D deepgrow/deepedit models (196x196x196), based on my previous evaluation, 32 was good enough; I couldn't see much improvement from using 64.
Maybe you should try 16 and, if that's not great, then 32 instead of 64. That will reduce your network size and computation significantly.
And is there any specific reason you are not able to use the UNET/BasicUNET provided by MONAI networks? https://github.com/Project-MONAI/MONAI/tree/dev/monai/networks/nets
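For illustration, a minimal sketch of MONAI's 2D UNet with a base of 16; the in/out channel counts below are assumptions and would have to match ImPartial's config:

from monai.networks.nets import UNet

net = UNet(
    spatial_dims=2,
    in_channels=2,                    # e.g. a 2-channel Vectra input (assumption)
    out_channels=4,                   # depends on ImPartial's output heads (assumption)
    channels=(16, 32, 64, 128, 256),  # base=16 instead of 64
    strides=(2, 2, 2, 2),
    num_res_units=2,
)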
As I suspected, the validation batch size is missing; the default value is 1.
Please pass the batch size for validation. Currently it's taking a lot of time because it's running one sample at a time.
https://github.com/nadeemlab/ImPartial/blob/main/impartial/api/lib/configs/impartial.py#L76
"train_batch_size": self.iconfig.BATCH_SIZE,
"val_batch_size": self.iconfig.BATCH_SIZE,
This should solve some of the above problems... https://github.com/nadeemlab/ImPartial/pull/9
The following are the observations when tested on 1x A100 (40 GB) GPU + 64 GB RAM + Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz (10 cores, 2 threads per core):
class Config_CH2(ImPartialConfig):
def __init__(self):
super().__init__(
unet_base=64,
BATCH_SIZE=128,
n_channels=2,
npatches_epoch=4096,
classification_tasks={
'0': {'classes': 1, 'rec_channels': [0,1], 'ncomponents': [2, 2]}
}
)
Current main branch: latency for every epoch is around 22 seconds.
After improvements: latency for every epoch is around 7 seconds.
The current default dataset is SmartCacheDataset.
To change the default behavior to CacheDataset, Option 1: set it for all runs in your init:
config={
"max_epochs": self.iconfig.EPOCHS,
"train_batch_size": self.iconfig.BATCH_SIZE,
"val_batch_size": self.iconfig.BATCH_SIZE,
"dataset": "CacheDataset",
"dataset_max_region": (10240, 10240),
"npatches_epoch": self.iconfig.npatches_epoch,
"dataset_limit": 0,
"dataset_randomize": True,
"early_stop_patience": self.iconfig.patience,
"pretrained": True,
"name": type(self.iconfig).__name__.lower()
},
Option 2: change it per train request:
app.train(
request={
"model": "impartial",
"max_epochs": 2,
"dataset": "CacheDataset",
},
)