Performance difference between `v0.9.3` <-> `v0.10.0`
🐛 Bug
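TorchMetrics runs at very different speeds in v0.9.3 and v0.10.0: with the same metrics, the profiled run takes 17.113 s in total on v0.9.3 and 85.772 s on v0.10.0.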
To Reproduce
Run the same metric calculation in environments with different TorchMetrics versions installed.
Results for v0.9.3:
| Action | Mean duration (s) | Num calls | Total time (s) | Percentage (%) |
|---|---|---|---|---|
| Total | - | 1042 | 17.113 | 100 |
| [LightningDataModule]SegmentationDataModule.prepare_data | 5.1728 | 1 | 5.1728 | 30.228 |
| run_training_epoch | 4.6598 | 1 | 4.6598 | 27.23 |
| [Strategy]SingleDeviceStrategy.validation_step | 0.28901 | 12 | 3.4681 | 20.266 |
| [LightningDataModule]SegmentationDataModule.setup | 2.2177 | 1 | 2.2177 | 12.959 |
| run_training_batch | 0.16763 | 10 | 1.6763 | 9.7954 |
| [LightningModule]Model.optimizer_step | 0.16661 | 10 | 1.6661 | 9.7361 |
| [TrainingEpochLoop].train_dataloader_next | 0.13849 | 10 | 1.3849 | 8.0928 |
| [Strategy]SingleDeviceStrategy.backward | 0.077044 | 10 | 0.77044 | 4.5021 |
| [Strategy]SingleDeviceStrategy.training_step | 0.056393 | 10 | 0.56393 | 3.2954 |
Results for v0.10.0:
| Action | Mean duration (s) | Num calls | Total time (s) | Percentage (%) |
|---|---|---|---|---|
| Total | - | 1042 | 85.772 | 100 |
| run_training_epoch | 62.396 | 1 | 62.396 | 72.747 |
| [Strategy]SingleDeviceStrategy.validation_step | 3.159 | 12 | 37.908 | 44.196 |
| run_training_batch | 3.6079 | 10 | 36.079 | 42.064 |
| [LightningModule]Model.optimizer_step | 3.6072 | 10 | 36.072 | 42.056 |
| [Strategy]SingleDeviceStrategy.training_step | 3.5262 | 10 | 35.262 | 41.112 |
| [LightningDataModule]SegmentationDataModule.prepare_data | 5.1832 | 1 | 5.1832 | 6.043 |
| [LightningDataModule]SegmentationDataModule.setup | 2.2608 | 1 | 2.2608 | 2.6358 |
| [TrainingEpochLoop].train_dataloader_next | 0.1303 | 10 | 1.303 | 1.5191 |
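To make the gap concrete, the per-action slowdown can be computed directly from the two profiler reports. This small sketch only copies the reported "Total time (s)" values for a few actions; the dictionary and function names are illustrative, not part of any library.

```python
# "Total time (s)" values copied from the two profiler reports in this issue.
v0_9_3 = {
    "Total": 17.113,
    "validation_step": 3.4681,
    "training_step": 0.56393,
    "optimizer_step": 1.6661,
    "prepare_data": 5.1728,
}
v0_10_0 = {
    "Total": 85.772,
    "validation_step": 37.908,
    "training_step": 35.262,
    "optimizer_step": 36.072,
    "prepare_data": 5.1832,
}


def slowdown(old, new):
    """Return the new/old time ratio for every action present in both reports."""
    return {name: new[name] / old[name] for name in old if name in new}


for name, ratio in slowdown(v0_9_3, v0_10_0).items():
    print(f"{name:>16}: {ratio:6.2f}x")
```

This gives roughly 5.01x overall, 10.93x for `validation_step`, 62.53x for `training_step` and 1.00x for `prepare_data`. The data hooks are essentially unchanged while the steps that update the metrics slow down by one to two orders of magnitude, which is consistent with the regression sitting in the metric computation rather than the data pipeline.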
Code sample
Since it's a private project, I can only share the names and configurations of the metrics used.
```python
import torchmetrics as tm

# In the LightningModule this was `self.num_classes` (5, as reported later in this thread).
num_classes = 5

metric_params = {
    "num_classes": num_classes,
    "average": None,
    "mdmc_average": "samplewise",
}

# For training:
train = tm.MetricCollection(
    {
        "IoU": tm.JaccardIndex(**metric_params),
        "DSC": tm.Dice(**metric_params),
    },
    prefix="Train/Seg/",
)

# For validation:
validation = tm.MetricCollection(
    {
        "IoU": tm.JaccardIndex(**metric_params),
        "DSC": tm.Dice(**metric_params),
        "Spec": tm.Specificity(**metric_params),
        "Sens": tm.Recall(**metric_params),
    },
    prefix="Val/Seg/",
)
```
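Because the project itself is private, a standalone timing harness makes it easier to compare the two versions on identical inputs. The sketch below is stdlib-only; `time_calls` is a hypothetical helper (not part of TorchMetrics) that reports the median wall-clock time of repeated calls after a few discarded warm-up calls.

```python
import statistics
import time
from typing import Callable


def time_calls(fn: Callable[[], object], repeats: int = 10, warmup: int = 2) -> float:
    """Return the median wall-clock time (seconds) of `repeats` calls to `fn`.

    Hypothetical helper for comparing library versions: `warmup` calls run
    first and are discarded so one-off setup costs are not measured.
    """
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


if __name__ == "__main__":
    # Stand-in workload; replace it with a metric update when benchmarking.
    print(f"{time_calls(lambda: sum(range(100_000))):.6f} s")
```

To time a metric, pass something like `lambda: metric.update(preds, target)` with tensors of the shapes reported in this thread, and run the same script once per TorchMetrics version. On a GPU, end `fn` with `torch.cuda.synchronize()`, since CUDA kernels run asynchronously and would otherwise not be fully counted.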
Expected behavior
Metric computation in v0.10.0 should be roughly as fast as in v0.9.3.
Environment
- TorchMetrics version: v0.9.3 and v0.10.0
- Python version: 3.8.10
- PyTorch version: 1.12.1+cu116
- OS: Linux 5.15.0-1018-gcp #24~20.04.1-Ubuntu SMP Mon Sep 12 06:14:01 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
I'm seeing something similar, particularly with JaccardIndex.
Hi @omerferhatt and @tayden,
Could one of you provide the precise configuration you are using (`num_classes`) and the shape of a typical input to the metric?
I just want to know whether you are running into an edge case we had not thought about before :]
Absolutely, @SkafteNicki:
- `num_classes=5`
- `preds`: shape `(8, 5, 256, 256)`, `dtype=float32`, `min=0`, `max=1` (softmax output)
- `target`: shape `(8, 256, 256)`, `dtype=int64` (multi-class labels)
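For reference, with `average=None` the Jaccard index is reported per class as IoU_c = TP_c / (TP_c + FP_c + FN_c), which can be read off a `num_classes × num_classes` confusion matrix. The sketch below illustrates only that arithmetic on flattened label lists; it assumes the softmax `preds` have already been reduced by an argmax over the class dimension (dim 1), scores classes absent from both inputs as 0.0 (an assumption of this sketch), and is not TorchMetrics' actual implementation.

```python
from typing import List, Sequence


def confusion_matrix(preds: Sequence[int], target: Sequence[int], num_classes: int) -> List[List[int]]:
    """cm[t][p] counts pixels whose true class is t and predicted class is p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(preds, target):
        cm[t][p] += 1
    return cm


def per_class_iou(preds: Sequence[int], target: Sequence[int], num_classes: int) -> List[float]:
    """Per-class IoU = TP / (TP + FP + FN); 0.0 for classes absent from both inputs."""
    cm = confusion_matrix(preds, target, num_classes)
    scores = []
    for c in range(num_classes):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                 # true class c, predicted as another class
        fp = sum(row[c] for row in cm) - tp  # predicted c, true class is another class
        denom = tp + fp + fn
        scores.append(tp / denom if denom else 0.0)
    return scores


# Toy example: 6 "pixels", 3 classes, labels already argmax'd from the softmax output.
preds = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(per_class_iou(preds, target, num_classes=3))  # → [0.3333333333333333, 0.6666666666666666, 0.5]
```

A per-pixel Python loop like this is far too slow for `(8, 256, 256)` targets and only shows what is being counted; a vectorized version would typically build the same matrix with a single `torch.bincount` over `target * num_classes + preds`.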
Similar to @omerferhatt, I have:
- `iou_metric = JaccardIndex(num_classes=3, ignore_index=2, average="none")`
- `probs`: shape `(2, 3, 512, 512)`, `dtype=torch.float32` (softmax outputs)
- `target`: shape `(2, 512, 512)`, `dtype=torch.uint8` (multi-class labels with values 0, 1, or 2)
I am seeing an extreme slowdown with `MatthewsCorrCoef` too: what used to take less than a second now takes 10 minutes! Reverting to 0.9.0 or 0.8.2 restores the old speed.