
Medical Imaging Deep Learning library to train and deploy 3D segmentation models on Azure Machine Learning

Results: 110 InnerEye-DeepLearning issues

Files get uploaded in the run_ml.py `register_model` function via `upload_folder`. Then the job gets pre-empted, starts again, and tries to upload the files again. At that point, it complains that the files already exist...
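
A minimal sketch of one way to make the upload idempotent across restarts, assuming the AzureML SDK v1 `Run.get_file_names` / `Run.upload_file` APIs; the helper name and target prefix below are hypothetical, not the existing `register_model` code:

```python
# Hypothetical helper: skip files that a previous, pre-empted attempt already
# uploaded to the run, so a restart does not fail on "file already exists".
from pathlib import Path
from azureml.core import Run

def upload_model_folder_once(run: Run, folder: Path, prefix: str = "final_model") -> None:
    already_uploaded = set(run.get_file_names())
    for file in folder.rglob("*"):
        if not file.is_file():
            continue
        target = f"{prefix}/{file.relative_to(folder).as_posix()}"
        if target in already_uploaded:
            continue  # uploaded before the job was pre-empted
        run.upload_file(name=target, path_or_stream=str(file))
```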

At the moment, we set batch size = 1 in the dataloaders when running inference for a classification model. https://github.com/microsoft/InnerEye-DeepLearning/blob/daefdba6083775de7ca258d18ae315e57bcb54bd/InnerEye/ML/model_testing.py#L428 [AB#3998](https://innereye.visualstudio.com/60ce1777-00d6-4015-82bc-488a0c00202f/_workitems/edit/3998)
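
A minimal sketch of what a configurable inference loader could look like; the function name and parameters are illustrative, not the existing InnerEye API:

```python
# Illustrative only: the loader takes its batch size from configuration instead
# of hard-coding batch_size=1.
from torch.utils.data import DataLoader, Dataset

def make_inference_loader(dataset: Dataset, batch_size: int = 1, num_workers: int = 0) -> DataLoader:
    return DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers)
```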

How can we synchronize files that are written during multi-node training?
* At the end of training, each node reads the file in question, turns it into a byte tensor
* ...
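
One possible approach is sketched below, assuming `torch.distributed` is already initialised (e.g. with the Gloo backend for CPU tensors): pad each node's byte tensor to a common length and `all_gather` it.

```python
# Sketch only: each rank reads its local file, pads the bytes to a common
# length, and all ranks gather every file's contents.
from typing import List

import torch
import torch.distributed as dist

def gather_file_bytes(path: str) -> List[bytes]:
    with open(path, "rb") as f:
        payload = torch.tensor(list(f.read()), dtype=torch.uint8)
    world_size = dist.get_world_size()
    # all_gather needs equal-sized tensors, so exchange lengths first and pad.
    length = torch.tensor([payload.numel()])
    lengths = [torch.zeros_like(length) for _ in range(world_size)]
    dist.all_gather(lengths, length)
    max_len = int(max(l.item() for l in lengths))
    padded = torch.zeros(max_len, dtype=torch.uint8)
    padded[: payload.numel()] = payload
    gathered = [torch.zeros(max_len, dtype=torch.uint8) for _ in range(world_size)]
    dist.all_gather(gathered, padded)
    return [bytes(t[: int(n.item())].tolist()) for t, n in zip(gathered, lengths)]
```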

Allow fine-tuning an existing model on a new dataset/task. This includes support for changing the architecture (e.g. swapping out the last layer) or freezing a set of weights. [AB#3921](https://innereye.visualstudio.com/60ce1777-00d6-4015-82bc-488a0c00202f/_workitems/edit/3921)
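
An illustrative PyTorch sketch of the two pieces, freezing existing weights and swapping the last layer; it assumes the model exposes its final layer as `model.fc`, which is an assumption and not an InnerEye convention:

```python
# Illustrative sketch, not an InnerEye API: freeze the existing weights and
# replace the final layer for a task with a different number of classes.
import torch.nn as nn

def prepare_for_finetuning(model: nn.Module, num_new_classes: int) -> nn.Module:
    for param in model.parameters():
        param.requires_grad = False              # freeze all pretrained weights
    in_features = model.fc.in_features           # assumes the last layer is `model.fc`
    model.fc = nn.Linear(in_features, num_new_classes)  # new head, trained from scratch
    return model
```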

`building_models.md` says that it is possible to recover a failed Hyperdrive crossval run, but this does not work.
```
File "innereye-deeplearning/InnerEye/ML/run_ml.py", line 224, in setup
    self.checkpoint_handler.download_recovery_checkpoints_or_weights(only_return_path=not is_global_rank_zero())
File "innereye-deeplearning/InnerEye/ML/utils/checkpoint_handling.py", line...
```

At present, running all the unit tests in WSL on my laptop takes 42 minutes. That is too long for me to run all the tests locally before pushing any...

In our test framework, especially for regression testing, it would be useful if we had a way of comparing Jupyter Notebooks. We could also usefully look at the data and text...
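
A rough sketch of such a comparison, assuming the `nbformat` package: compare two notebooks by their cell sources and text outputs while ignoring execution counts and metadata.

```python
# Sketch only: a notebook comparison based on cell sources and text outputs.
from typing import List

import nbformat

def notebook_text(path: str) -> List[str]:
    nb = nbformat.read(path, as_version=4)
    chunks = []
    for cell in nb.cells:
        chunks.append(cell.source)
        for output in cell.get("outputs", []):
            if "text" in output:
                chunks.append(output["text"])
    return chunks

def notebooks_match(expected: str, actual: str) -> bool:
    return notebook_text(expected) == notebook_text(actual)
```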

Currently, the baseline comparison configuration is logged at the beginning of the log file, each downloaded file is logged, and then the comparison tables are printed. Add extra logging...

Present behaviour: the loss is computed per GPU. Try out whether we can synchronize the tensors before computing the loss, so that it is computed over a larger effective batch size...
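
A rough sketch of that idea, assuming `torch.distributed` is initialised: gather logits and labels from all GPUs before computing the loss. Since `all_gather` does not propagate gradients, the local shard is re-inserted to keep this rank's contribution differentiable.

```python
# Sketch only: compute the loss over the full effective batch across GPUs.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def global_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    all_logits = [torch.zeros_like(logits) for _ in range(world_size)]
    all_labels = [torch.zeros_like(labels) for _ in range(world_size)]
    dist.all_gather(all_logits, logits)
    dist.all_gather(all_labels, labels)
    # all_gather returns detached tensors, so re-insert the local shard to keep
    # this rank's gradients flowing.
    all_logits[dist.get_rank()] = logits
    return F.cross_entropy(torch.cat(all_logits), torch.cat(all_labels))
```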

For example, supply a generic comparison method and a baseline run [AB#4106](https://innereye.visualstudio.com/60ce1777-00d6-4015-82bc-488a0c00202f/_workitems/edit/4106)