
Fix train metrics

Open VukW opened this issue 1 year ago • 14 comments

The PR addresses a few important bugs in how metrics are calculated. For now the fixes appear to work for binary classification tasks, but they are not guaranteed for regression/segmentation/multiclass classification (and would most likely fail there).

Proposed Changes

Checklist

  • [ ] CONTRIBUTING guide has been followed.
  • [ ] PR is based on the current GaNDLF master.
  • [ ] Non-breaking change (does not break existing functionality): provide as many details as possible for any breaking change.
  • [ ] Function/class source code documentation added/updated (ensure typing is used to provide type hints, including and not limited to using Optional if a variable has a pre-defined value).
  • [ ] Code has been blacked for style consistency and linting.
  • [ ] If applicable, version information has been updated in GANDLF/version.py.
  • [ ] If adding a git submodule, add to list of exceptions for black styling in pyproject.toml file.
  • [ ] Usage documentation has been updated, if appropriate.
  • [ ] Tests added or modified to cover the changes; if coverage is reduced, please give explanation.
  • [ ] If customized dependency installation is required (i.e., a separate pip install step is needed for PR to be functional), please ensure it is reflected in all the files that control the CI, namely: python-test.yml, and all docker files [1,2,3].

VukW avatar May 15 '24 00:05 VukW

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

github-actions[bot] avatar May 15 '24 00:05 github-actions[bot]

Converting to draft until the tests are passing.

sarthakpati avatar May 19 '24 03:05 sarthakpati

@sarthakpati It seems to me I have fixed the code for the usual segmentation cases. However, I found that the code is essentially broken for some specific architectures (deep_* and sdnet), where the model returns a list of tensors instead of a single Tensor. The reason is that we strongly assume the output is a Tensor (for example, when we aggregate segmentation results during the validation step). In the master branch the code does not fail, but it calculates the validation metrics in the wrong way:

  • we take just the first element of the list (one tensor);
  • AFAIU for sdnet this works only if batch_size >= 5 (since the prediction here is BxCxHxWxD, we take not the first piece of the output but the prediction of the first batch element), and even then not properly;
  • for the deep_* models the prediction here is also BxCxHxWxD and not a list of tensors.

I can't see right now how to fix that easily without massive refactoring, so I put the crutch back (at least the code doesn't fail now). On the one hand, train metrics are calculated by averaging per sample and are therefore computed properly; on the other hand, validation metrics are broken. I'd strongly prefer to disable/remove these list-output architectures from GaNDLF for now; what do you think?
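
For illustration, a minimal sketch of the ambiguity described above (hypothetical shapes, not GaNDLF's actual aggregation code): indexing the model output with [0] means different things depending on whether the model returns a plain tensor or a list of tensors.

```python
import torch

batch_size, n_classes, h, w, d = 8, 4, 16, 16, 16

# A "plain" model returns a single BxCxHxWxD tensor.
plain_output = torch.rand(batch_size, n_classes, h, w, d)

# A deep-supervision / sdnet-style model returns a list of such tensors
# (e.g., one prediction per output head).
list_output = [torch.rand(batch_size, n_classes, h, w, d) for _ in range(5)]

# Taking element [0] means two different things:
print(plain_output[0].shape)  # torch.Size([4, 16, 16, 16])    -> first *batch element*
print(list_output[0].shape)   # torch.Size([8, 4, 16, 16, 16]) -> first *output head*

# Code that assumes a single tensor therefore runs without errors in both cases,
# but can silently compute metrics from the wrong slice of the prediction.
```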

VukW avatar May 21 '24 22:05 VukW

@szmazurek - can you confirm if the BraTS training is working for you?

sarthakpati avatar Jun 04 '24 15:06 sarthakpati

> @szmazurek - can you confirm if the BraTS training is working for you?

Re-launched the training after pulling yesterday's merge by @Geeks-Sid. Will keep you updated if it runs, keeping the rest the same.

szmazurek avatar Jun 04 '24 17:06 szmazurek

> @szmazurek - can you confirm if the BraTS training is working for you?
>
> Re-launched the training after pulling yesterday's merge by @Geeks-Sid. Will keep you updated if it runs, keeping the rest the same.

Still negative. The output:

Looping over training data:   0%|          | 0/6255 [02:51<?, ?it/s]
ERROR: Traceback (most recent call last):
  File "/net/tscratch/people/plgmazurekagh/gandlf/gandlf_env/bin/gandlf_run", line 126, in <module>
    main_run(
  File "/net/tscratch/people/plgmazurekagh/gandlf/gandlf_env/lib/python3.10/site-packages/GANDLF/cli/main_run.py", line 92, in main_run
    TrainingManager_split(
  File "/net/tscratch/people/plgmazurekagh/gandlf/gandlf_env/lib/python3.10/site-packages/GANDLF/training_manager.py", line 173, in TrainingManager_split
    training_loop(
  File "/net/tscratch/people/plgmazurekagh/gandlf/gandlf_env/lib/python3.10/site-packages/GANDLF/compute/training_loop.py", line 445, in training_loop
    epoch_train_loss, epoch_train_metric = train_network(
  File "/net/tscratch/people/plgmazurekagh/gandlf/gandlf_env/lib/python3.10/site-packages/GANDLF/compute/training_loop.py", line 171, in train_network
    total_epoch_train_metric[metric] += metric_val
ValueError: operands could not be broadcast together with shapes (4,) (3,) (4,) 
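
For reference, the broadcast failure above can be reproduced in isolation; a minimal sketch with hypothetical values (not the actual GaNDLF code), where a 3-element per-label metric is added into a 4-element accumulator:

```python
import numpy as np

# Accumulator sized for all labels (4 in this example).
total_epoch_train_metric = {"dice_per_label": np.zeros(4)}

# Per-batch metric that came back with only 3 entries (one label missing).
metric_val = np.array([0.81, 0.76, 0.90])

# Raises: ValueError: operands could not be broadcast together with shapes (4,) (3,) (4,)
total_epoch_train_metric["dice_per_label"] += metric_val
```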

I am using the flexinet cosine-annealing config as provided for BraTS 2021.

@sarthakpati @VukW @Geeks-Sid any ideas? Did you maybe succeed?

szmazurek avatar Jun 04 '24 18:06 szmazurek

@szmazurek Can you please show the exact config you're using? I'm not familiar with the BraTS challenge :)

VukW avatar Jun 05 '24 14:06 VukW

@szmazurek - I would assume I will also get the same error. I am currently running other jobs so I don't have any free slots to queue up any other training.

sarthakpati avatar Jun 05 '24 14:06 sarthakpati

Hey all, apparently commenting out one parameter in the config made it work! The problem was the metric option dice_per_label; with it commented out, there was no error. I will look into that further. I also sent you the example config via Gmail, @VukW.

szmazurek avatar Jun 06 '24 07:06 szmazurek

> Hey all, apparently commenting out one parameter in the config made it work! The problem was the metric option dice_per_label; with it commented out, there was no error. I will look into that further. I also sent you the example config via Gmail, @VukW.

This will need more investigation - can you please open a new issue to track it?

sarthakpati avatar Jun 06 '24 13:06 sarthakpati

@sarthakpati @szmazurek I caught the bug with @szmazurek's model config. The issue is that when ignore_label_validation is given in the model config, the metric for that specific label is not evaluated, so the metrics output has fewer entries than I assumed (N_CLASSES). Fix: https://github.com/mlcommons/GaNDLF/pull/868/commits/d0d25fbbc91d7f3ae3235a0b3e80ada65d1f8787. Now it works for me.
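
For illustration, a hypothetical sketch of the mismatch and of one way to keep the shapes consistent (the helper name and the padding strategy are assumptions for this example, not the code from the linked commit):

```python
import numpy as np

def accumulate_per_label_metric(accumulator, metric_val, ignore_label=None):
    """Hypothetical helper: add a per-label metric into the running total,
    re-inserting a placeholder for a label skipped during evaluation."""
    metric_val = np.atleast_1d(np.asarray(metric_val, dtype=float))
    if ignore_label is not None and metric_val.size == accumulator.size - 1:
        # The ignored label was dropped from the per-batch result, so the
        # shapes no longer match; pad it back with a placeholder value.
        metric_val = np.insert(metric_val, ignore_label, 0.0)
    return accumulator + metric_val

accumulator = np.zeros(4)                 # metrics expected for 4 labels
per_batch = np.array([0.81, 0.76, 0.90])  # label 0 ignored -> only 3 values
print(accumulate_per_label_metric(accumulator, per_batch, ignore_label=0))
# [0.   0.81 0.76 0.9 ]
```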

VukW avatar Jun 06 '24 14:06 VukW

@szmazurek can you confirm the fix on your end?

sarthakpati avatar Jun 06 '24 19:06 sarthakpati

@sarthakpati On it, training scheduled. Thanks @VukW for tackling that!

szmazurek avatar Jun 06 '24 20:06 szmazurek

> @sarthakpati On it, training scheduled. Thanks @VukW for tackling that!

Hey, my initial tests failed; it turned out that the error spotted by @VukW was also present in the validation and test loops. I corrected it and successfully completed an entire training epoch; the changes are applied in commit 5148e86.

EDIT: I have also now started the training with the config exactly as you sent it to me, @sarthakpati; I will keep you posted on the results.

szmazurek avatar Jun 07 '24 13:06 szmazurek

@VukW - I think the CLA bot is complaining because of https://github.com/mlcommons/GaNDLF/pull/868/commits/5148e86f72a573b2af1b1be9a62fb965d76b1dd4 ... Can you please remove this?

sarthakpati avatar Jul 16 '24 01:07 sarthakpati

😁 What a crutch, @sarthakpati: I overrode the branch's commit history, making myself the commit author (sorry, @szmazurek), so the failing check should be fixed now. But isn't it strange that Szymon's CLA agreement was lost?

VukW avatar Jul 18 '24 11:07 VukW

Thanks!

> But isn't it strange that Szymon's CLA agreement was lost?

Actually, I think he submitted the PR from a machine where git was improperly configured: his username for that commit was registered as Mazurek, Szymon instead of szmazurek, which resulted in the failed CLA check. This usually happens because the initial git setup asks for a full name where it should be asking for the username, followed by the email.

sarthakpati avatar Jul 18 '24 14:07 sarthakpati

Multiple experiments have shown the validity of this PR:

  1. 2D Histology binary segmentation: (image attached)

  2. 3D Radiology multi-class segmentation: (image attached)

Merging this PR in; subsequent issues will be addressed in follow-up PRs.

sarthakpati avatar Jul 19 '24 14:07 sarthakpati