Deep_Openset_Recognition_through_Uncertainty
About using multiple GPUs to train the entire model
When I try to use several GPUs to train the whole model, the output_sample's shape turns into [num_GPUs, batch_size / num_GPUs, class_num]. I wonder why the model's output has changed?
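For context, a minimal toy reproduction of the same shape change (a hypothetical stand-in, not the repository's actual model): nn.DataParallel splits the batch across the visible GPUs and concatenates the per-replica outputs along dim 0, so an output that carries a leading sample dimension comes back stacked per GPU.

```python
import torch
import torch.nn as nn

# Hypothetical toy module whose output has an extra leading "sample" dimension,
# similar in spirit to drawing a single sample from an approximate posterior.
class Toy(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.fc(x).unsqueeze(0)  # shape [1, batch, num_classes]

model = nn.DataParallel(Toy()).cuda()
out = model(torch.randn(64, 32).cuda())
# On a single GPU this prints [1, 64, 10]; with e.g. 2 GPUs DataParallel
# concatenates the per-replica outputs along dim 0 and prints [2, 32, 10],
# i.e. [num_GPUs, batch_size / num_GPUs, num_classes].
print(out.shape)
```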
Hey, I actually don't really have any idea what happened, but I am able to reproduce this issue. The code is running completely fine if I only use 1 GPU (e.g. by setting CUDA_VISIBLE_DEVICES=ID with only one ID specified in the terminal).
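For example, CUDA_VISIBLE_DEVICES=0 python3 main.py --dataset FashionMNIST --dropout 0.2 restricts the run to the first GPU.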
Unfortunately I don't really have a lot of time right now to debug this in detail. Given that the code runs just fine with single GPU and the memory requirements aren't all too large, can you just run on a single GPU for now?
Alternatively I would appreciate any help in debugging. I likely won't get around to it myself for the next 1-2 weeks as I have a lot going on right now.
Thanks for your help. When I evaluate the model I trained on my GPU, there is always some OOM error, so I want to train and evaluate the model on multiple GPUs. I wonder if there is another method to evaluate the model. Can I just turn this model into a classifier model?
Hey. The answer to your question depends on what you mean by "evaluate" your model.
If by evaluate you mean you only want to get the classification performance after training is done, then you no longer need the decoder and can technically discard it (saving about half the memory requirement). Note that you do however need the decoder during training as it is a major influence on the latent space and the reason you are approximating the data distribution in the first place.
That said, if you were able to train the model on your GPU, then there is no reason you cannot evaluate it as well. Remember that in evaluation there is no dependency on the mini-batch size at all, as there are no longer any stochastic gradient descent updates. So you can just set it to whatever low value the hardware on which you evaluate the model can handle.
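As a rough illustration of that point (toy code with made-up module names and sizes, not the classes from this repository), the decoder can simply be dropped after training and classification can then be run under no_grad with an arbitrarily small batch:

```python
import torch
import torch.nn as nn

# Toy stand-in for a joint VAE/classifier; names and dimensions are made up.
class JointModel(nn.Module):
    def __init__(self, num_classes=10, latent_dim=60):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, latent_dim))
        self.classifier = nn.Linear(latent_dim, num_classes)
        self.decoder = nn.Linear(latent_dim, 28 * 28)  # only needed during training

model = JointModel()
model.decoder = None   # discard the decoder after training: roughly halves the memory
model.eval()

with torch.no_grad():  # no gradients or optimizer state at evaluation time
    x = torch.randn(1, 1, 28, 28)               # batch size can be as small as 1
    logits = model.classifier(model.encoder(x))
print(logits.shape)    # torch.Size([1, 10])
```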
I met the same problem when I evaluated the third model proposed in the paper. Why do I have enough memory for training but not enough for evaluation?
Hey @yechenzhi, there could be two reasons why you are experiencing difficulty in the evaluation:
- The model that makes use of variational inference is typically trained with only 1 sample from the approximate posterior. In our evaluation script we give the option to also investigate uncertainty with a larger number of samples, and to re-calculate the entire encoder/decoder with random dropout (as an approximation to weight uncertainty). Both of these require some more memory.
- The evaluation script also calculates our Weibull EVT fits for which it needs to accumulate latent means. This consumes some extra memory that might just tip your specific hardware over the edge.
In either case, neither of the above should be a fundamental issue, as we do not permanently store results on the GPU but move them to the CPU after calculation. While the batch size matters for training, in evaluation it only determines how many images are processed in parallel to speed things up, so you can lower the "batch-size" to as little as 1, or to whatever works for your hardware.
I hope this helps, let me know if you have any further questions.
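For example, something along these lines should keep evaluation memory low (the batch-size flag name is assumed to match the training script, check python3 eval_openset.py --help): 'python3 eval_openset.py --resume <path/to/model> --batch-size 1'.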
@MrtnMndt @whySnowwW @yechenzhi Hi, when I use the command 'python3 eval_openset.py --resume <path/to/model> --openset-datasets FashionMNIST,AudioMNIST,CIFAR10,CIFAR100,SVHN' to evaluate the open set datasets, I got this error:
It means that the trained model was not saved, right? So I wonder how to save it correctly. During the training stage I used the command 'python3 main.py --dataset FashionMNIST --dropout 0.2 --openset-datasets SVHN --resume path/' to train the model and got the same error. Is that the right command? If not, what should it be?
I would appreciate it very much if any of you could take a few minutes to help and answer. Thanks!
Hey @Alabenba ,
the --resume command is used when you already have a model that you want to evaluate or continue training.
If you want to train a model, simply use python3 main.py without any extra flags for open-set data or resumed models:
python3 main.py --dataset FashionMNIST --dropout 0.2
Here, you can add further options such as:
- the choice of architecture: -a WRN
- whether training is variational: --train-var True
- whether it is a VAE with a classifier (i.e. a classifier with a decoder): --joint True
and so on.
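Combining these, a full command would look like: python3 main.py --dataset FashionMNIST --dropout 0.2 -a WRN --train-var True --joint True, which trains a variational WRN with a joint classifier and decoder.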
Does this help? Let me know if you have any further questions.
@MrtnMndt Thanks so much for your timely reply.
I trained the WRN the way you described, but when I run eval_openset.py with 'python3 eval_openset.py --resume <path/to/model> --openset-datasets FashionMNIST,AudioMNIST,CIFAR10,CIFAR100,SVHN', an error occurred.
What's the problem? And how can I fix it?
@Alabenba You will need to specify the path to your trained and saved model with --resume in order to evaluate a specific model. Your error message asserts that the specified path is empty.
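If in doubt, a quick generic check (plain Python, not repo-specific) that the path you pass to --resume actually points at the checkpoint file saved during training:

```python
import os

path = "path/to/model"       # replace with the file your training run saved
print(os.path.isfile(path))  # should print True before running eval_openset.py
```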