Issue during training for multiple classes
Hi Sir. We ran into an issue while trying to train the model on multiple classes. We modified the sfd.yaml file based on the voxel_rcnn config (please find the sfd.yaml file attached in text format). We received the following error message:
Traceback (most recent call last):
File "train.py", line 200, in
Do you have a solution, please? We started with a batch size of 1, but the model could not perform batch normalization, so we increased the batch size to 2. Also, we have only a single GPU.
Hi, the current version of our code seems to support only 1 sample per GPU. Could you show the error message you get when training with a batch size of 1?
Thank you for your answer. Attached you can find the traceback from training with a batch size of 1. message (6).txt
@LittlePey Does the above traceback error help you to understand the issue? Is there a solution? Thank you in advance.
@LittlePey Any idea?
Hi, this is a bug in the SFD code that occurs when there are no pseudo points in any RoI. When we hit it, we sometimes just resume from the latest checkpoint and the error disappears.
Hi @LittlePey and thank you for your answer. To be honest, I didn't understand it, so let me restate the issue. We tried to train SFD on multiple classes with a batch size of 1 on a single-GPU machine, and we received the attached error. (We also tried larger batch sizes, but as you mentioned, the current code seems to support only 1 sample per GPU.) How can we resume from the latest checkpoint if training doesn't start at all? Is there a solution? What do you think? Thank you in advance. message.6.txt
Hi, we have not encountered a case where training doesn't start at all. Maybe you can skip the forward and backward pass for a batch when this situation happens.
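For anyone trying the suggestion above, here is a minimal, framework-agnostic sketch of skipping a training step when it fails. It is not the SFD code: `train_one_epoch`, `toy_train_step`, and the `num_pseudo_points` key are hypothetical names used only for illustration, and the assumption that the failure surfaces as a `RuntimeError` is a guess that should be checked against the actual traceback. In a real PyTorch loop, the step function would wrap `optimizer.zero_grad()`, the forward pass, `loss.backward()`, and `optimizer.step()`.

```python
def train_one_epoch(batches, train_step, skippable=(RuntimeError,)):
    """Run train_step on each batch, skipping batches whose step raises.

    Returns the losses of the completed steps and the indices of the
    skipped batches, so the skip rate can be monitored.
    """
    losses = []
    skipped = []
    for index, batch in enumerate(batches):
        try:
            # In PyTorch this would be: zero_grad, forward, backward, step.
            loss = train_step(batch)
        except skippable:
            # Assumption: the failing batch can simply be dropped.
            skipped.append(index)
            continue
        losses.append(loss)
    return losses, skipped


def toy_train_step(batch):
    """Hypothetical stand-in for one training step."""
    if batch["num_pseudo_points"] == 0:
        # Mimics the reported failure: no pseudo points in any RoI.
        raise RuntimeError("no pseudo points in any RoI")
    return 1.0 / batch["num_pseudo_points"]


batches = [
    {"num_pseudo_points": 4},
    {"num_pseudo_points": 0},  # this batch is skipped
    {"num_pseudo_points": 2},
]
losses, skipped = train_one_epoch(batches, toy_train_step)
print(losses, skipped)
```

Catching only a narrow set of exception types keeps unrelated bugs visible. If most batches end up in `skipped`, the cause is more likely the configuration than the occasional empty RoI.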
Hi again @LittlePey . Any update on this? Is there a way to configure the training procedure with batch size of 2?
This bug appears before the first epoch finishes, so how can I resume from the latest checkpoint?