
Issue during training for multiple classes

Open NNtamp opened this issue 3 years ago • 9 comments

Hi. We ran into an issue while trying to train the model on multiple classes. We modified the sfd.yaml file based on the voxel_rcnn config (please find the sfd.yaml file attached in text format). We received the following error message:

Traceback (most recent call last):
  File "train.py", line 200, in <module>
    main()
  File "train.py", line 155, in main
    train_model(
  File "/workspace/SFD/tools/train_utils/train_utils.py", line 86, in train_model
    accumulated_iter = train_one_epoch(
  File "/workspace/SFD/tools/train_utils/train_utils.py", line 19, in train_one_epoch
    batch = next(dataloader_iter)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/workspace/SFD/pcdet/datasets/kitti/kitti_dataset_sfd.py", line 517, in collate_batch
    ret[key] = np.stack(val, axis=0)
  File "<__array_function__ internals>", line 5, in stack
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/shape_base.py", line 427, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape

Do you have any solution, please? We initially tried a batch size of 1, but the model couldn't perform batch normalization, so we increased the batch size to 2. Also, we have a single GPU.

sfd_yaml_file.txt
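
For reference, the ValueError above comes from np.stack in collate_batch being handed per-sample arrays of different shapes once batch_size > 1. Below is a minimal sketch of padding such arrays to a common shape before stacking; pad_and_stack and the sample shapes are illustrative only, not part of the SFD code, and whether zero-padding is semantically valid depends on how the affected batch key is consumed downstream.

import numpy as np

def pad_and_stack(arrays):
    # Pad a list of same-rank arrays to a common shape, then stack them
    # along a new batch axis. Assumes the arrays differ only in size
    # (e.g. a variable number of points per sample).
    max_shape = np.max([a.shape for a in arrays], axis=0)
    padded = []
    for a in arrays:
        pad_width = [(0, int(m) - s) for s, m in zip(a.shape, max_shape)]
        padded.append(np.pad(a, pad_width, mode='constant', constant_values=0))
    return np.stack(padded, axis=0)

# Example: two samples with a different number of points.
sample_a = np.zeros((100, 4), dtype=np.float32)
sample_b = np.zeros((80, 4), dtype=np.float32)
batch = pad_and_stack([sample_a, sample_b])  # shape (2, 100, 4)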

NNtamp avatar Aug 23 '22 13:08 NNtamp

Hi, the current version of our code seems to support only 1 sample on each GPU. Could you show the error information you get when training with a batch size of 1?

LittlePey avatar Aug 24 '22 02:08 LittlePey

Thank you for your answer. Attached you can find the traceback error from training with a batch size of 1. message (6).txt

NNtamp avatar Aug 24 '22 06:08 NNtamp

@LittlePey Does the above traceback error help you to understand the issue? Is there a solution? Thank you in advance.

NNtamp avatar Aug 25 '22 09:08 NNtamp

@LittlePey Any idea?

NNtamp avatar Aug 28 '22 17:08 NNtamp

Hi, it is a bug in the SFD code that occurs when a RoI contains no pseudo points. Sometimes we just resume from the latest checkpoint and the error disappears.
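
To make the failure mode concrete: if no pseudo point falls inside a given RoI, the corresponding point tensor is empty and later operations on it can fail. The guard below is only an illustration of the idea; ensure_nonempty and the list-of-tensors layout are assumptions for the sketch, not the actual SFD implementation.

import torch

def ensure_nonempty(roi_pseudo_points):
    # roi_pseudo_points: assumed to be a list of (N_i, C) tensors, one per
    # RoI, where N_i may be 0 if no pseudo point falls inside that RoI.
    fixed = []
    for pts in roi_pseudo_points:
        if pts.shape[0] == 0:
            # Fallback: a single all-zero point keeps downstream shapes valid.
            pts = torch.zeros((1, pts.shape[-1]), dtype=pts.dtype, device=pts.device)
        fixed.append(pts)
    return fixed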

LittlePey avatar Aug 29 '22 02:08 LittlePey

Hi @LittlePey, and thank you for your answer. To be honest, I didn't understand it, so let me restate the issue. We tried to train SFD on multiple classes with a batch size of 1 on a single-GPU machine and received the attached error. (We also tried larger batch sizes, but as you mentioned, the current version of the code seems to support only 1 sample per GPU.) How can we resume from the latest checkpoint if the training doesn't start at all? Is there a solution? What do you think? Thank you in advance. message.6.txt

NNtamp avatar Aug 29 '22 05:08 NNtamp

Hi, we didn't encounter your problem of the training not starting at all. Maybe you can skip the forward and backward passes when this situation happens.
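
As a rough illustration of the "skip the forward and backward passes" suggestion, the training step could be wrapped in a try/except so the offending batch is simply skipped. This is a generic sketch, not SFD's actual train_one_epoch; model_func and the (loss, tb_dict, disp_dict) return convention are assumed here, following the OpenPCDet-style loop referenced in the traceback.

def train_one_epoch_skipping_bad_batches(model, optimizer, dataloader, model_func):
    # model_func(model, batch) is assumed to return (loss, tb_dict, disp_dict).
    accumulated_iter = 0
    for cur_it, batch in enumerate(dataloader):
        try:
            optimizer.zero_grad()
            loss, tb_dict, disp_dict = model_func(model, batch)  # forward pass
            loss.backward()                                      # backward pass
            optimizer.step()
            accumulated_iter += 1
        except (ValueError, RuntimeError) as e:
            # Narrow this to the exact exception you see in practice; here we
            # just log the failing iteration and continue with the next batch.
            print(f'Skipping iteration {cur_it}: {e}')
            continue
    return accumulated_iter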

LittlePey avatar Sep 05 '22 02:09 LittlePey

Hi again @LittlePey. Any update on this? Is there a way to configure the training procedure with a batch size of 2?

NNtamp avatar Sep 15 '22 06:09 NNtamp

This bug appears before the first epoch finishes, so how can I resume from the latest checkpoint?

Dowe-dong avatar Sep 20 '22 02:09 Dowe-dong