YOLOv3-in-PyTorch
Training Fails with COCO Dataset
Operating System: Debian 9 (Stretch)
Reproducible: Always
Steps to Reproduce:
- cd into the src folder
- Use the command: python3 main.py train --verbose --dataset coco --img-dir /home/user/COCO/train2017 --annot-path /home/user/COCO/annotations/instances_train2017.json --reset-weights
Observed Behaviour: A ValueError was raised: “The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()”
Expected Behaviour: Training should have completed successfully
I am getting the same error. Here is the error message:
Traceback (most recent call last):
File "main.py", line 441, in <module>
run_yolo_training(options)
File "main.py", line 434, in run_yolo_training
ckpt_dir)
File "main.py", line 239, in run_training
for batch_i, (imgs, targets, target_lengths) in enumerate(dataloader):
File "/home/user/mlenv/lib/python3.7/site-packages/torch/utils/home/user/home/userloader.py", line 345, in __next__
data = self._next_data()
File "/home/user/mlenv/lib/python3.7/site-packages/torch/utils/home/user/home/userloader.py", line 856, in _next_data
return self._process_data(data)
File "/home/user/mlenv/lib/python3.7/site-packages/torch/utils/home/user/home/userloader.py", line 881, in _process_data
data.reraise()
File "/home/user/mlenv/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/user/mlenv/lib/python3.7/site-packages/torch/utils/home/user/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/user/mlenv/lib/python3.7/site-packages/torch/utils/home/user/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/user/mlenv/lib/python3.7/site-packages/torch/utils/home/user/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/user/repos/YOLOv3-in-PyTorch/src/home/usersets/coco.py", line 70, in __getitem__
transformed_img_tensor, label_tensor = self._tf(img, label_tensor)
File "/home/user/repos/YOLOv3-in-PyTorch/src/home/usersets/transforms.py", line 168, in __call__
img, label = t(img, label)
File "/home/user/repos/YOLOv3-in-PyTorch/src/home/usersets/transforms.py", line 157, in __call__
label = _affine_transform_label(label, affine_transform_matrix)
File "/home/user/repos/YOLOv3-in-PyTorch/src/home/usersets/transforms.py", line 316, in _affine_transform_label
x1 = np.minimum(xy_lt[:, 0], xy_lb[:, 0])
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I did some debugging and reached line 295 in src/datasets/transforms.py. https://github.com/westerndigitalcorporation/YOLOv3-in-PyTorch/blob/5944f2eae2b8e1e64c7c50cd42aa5e97e9d0e98c/src/datasets/transforms.py#L295-L297
This is where the RSS matrix is created in the _get_affine_matrix method. If I understand correctly, the RSS matrix is meant to be a 3 x 3 matrix, which requires the angle and shear to be scalars. The angle used in the calculation is a scalar, but the shear is a 1 x 2 array. This gives the RSS matrix and the affine matrix a malformed shape, which eventually leads to the error message seen above. The error is triggered at: https://github.com/westerndigitalcorporation/YOLOv3-in-PyTorch/blob/5944f2eae2b8e1e64c7c50cd42aa5e97e9d0e98c/src/datasets/transforms.py#L316
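To illustrate the shape problem, here is a minimal sketch (not the repo's exact formula; all values are illustrative) of what happens to an RSS-style matrix when the shear is a two-element array instead of a scalar:

import numpy as np

angle = np.radians(5.0)

# Older torchvision behaviour (assumed): shear is a single scalar,
# so every entry of the matrix is a scalar and the shape is (3, 3).
shear = np.radians(3.0)
rss = np.array([
    [np.cos(angle + shear), -np.sin(angle), 0.0],
    [np.sin(angle + shear),  np.cos(angle), 0.0],
    [0.0,                    0.0,           1.0],
])
print(rss.shape)  # (3, 3), as the label transform expects

# Newer torchvision behaviour (assumed): shear is [shear_x, shear_y].
shear = np.radians([3.0, 0.0])
print(np.cos(angle + shear).shape)  # (2,): a single matrix entry is now an array
# Packing such ragged entries with np.array() no longer produces a numeric
# 3x3 matrix (older NumPy silently builds an object array, newer NumPy raises),
# and the later element-wise box-corner comparison in _affine_transform_label
# then fails with the ambiguous-truth-value ValueError shown above.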
Further debugging led to these two lines in transforms.py, in the RandomAffineWithLabel.__call__ method:
https://github.com/westerndigitalcorporation/YOLOv3-in-PyTorch/blob/5944f2eae2b8e1e64c7c50cd42aa5e97e9d0e98c/src/datasets/transforms.py#L152-L153
self.shear is (-10, 10) in all cases, and the call returns the shear as a two-element array of the form [n, 0], where n is between -10 and 10. The documentation suggests these are the x- and y-axis shear values; the y-axis shear is 0 since it was not requested.
We could potentially solve the problem by adding this line after line 153:
shear = shear[0]
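For concreteness, here is a hedged sketch of where that one-line change would sit, modelled on torchvision's RandomAffine.get_params call (the parameter values below are illustrative, not the repo's configured values):

import torchvision.transforms as T

# Illustrative parameters only; the repo passes its own configured values.
angle, translations, scale, shear = T.RandomAffine.get_params(
    (-5.0, 5.0),      # degrees
    (0.1, 0.1),       # translate
    (0.9, 1.1),       # scale range
    (-10.0, 10.0),    # shear range
    (416, 416),       # image size
)

# Newer torchvision returns shear as [shear_x, shear_y]; the custom label
# transform expects a scalar, so keep only the x-axis shear.
if isinstance(shear, (list, tuple)):
    shear = shear[0]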
I am not sure if all my assumptions are right. Waiting for a review.
Hi @Hi-Chem246 and @gokuldas
Sorry for the late reply. This repo is not actively maintained, since we are switching to the mmdetection framework, which has better pipeline support.
To answer your question in general: since this repo depends on torchvision for transformations, a change in the torchvision API may be causing the error.
Looking at the torchvision changes, this commit might be the reason: https://github.com/pytorch/vision/commit/55088157c09c9368fdffaaaaacf5f7f3db641aac#diff-fc1f220b470714d05cf3ea6acf9fed59
I encourage you to try YOLO on mmdetection. It will be more extensible than this one:
https://github.com/wuhy08/mmdetection/tree/yolo
Hi @wuhy08,
Thanks for the information and suggestion! I will check it out.
Your work is really great!
@Hi-Chem246 Hello, how are you? Did you successfully train YOLO on COCO?
@eng100200 Not OP, but we did eventually get it to work, with some imperfections. The function that threw the exception was meant to transform the training images and their labels randomly but identically. The image was transformed using torchvision, and the label was transformed using a custom function originally identical to the one in torchvision. The torchvision transform gradually evolved and diverged from the transform used here. We partially solved the problem by making the two somewhat similar again. There were other issues too, including one that caused the dataloader to run out of shared memory and crash. We worked around some of these issues, but never solved them entirely.
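For anyone hitting the same issue, a rough sketch of the pattern described above (draw one set of random parameters and apply it to both the image and its labels); the label routine here is a stand-in, not the repo's actual function:

import torchvision.transforms as T
import torchvision.transforms.functional as TF

def paired_random_affine(img, boxes, transform_boxes):
    # Draw ONE set of random affine parameters (ranges are illustrative)...
    angle, translations, scale, shear = T.RandomAffine.get_params(
        (-5.0, 5.0), (0.1, 0.1), (0.9, 1.1), (-10.0, 10.0), img.size
    )
    # ...and apply it identically to the image (via torchvision) and to the
    # bounding boxes (via a user-supplied routine, assumed here).
    img = TF.affine(img, angle, translations, scale, shear)
    boxes = transform_boxes(boxes, angle, translations, scale, shear)
    return img, boxes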
@gokuldas Hello, how are you? I now understand the meaning of "OP". However, I must say that during training I used Person_keypoints_train_2017.json. When I use this annotation file it generates some empty data labels, since, I believe, this annotation only contains samples and IDs for the person category. Since the repository says to use the pycocotools API, I believe the API has no problem reading the annotations; it just reads them and generates only person-category samples.
But I don't know; it still has this problem and generates labels like this:
See, there are many labels that have no data!
@gokuldas @wuhy08 please do respond.
@eng100200 I haven't looked at the annotation file, but have you checked a few sample images? The missing labels may actually be correct.
That aside, I recommend moving to the repo suggested by the author. This repo hasn't been updated in a while and breaks a lot due to mismatch with updated dependencies.
@gokuldas What do you mean by "missing labels may be correct"? OK, I will move to the suggested repo. Do you have an email?
@gokuldas Hello, do you know how to install Git LFS?
@gokuldas Please reply.
@eng100200 Sorry for the delay. Git LFS is available here: https://git-lfs.github.com/ . You will have to run git lfs install before you clone this repo.
@gokuldas Do you have an email? Actually, I need to discuss a couple of things if you don't mind, and I have a lot of doubts about this code. My email is [email protected]
@gokuldas I will wait for your response.