
Training Fails with COCO Dataset

Open Hi-Chem246 opened this issue 5 years ago • 19 comments

Operating System: Debian 9 (Stretch)

Reproducible: Always

Steps to Reproduce:

  1. cd into the src folder

  2. Use the command: python3 main.py train --verbose --dataset coco --img-dir /home/user/COCO/train2017 --annot-path /home/user/COCO/annotations/instances_train2017.json --reset-weights

Observed Behaviour: A ValueError was raised: “The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()”

Expected Behaviour: Training should have completed successfully

Hi-Chem246 avatar Feb 18 '20 16:02 Hi-Chem246

I am getting the same error. Here is the error message:

Traceback (most recent call last):
  File "main.py", line 441, in <module>
    run_yolo_training(options)
  File "main.py", line 434, in run_yolo_training
    ckpt_dir)
  File "main.py", line 239, in run_training
    for batch_i, (imgs, targets, target_lengths) in enumerate(dataloader):
  File "/home/user/mlenv/lib/python3.7/site-packages/torch/utils/home/user/home/userloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/user/mlenv/lib/python3.7/site-packages/torch/utils/home/user/home/userloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/home/user/mlenv/lib/python3.7/site-packages/torch/utils/home/user/home/userloader.py", line 881, in _process_data
    data.reraise()
  File "/home/user/mlenv/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/user/mlenv/lib/python3.7/site-packages/torch/utils/home/user/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/user/mlenv/lib/python3.7/site-packages/torch/utils/home/user/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/user/mlenv/lib/python3.7/site-packages/torch/utils/home/user/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/user/repos/YOLOv3-in-PyTorch/src/home/usersets/coco.py", line 70, in __getitem__
    transformed_img_tensor, label_tensor = self._tf(img, label_tensor)
  File "/home/user/repos/YOLOv3-in-PyTorch/src/home/usersets/transforms.py", line 168, in __call__
    img, label = t(img, label)
  File "/home/user/repos/YOLOv3-in-PyTorch/src/home/usersets/transforms.py", line 157, in __call__
    label = _affine_transform_label(label, affine_transform_matrix)
  File "/home/user/repos/YOLOv3-in-PyTorch/src/home/usersets/transforms.py", line 316, in _affine_transform_label
    x1 = np.minimum(xy_lt[:, 0], xy_lb[:, 0])
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

gokuldas avatar Feb 18 '20 21:02 gokuldas

I did some debugging and reached line 295 in src/datasets/transforms.py. https://github.com/westerndigitalcorporation/YOLOv3-in-PyTorch/blob/5944f2eae2b8e1e64c7c50cd42aa5e97e9d0e98c/src/datasets/transforms.py#L295-L297

This is where the RSS matrix is created in the _get_affine_matrix method. If I understand correctly, the RSS matrix is meant to be a 3 × 3 matrix, which requires the angle and shear to be scalars. The angle used in the calculation is a scalar, but the shear is a 2-element array. This gives the RSS matrix and the affine matrix an unexpected shape, which eventually leads to the error message seen above. The error is triggered at: https://github.com/westerndigitalcorporation/YOLOv3-in-PyTorch/blob/5944f2eae2b8e1e64c7c50cd42aa5e97e9d0e98c/src/datasets/transforms.py#L316
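
For illustration, here is a minimal standalone sketch (not the repo's code; the numbers are made up) of the shape mismatch. Once the shear is a pair, every trigonometric entry that goes into the matrix becomes a length-2 array instead of a scalar:

# Hypothetical demonstration, not code from this repo.
import math
import numpy as np

angle = math.radians(10.0)        # the angle is still a scalar
shear = np.radians([-7.3, 0.0])   # newer torchvision returns shear as an (x, y) pair

entry = np.cos(angle + shear)     # each matrix entry built like this is now an array
print(entry.shape)                # (2,) instead of a scalar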

gokuldas avatar Feb 18 '20 21:02 gokuldas

Further debugging led to these two lines in transforms.py, in the RandomAffineWithLabel.__call__ method: https://github.com/westerndigitalcorporation/YOLOv3-in-PyTorch/blob/5944f2eae2b8e1e64c7c50cd42aa5e97e9d0e98c/src/datasets/transforms.py#L152-L153

self.shear is (-10, 10) in all cases, and the call returns a 2-element array of the form [n, 0], with n between -10 and 10. The documentation suggests these are the x- and y-axis shear values; the y-axis shear is 0 since it is not requested.

We could potentially solve the problem by adding this line after line 153:

shear = shear[0]
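
Written out standalone, the idea is just to collapse the shear pair back to a scalar (the helper name below is mine, not the repo's):

# Hypothetical helper, not code from this repo.
import numpy as np

def scalar_shear(shear):
    # Newer torchvision returns shear as a pair (x_shear, y_shear);
    # keep only the x-axis value, since the y-axis shear is 0 here anyway.
    if isinstance(shear, (list, tuple, np.ndarray)):
        return float(shear[0])
    return float(shear)

print(scalar_shear([-7.3, 0.0]))  # -7.3
print(scalar_shear(4.2))          # 4.2 (older torchvision already gives a scalar)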

I am not sure if all my assumptions are right. Waiting for a review.

gokuldas avatar Feb 18 '20 22:02 gokuldas

Hi @Hi-Chem246 and @gokuldas

Sorry for the late reply. This repo is not actively maintained, since we are switching to the mmdetection framework, which has better pipeline support.

To answer your question in general: since this repo depends on torchvision for its transformations, a change in the torchvision API may be causing the error.

Looking through the torchvision changes, this commit might be the reason: https://github.com/pytorch/vision/commit/55088157c09c9368fdffaaaaacf5f7f3db641aac#diff-fc1f220b470714d05cf3ea6acf9fed59
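
A quick way to check which behaviour your installed torchvision has (a diagnostic sketch, not part of this repo; the exact get_params signature may differ slightly across versions):

# Hypothetical check, not code from this repo.
import torchvision.transforms as T

angle, translations, scale, shear = T.RandomAffine.get_params(
    degrees=(-5, 5), translate=None, scale_ranges=None,
    shears=(-10, 10), img_size=(416, 416))
print(type(shear), shear)  # older releases: a plain float; newer: a pair such as [n, 0.0]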

wuhy08 avatar Feb 21 '20 00:02 wuhy08

I encourage you to try YOLO on mmdetection. It will be more extensible than this one:

https://github.com/wuhy08/mmdetection/tree/yolo

wuhy08 avatar Feb 21 '20 00:02 wuhy08

Hi @wuhy08,

Thanks for the information and suggestion! I will check it out.

Your work is really great!

gokuldas avatar Feb 22 '20 14:02 gokuldas

@Hi-Chem246 Hello, how are you? Did you successfully train YOLO on COCO?

eng100200 avatar May 20 '20 08:05 eng100200

@eng100200 Not OP, but we did eventually get it to work, with some imperfections. The function that threw the exception is meant to transform a training image and its labels randomly but identically. The image is transformed with torchvision, while the label is transformed with a custom function that was originally identical to the torchvision one. The torchvision transform gradually evolved and diverged from the copy used here. We partially solved the problem by bringing the two back into rough agreement. There were other issues too, including one that caused the DataLoader to run out of shared memory and crash. We worked around some of these issues, but never solved them entirely.
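
For the shared-memory crash specifically, the usual generic workarounds (not specific to this repo) are to lower the DataLoader's num_workers, enlarge /dev/shm, or change PyTorch's tensor-sharing strategy, e.g.:

# Generic PyTorch workaround, not code from this repo.
import torch.multiprocessing as mp

# Share tensors between DataLoader workers via the file system instead of
# POSIX shared memory, which avoids exhausting /dev/shm.
mp.set_sharing_strategy("file_system")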

gokuldas avatar May 21 '20 00:05 gokuldas

@gokuldas Hello, how are you? I did understand the meaning of "OP". However, I must say that during training I used person_keypoints_train2017.json. When I use this annotation file it generates some empty data labels, since, I believe, this annotation file only contains human-category samples and IDs. Since the repository says to use the pycocotools API, I believe the API has no problem reading the annotations; it just reads the file and generates only human-category samples. But I don't know why it still has this problem and produces labels like this (see the attached screenshot): many of the labels have no data!

eng100200 avatar May 21 '20 01:05 eng100200

@gokuldas @wuhy08 please do respond.

eng100200 avatar May 21 '20 01:05 eng100200

@eng100200 I haven't looked at the annotation file. But have you checked a few sample images? The missing labels may actually be correct.
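
If you want to verify that quickly, something like this would count how many images have no annotations at all (a sketch using pycocotools; adjust the path to your annotation file):

# Hypothetical check using pycocotools; the path is illustrative.
from pycocotools.coco import COCO

coco = COCO("annotations/person_keypoints_train2017.json")
img_ids = coco.getImgIds()
empty = [i for i in img_ids if len(coco.getAnnIds(imgIds=i)) == 0]
print(f"{len(empty)} of {len(img_ids)} images have no annotations")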

That aside, I recommend moving to the repo suggested by the author. This repo hasn't been updated in a while and breaks frequently due to mismatches with updated dependencies.

gokuldas avatar May 22 '20 00:05 gokuldas

@gokuldas What do you mean by "the missing labels may be correct"? OK, I will move to the suggested repo. Do you have an email?

eng100200 avatar May 22 '20 00:05 eng100200

@gokuldas Hello, do you know how to install Git LFS?

eng100200 avatar May 25 '20 12:05 eng100200

@gokuldas please reply

eng100200 avatar May 25 '20 12:05 eng100200

@gokuldas please reply

eng100200 avatar May 25 '20 13:05 eng100200

@eng100200 Sorry for the delay. Git LFS is available here: https://git-lfs.github.com/. You will have to run git lfs install before you clone this repo.

gokuldas avatar May 25 '20 21:05 gokuldas

@gokuldas Do you have an email? Actually, I need to discuss a couple of things, if you don't mind; there are a lot of doubts about this code in my mind. My email is [email protected]

eng100200 avatar May 26 '20 00:05 eng100200

@gokuldas I will wait for your response.

eng100200 avatar May 26 '20 01:05 eng100200