Array size mismatch when calculating cross_entropy2d

Open chrisliu54 opened this issue 6 years ago • 14 comments

This happens when nll_loss is executed in code/torchfcn/utils.py in sourceonly mode. The training data is from GTA5.

The error occurs because the snippet inside cross_entropy2d() first uses a mask to exclude elements of target (that is, the labels) whose values are less than 0. In other words, mislabeled pixels are excluded from the cross-entropy calculation.

However, the corresponding prediction values for those removed pixels still exist in log_p, which leads to the array size conflict.
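To illustrate the mismatch described above, here is a minimal sketch that uses NumPy arrays as stand-ins for the PyTorch tensors. The shapes and variable names are assumptions for illustration, not the repository's code. The point is that log_p and target stay the same length only if the same mask is applied to both.

```python
import numpy as np

# Toy stand-ins for the tensors in cross_entropy2d (shapes are assumptions):
# log_p holds per-pixel log-probabilities, target holds per-pixel class ids.
n, c, h, w = 1, 3, 2, 2
log_p = np.log(np.full((n, c, h, w), 1.0 / c))
target = np.array([[[0, -1], [2, 1]]])  # -1 marks an ignored pixel

# Move the class axis last so that each row is one pixel: (n*h*w, c).
log_p_rows = log_p.transpose(0, 2, 3, 1).reshape(-1, c)
target_flat = target.reshape(-1)

# Apply the SAME mask to both arrays so their lengths stay in sync.
valid = target_flat >= 0
log_p_valid = log_p_rows[valid]
target_valid = target_flat[valid]
```

If only target were masked, log_p_rows would still have four rows while target_valid has three, which is the kind of size conflict reported here.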

chrisliu54 avatar Apr 11 '18 06:04 chrisliu54

To use GTA5 data, the label set has to be mapped to the Cityscapes label set. Since GTA5 has many more classes and our target is Cityscapes, we care only about the common classes. The data organization documentation provided here should help you:

https://github.com/VisionLearningGroup/taskcv-2017-public/tree/master/segmentation

We will include details about this in the README soon.
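As a rough idea of what such a mapping step could look like, here is a minimal NumPy sketch. The function name, the ignore value of -1, and the three-entry mapping are hypothetical placeholders; the real table should come from the documentation linked above.

```python
import numpy as np

IGNORE = -1  # value treated as "don't care" (assumption)

def remap_labels(label, id_map, ignore=IGNORE):
    """Map raw label ids to training ids; ids missing from id_map become `ignore`."""
    lut = np.full(256, ignore, dtype=np.int64)  # assumes raw ids fit in 0..255
    for raw_id, train_id in id_map.items():
        lut[raw_id] = train_id
    return lut[label]

# Hypothetical mapping for illustration only, not the official table.
EXAMPLE_MAP = {7: 0, 8: 1, 11: 2}
raw = np.array([[7, 8], [11, 34]], dtype=np.uint8)
mapped = remap_labels(raw, EXAMPLE_MAP)
```

Any raw id that is not listed in the mapping (34 in this toy input) ends up as the ignore value, so it never reaches the loss as an out-of-range class.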

swamiviv avatar Apr 11 '18 11:04 swamiviv

Actually, I found that elements in log_p are not removed accordingly if their corresponding label in target is out of range.

chrisliu54 avatar Apr 11 '18 12:04 chrisliu54

This should not happen if the labels are preprocessed correctly. Refer to this condition here:

https://github.com/swamiviv/LSD-seg/blob/e373e896be2de7b8c4a0bfa2a7ab0efc924836fd/code/torchfcn/utils.py#L55

This condition should ensure that the values of log_p exist only for the "in range" values of the target.

swamiviv avatar Apr 11 '18 13:04 swamiviv

Well, things are not going as expected. When I train on the GTA5 data, the program occasionally falls into the except block below: https://github.com/swamiviv/LSD-seg/blob/e373e896be2de7b8c4a0bfa2a7ab0efc924836fd/code/torchfcn/utils.py#L54-L57 BTW, I organized the dataset's file structure according to your code's specification.

chrisliu54 avatar Apr 11 '18 13:04 chrisliu54

In GTA5, there are some image-label pairs which are not of the same size, so this exception might be triggered there. Please use the clean filelist that we have uploaded in this repo. You can find it in the data/filelist directory. For training on GTA5, these should be GTA5_<image/label>list_train.txt.
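For anyone who wants to check their own file list for such pairs, a minimal sketch follows. It assumes the images and labels have already been loaded as NumPy arrays; the function name is hypothetical and the loading step is left out.

```python
import numpy as np

def find_mismatched_pairs(pairs):
    """Return indices of (image, label) pairs whose height/width differ.

    `pairs` is an iterable of (image, label) arrays, with image shaped
    (H, W, 3) and label shaped (H, W).
    """
    bad = []
    for idx, (img, lbl) in enumerate(pairs):
        if img.shape[:2] != lbl.shape[:2]:
            bad.append(idx)
    return bad

# Toy data: the second pair has a label that is one row taller than its image.
pairs = [
    (np.zeros((4, 6, 3)), np.zeros((4, 6))),
    (np.zeros((4, 6, 3)), np.zeros((5, 6))),
]
bad = find_mismatched_pairs(pairs)
```

Entries flagged this way could then be dropped from the file list before training.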

swamiviv avatar Apr 11 '18 13:04 swamiviv

Thanks, but I am indeed using your data/filelist directory. However, the results are still as I described above. Could you please specify your pytorch/torchvision versions? At first I used pytorch: 0.3.1-py36_cuda8.0.61_cudnn7.1.2_3 and torchvision: 0.2.0-py36h17b6947_1, but I got this error when calling backward() in sourceonly mode with GTA5:

RuntimeError: invalid argument 1: the number of sizes provided must be greater or equal to the number of dimensions in the tensor at /opt/conda/conda-bld/pytorch_1523244252089/work/torch/lib/THC/generic/THCTensor.c:326

So I switched to PyTorch compiled from source (the latest version) according to this post. I'm not sure whether the PyTorch version is what causes this problem. Thx.

chrisliu54 avatar Apr 11 '18 14:04 chrisliu54

>>> import torch
>>> torch.__version__
'0.2.0_3'
>>> import torchvision
>>> torchvision.__version__
'0.2.0'

Does this error occur with all files? Can you verify why these errors occur? If you can give more info from your end, we can help debug this.

swamiviv avatar Apr 11 '18 14:04 swamiviv

I guess it is all about the PyTorch version. When I used 0.3.1, cross_entropy2d worked fine but backward() hit the RuntimeError mentioned above. When I used the latest PyTorch compiled from source, backward() worked but cross_entropy2d failed in the try-except block below: https://github.com/swamiviv/LSD-seg/blob/e373e896be2de7b8c4a0bfa2a7ab0efc924836fd/code/torchfcn/utils.py#L54-L57

BTW, when trying to train on the SYNTHIA dataset, I don't know which directory corresponds to synthia_mapped_to_cityscapes as specified in LSD-seg/data/filelist/SYNTHIA_labellist_train.txt, since the SYNTHIA-RAND-CITYSCAPES dataset contains only three subdirectories, namely Depth, GT and RGB (which you chose as the training images). After checking the images themselves, I picked GT/COLOR as synthia_mapped_to_cityscapes (labels). When I then ran the code in sourceonly mode, I got a RuntimeError due to an array size mismatch in cross_entropy2d(): specifically, target (that is, the label image) has shape (minibatch x h x w x 4), not (minibatch x h x w). https://github.com/swamiviv/LSD-seg/blob/e373e896be2de7b8c4a0bfa2a7ab0efc924836fd/code/torchfcn/utils.py#L35-L44
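The 4-channel shape suggests that a color (RGBA) image was loaded where a single-channel class-id map is expected. As a sketch only, assuming a known color palette (the two-entry palette and the function name below are hypothetical), a color label image could be converted to an (h x w) id map like this:

```python
import numpy as np

def color_to_class_ids(color_label, palette, ignore=-1):
    """Convert an (H, W, 3 or 4) color label image to an (H, W) class-id map."""
    rgb = color_label[..., :3]  # drop the alpha channel if present
    ids = np.full(rgb.shape[:2], ignore, dtype=np.int64)
    for class_id, color in palette.items():
        match = np.all(rgb == np.array(color), axis=-1)
        ids[match] = class_id
    return ids

# Hypothetical two-class palette, for illustration only.
PALETTE = {0: (128, 64, 128), 1: (70, 70, 70)}
rgba = np.zeros((2, 2, 4), dtype=np.uint8)
rgba[0, 0] = (128, 64, 128, 255)
rgba[0, 1] = (70, 70, 70, 255)
ids = color_to_class_ids(rgba, PALETTE)
```

Colors that are not in the palette fall back to the ignore value. Whether GT/COLOR is the right source for the mapped labels is not confirmed in this thread.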

chrisliu54 avatar Apr 12 '18 01:04 chrisliu54

For the SYNTHIA dataset, the labels need to be mapped to the classes it shares with Cityscapes. We will upload the mapped data soon.

swamiviv avatar Apr 12 '18 02:04 swamiviv

@swamiviv Have you uploaded the mapped data?

bjchen666 avatar Jul 24 '18 05:07 bjchen666

Got a problem with the cross_entropy2d function as well. Training runs well until it breaks at varying points. Sometimes it stops after 350 iterations, sometimes after 2000. I turned off shuffling of the filelists, so there should be no issue with the input data. Images and labels are fine.

The error which occurs:

Train epoch = 0:  11%|##3                  | 331/2975 [07:48<1:02:12,  1.41s/it]
Traceback (most recent call last):
  File "train.py", line 161, in <module>
    main()
  File "train.py", line 157, in main
    trainer.train()
  File "<ROOT>/LSD-seg/code/torchfcn/trainer_LSD.py", line 361, in train
    self.train_epoch()
  File "<ROOT>/LSD-seg/code/torchfcn/trainer_LSD.py", line 254, in train_epoch
    lossD_src_real_c = cross_entropy2d(outD_src_real_c, label_forD, size_average=self.size_average)
  File "<ROOT>/LSD-seg/code/torchfcn/utils.py", line 65, in cross_entropy2d
    loss = F.nll_loss(log_p, target, weight=weight, size_average=False)
  File "<ROOT_ENV>/lib/python2.7/site-packages/torch/nn/functional.py", line 676, in nll_loss
    raise ValueError('Expected 2 or 4 dimensions (got {})'.format(dim))
ValueError: Expected 2 or 4 dimensions (got 0)
Exception KeyError: KeyError(<weakref at 0x7fae42094f70; to 'tqdm' at 0x7fae2c0d7dd0>,) in <bound method tqdm.__del__ of Train:   0%|                                             | 0/33 [07:49<?, ?it/s]> ignored
Exception KeyError: KeyError(<weakref at 0x7fae2edc60a8; to 'tqdm' at 0x7fae2c0b4950>,) in <bound method tqdm.__del__ of Train epoch = 0:  11%|##3                  | 331/2975 [07:48<1:02:12,  1.41s/it]> ignored

Maybe there is an issue due to the versions of the installed packages. Until I used the exact version of PyTorch you used (0.2.0_3) there were many more issues, so I guess it is important to use exactly your build @swamiviv. So maybe you could post the versions you used for the fcn and opencv packages as well?

Due to the non-deterministic behaviour of the training I don't know what to do to get this running. :(

The only change I made was in segmentation_datasets.py to modify the Cityscapes labels, which I use as the source domain. When I used the stock code I got the same exception mentioned above in the cross_entropy2d function:

/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [7,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [10,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generated/../THCReduceAll.cuh line=334 error=59 : device-side assert triggered
Exception KeyError: KeyError(<weakref at 0x7fac43de7f70; to 'tqdm' at 0x7fac02b08950>,) in <bound method tqdm.__del__ of Train epoch = 0:   0%|                                 | 0/2975 [00:01<?, ?it/s]> ignored
Exception KeyError: KeyError(<weakref at 0x7fac320bd890; to 'tqdm' at 0x7fac30d6bdd0>,) in <bound method tqdm.__del__ of Train:   0%|                                             | 0/33 [00:02<?, ?it/s]> ignored
Exception:  (1L, 40L, 80L)
Traceback (most recent call last):
  File "train.py", line 161, in <module>
    main()
  File "train.py", line 157, in main
    trainer.train()
  File "<ROOT>/LSD-seg/code/torchfcn/trainer_LSD.py", line 361, in train
    self.train_epoch()
  File "<ROOT>/LSD-seg/code/torchfcn/trainer_LSD.py", line 281, in train_epoch
    lossF_src_adv_s = cross_entropy2d(outD_src_fake_s, domain_labels_tgt_real,size_average=self.size_average)
  File "<ROOT>/LSD-seg/code/torchfcn/utils.py", line 61, in cross_entropy2d
    mask = target >= 0
  File "<ROOT_ENV>/lib/python2.7/site-packages/torch/autograd/variable.py", line 888, in __ge__
    return self.ge(other)
  File "<ROOT_ENV>/lib/python2.7/site-packages/torch/autograd/variable.py", line 802, in ge
    return Ge.apply(self, other)
  File "<ROOT_ENV>/lib/python2.7/site-packages/torch/autograd/_functions/compare.py", line 17, in forward
    mask = getattr(a, cls.fn_name)(b)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generated/../THCTensorMathCompare.cuh:84

Because of this I flagged all out-of-range labels (255) with -1, because with 255 we would exceed n_classes.

To do that, I changed the code in segmentation_datasets.py - SegmentationData_BaseClass - __getitem__(self, index) to:

def __getitem__(self, index):
    data_file = self.files[self.split][index]

    # Loading image and label
    img, lbl = self.image_label_loader(data_file['img'], data_file['lbl'], self.image_size, random_crop=True)
    img = img[:, :, ::-1]         # reverse the channel order to match mean_bgr
    img -= self.mean_bgr
    img = img.transpose(2, 0, 1)  # HWC -> CHW

    if self.dset != 'cityscapes':
        # Everything above the 19 training classes becomes "don't care"
        lbl[lbl > 18] = -1
    else:
        # Move -1 out of the way before the uint8 round trip, then flag
        # every id above 18 (including 255) as "don't care"
        lbl[lbl == -1] = 19
        lbl = Image.fromarray(lbl.squeeze().astype(np.uint8))
        lbl = np.array(lbl, dtype=np.int32)
        lbl[lbl > 18] = -1

    img = torch.from_numpy(img.copy()).float()
    lbl = torch.from_numpy(lbl.copy()).long()

    return img, lbl
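A cheap way to catch labels like 255 before they trigger the `t >= 0 && t < n_classes` assertion on the GPU is to validate the label array on the CPU first. The following is a minimal NumPy sketch; the function name and the ignore value of -1 are assumptions based on this thread, not part of the repository.

```python
import numpy as np

def check_label_range(lbl, n_classes, ignore=-1):
    """Return the sorted unexpected values in `lbl` (an empty list means OK)."""
    values = np.unique(lbl)
    ok = (values == ignore) | ((values >= 0) & (values < n_classes))
    return values[~ok].tolist()

# Toy label map: 255 is outside the valid range for 19 classes.
lbl = np.array([[0, 18], [-1, 255]])
unexpected = check_label_range(lbl, n_classes=19)
```

Running such a check once over the dataset, or inside __getitem__ while debugging, would point at the offending files instead of an opaque device-side assert.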

Edit/Update: Fixed it. Due to image cropping there was a chance of getting images with only don't-care labels (-1), so after the line log_p = log_p[target.view(n, h, w, 1).repeat(1, 1, 1, c) >= 0] there were no entries left to compute the loss and the exception was thrown. I wrote a workaround to catch this and now the training runs fine. :)

Toxiiin avatar Sep 27 '18 13:09 Toxiiin

@Toxiiin may I ask what your workaround involved? I am facing the same issue with

log_p = log_p[target.view(n, h, w, 1).repeat(1, 1, 1, c) >= 0]

mattmcc97 avatar Feb 12 '19 13:02 mattmcc97

@mattmcc97 The easiest workaround would be to ensure that the cropped images do not consist only of pixels flagged as don't care (-1). One possibility would be to loop over the cropping operation until you get an image with a sufficient number of valid labels (!= -1).
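A minimal sketch of such a retry loop, written with NumPy: the function name, its parameters and the retry limit are hypothetical, and the repository's own image_label_loader is not reproduced here.

```python
import numpy as np

def random_crop_with_valid_labels(img, lbl, size, min_valid=1, max_tries=10,
                                  ignore=-1, rng=None):
    """Random-crop (img, lbl), retrying until the crop has enough valid labels.

    Returns the last crop tried if no attempt succeeds, so callers should
    still guard against all-ignore targets.
    """
    if rng is None:
        rng = np.random.default_rng()
    ch, cw = size
    h, w = lbl.shape
    for _ in range(max_tries):
        top = int(rng.integers(0, h - ch + 1))
        left = int(rng.integers(0, w - cw + 1))
        img_c = img[top:top + ch, left:left + cw]
        lbl_c = lbl[top:top + ch, left:left + cw]
        if np.count_nonzero(lbl_c != ignore) >= min_valid:
            break
    return img_c, lbl_c

# Toy example: only the top-left 2x2 block of the label map is valid.
lbl = np.full((4, 4), -1)
lbl[:2, :2] = 5
img = np.zeros((4, 4, 3))
img_c, lbl_c = random_crop_with_valid_labels(
    img, lbl, (2, 2), max_tries=200, rng=np.random.default_rng(0))
```

The retry limit keeps the loader from spinning forever on an image that has no valid pixels at all.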

Furthermore, you could manually set the calculated loss to zero if there are only don't-care labels in the current image (so there won't be a gradient update in this step), but this would be the quick and dirty solution I think. :)
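The quick and dirty variant could look roughly like the sketch below, again with NumPy as a stand-in for the tensors. The function name is hypothetical. In the actual PyTorch code the zero would have to be a tensor (for example log_p.sum() * 0) so that backward() still works; that detail is left out here.

```python
import numpy as np

def masked_nll(log_p, target, ignore=-1):
    """Summed negative log-likelihood over valid pixels; 0.0 if all are ignored.

    log_p: (N, C) log-probabilities, target: (N,) class ids.
    """
    valid = target != ignore
    if not valid.any():
        return 0.0  # nothing to learn from in this batch
    rows = np.nonzero(valid)[0]
    return float(-log_p[rows, target[rows]].sum())

# Four pixels, two classes, uniform predictions.
log_p = np.log(np.full((4, 2), 0.5))
all_ignored = masked_nll(log_p, np.array([-1, -1, -1, -1]))
some_valid = masked_nll(log_p, np.array([0, -1, 1, -1]))
```

With uniform predictions over two classes, each valid pixel contributes ln 2 to the sum, while the all-ignored batch contributes nothing.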

Toxiiin avatar Feb 12 '19 13:02 Toxiiin

Thanks, I originally thought I might have a mask with everything labelled not interesting. It turns out, there was some corruption in my masks, and a few pixels were labeled with random unexpected values.

mattmcc97 avatar Feb 12 '19 17:02 mattmcc97