RuntimeError: copy_if failed to synchronize: an illegal memory access was encountered

Open ltshan opened this issue 4 years ago • 11 comments

Hi, thanks for your great work. I'm trying to train the model on my own data using a single GPU, and I have already made a few modifications as described in your README. But I hit the issue below when starting training. Could you help check it? Thanks again.

ltshan avatar Sep 01 '20 02:09 ltshan

Replacing the picture with the log:

bc311@bc311-ai1:/work/xxx/BiSeNet$ python3 tools/train.py --model bisenetv2
loss_pre= tensor(1.2688, device='cuda:0', grad_fn=<MeanBackward0>)
loss_aux= [tensor(9.9736, device='cuda:0', grad_fn=<MeanBackward0>), tensor(3.0100, device='cuda:0', grad_fn=<MeanBackward0>), tensor(5.4277, device='cuda:0', grad_fn=<MeanBackward0>), tensor(3.6156, device='cuda:0', grad_fn=<MeanBackward0>)]
sum= tensor(22.0268, device='cuda:0', grad_fn=<AddBackward0>)
loss= tensor(23.2956, device='cuda:0', grad_fn=<AddBackward0>)
Traceback (most recent call last):
  File "tools/train.py", line 240, in <module>
    main()
  File "tools/train.py", line 236, in main
    train()
  File "tools/train.py", line 191, in train
    loss.backward()
  File "/home/bc311/.local/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/bc311/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: copy_if failed to synchronize: an illegal memory access was encountered

ltshan avatar Sep 01 '20 02:09 ltshan

I had the same problem after the first epoch. I found that I had not changed the val path in the config file. Make sure your Bisenetv1.py file has been changed correctly, for example:

im_root='/home/edge/fjj_workspace/data/img',
train_im_anns='/home/edge/fjj_workspace/data/trainJK.txt',
val_im_anns='/home/edge/fjj_workspace/data/valJK.txt',
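A quick way to catch a stale path like this before training starts is to check that every configured path exists. The snippet below is only a sketch: `missing_paths` is a hypothetical helper, not part of the BiSeNet repository, and the paths are placeholders to replace with your own config values.

```python
import os


def missing_paths(named_paths):
    """Return the names whose path does not exist on disk."""
    return [name for name, path in named_paths.items()
            if not os.path.exists(path)]


# Placeholder values (assumptions); substitute the paths from your config.
cfg_paths = {
    "im_root": "/path/to/data/img",
    "train_im_anns": "/path/to/data/train.txt",
    "val_im_anns": "/path/to/data/val.txt",
}

# Any name printed here points at a file or directory that is not there.
print(missing_paths(cfg_paths))
```

This only checks that the paths exist; it does not verify that the annotation files list valid image and label pairs.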

jiaji-fang avatar Sep 01 '20 03:09 jiaji-fang

Thanks for the information. It seems my case is different from yours: I hit the error on the first batch, right when training starts. I also checked the training image paths and that the images can be read, and found no problem.

ltshan avatar Sep 03 '20 09:09 ltshan

Hi,

Are you using your own dataset or the Cityscapes dataset?

CoinCheung avatar Sep 03 '20 09:09 CoinCheung

It's my own dataset

ltshan avatar Sep 03 '20 14:09 ltshan

Hi, sorry for the late reply. I will check your issue now. I fine-tuned on my own dataset.

jiaji-fang avatar Sep 03 '20 14:09 jiaji-fang

How many categories are there in your own dataset? Are you using the dataset class designed for Cityscapes, or did you implement a new dataset class?

CoinCheung avatar Sep 04 '20 02:09 CoinCheung

Please note that the training labels of Cityscapes are mapped from the label image pixel values according to the specification. See this: https://github.com/CoinCheung/BiSeNet/blob/aa3876b4b1f2c430e07678f8c15b96465681fca0/lib/base_dataset.py#L44
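To illustrate the kind of mapping meant here, the following is a minimal sketch of a 256-entry lookup table that turns raw label pixel values into training ids. It is not the repository's code: `build_label_map`, `remap_labels`, the raw values 0/100/200, and the choice of 255 as the ignore id are all assumptions made for the example.

```python
IGNORE_ID = 255  # assumed ignore index; check what your loss is configured with


def build_label_map(raw_to_train, ignore_id=IGNORE_ID):
    """Build a 256-entry table; raw values without a mapping become ignore_id."""
    table = bytearray([ignore_id] * 256)
    for raw_value, train_id in raw_to_train.items():
        table[raw_value] = train_id
    return bytes(table)


def remap_labels(label_bytes, table):
    """Map every uint8 label pixel through the lookup table."""
    return label_bytes.translate(table)


# Hypothetical 3-class dataset: raw pixel value -> training id.
table = build_label_map({0: 0, 100: 1, 200: 2})

# A tiny flattened "label image"; the value 7 has no mapping.
raw_mask = bytes([0, 100, 200, 7])
print(list(remap_labels(raw_mask, table)))  # [0, 1, 2, 255]
```

Sending unmapped values to the ignore id is a design choice in this sketch: it keeps unexpected pixel values from reaching the loss as class ids, at the cost of silently ignoring them.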

CoinCheung avatar Sep 04 '20 02:09 CoinCheung

There are 3 classes in my dataset, including background. It is in Pascal VOC format. How do I set the number of classes in the config file? Also, I checked and self.lb_map is NOT None; how should I change it for my dataset?

thanks

ltshan avatar Sep 04 '20 13:09 ltshan

Hello @CoinCheung, I ran into this error. Do you have any ideas?

Traceback (most recent call last):
  File "D:/GitHub/BiSeNet/tools/train_amp.py", line 219, in <module>
    main()
  File "D:/GitHub/BiSeNet/tools/train_amp.py", line 215, in main
    train()
  File "D:/GitHub/BiSeNet/tools/train_amp.py", line 162, in train
    loss_pre = criteria_pre(logits, lb)
  File "D:\LenovoSoftstore\Install\python3.8\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\GitHub\BiSeNet\lib\ohem_ce_loss.py", line 37, in forward
    loss = self.criteria(logits, labels).view(-1) #logits:4 19 1024 1024 labels:4 1024 1024
  File "D:\LenovoSoftstore\Install\python3.8\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\LenovoSoftstore\Install\python3.8\lib\site-packages\torch\nn\modules\loss.py", line 1163, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "D:\LenovoSoftstore\Install\python3.8\lib\site-packages\torch\nn\functional.py", line 2996, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: an illegal memory access was encountered

miscedence12 avatar Aug 17 '22 08:08 miscedence12

@miscedence12 Did you check your dataset? Your label range?
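A label value outside the range the model expects is a common cause of this kind of error in cross-entropy on the GPU: with 19 output channels, as in the shapes noted in the traceback, every label pixel should be in 0..18 or equal to the ignore index. The sketch below shows one way to check a flattened uint8 label mask. `find_invalid_labels` is a hypothetical helper, and 255 as the ignore index is an assumption.

```python
def find_invalid_labels(label_bytes, n_classes, ignore_id=255):
    """Return sorted label values that are neither a class id nor ignore_id."""
    return sorted(value for value in set(label_bytes)
                  if value >= n_classes and value != ignore_id)


# Hypothetical flattened masks for a 3-class dataset.
good_mask = bytes([0, 1, 2, 255])
bad_mask = bytes([0, 1, 2, 3, 128])

print(find_invalid_labels(good_mask, n_classes=3))  # []
print(find_invalid_labels(bad_mask, n_classes=3))   # [3, 128]
```

Running a check like this over every label file after any remapping, and before training, tends to be easier than reading the CUDA traceback, which is often reported at a later, unrelated call.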

CoinCheung avatar Aug 17 '22 09:08 CoinCheung

I am closing this, since the problem is likely to have been solved.

CoinCheung avatar Aug 07 '23 10:08 CoinCheung