
Training problems about CTCloss and Chinese training.

Open PkuDavidGuan opened this issue 3 years ago • 1 comments

Dear yuxin, sorry to bother you again. While using your code, I ran into two new problems: 1. When I run python train_LF_1.py, I get a CUDA error in ClassNLLCriterion.cu. 2. After I modified the code for Chinese training, the model does not converge.

Question 1: CUDA error in ClassNLLCriterion.cu.

The error info:

THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=29 error=710 : device-side assert triggered                          
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.

solution:

This looks like a bug (my PyTorch version is 1.71). After I changed nclass from 37 to 38, the error went away. I think 38 is reasonable: 36 normal characters plus 2 special tokens. I modified these two lines:

VisionLAN.py
71:        self.Prediction = Prediction(n_position=256, N_max_character=26, n_class=37) # N_max_character = 1 eos + 25 characters
72:        self.nclass = 37
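The assertion in the error log, `t >= 0 && t < n_classes`, fires when a target index falls outside the class range. A minimal sketch of that same check in plain Python, which can be run on the encoded labels before they reach the loss; the function name and the sample target list are hypothetical and not part of VisionLAN:

```python
def out_of_range_targets(targets, n_class):
    """Return (position, value) pairs that violate 0 <= t < n_class."""
    return [(i, t) for i, t in enumerate(targets) if not 0 <= t < n_class]


# Hypothetical encoded label whose largest index is 37.
targets = [3, 12, 37, 0]

# With n_class=37 the valid indices are 0..36, so index 37 is reported.
print(out_of_range_targets(targets, n_class=37))

# With n_class=38 the same label passes the check.
print(out_of_range_targets(targets, n_class=38))
```

If this check reports anything with n_class=37, raising n_class hides the symptom, but the label encoding may still disagree with the dictionary the model expects.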

Question 2: Chinese training fails.

I modified the code for Chinese training, but the model does not converge. The loss drops very slowly. Did you change the training config when training on TRW15?

PkuDavidGuan • Oct 26 '21 03:10

  1. For the 1st question, this line removes special symbols from recognition, so the classifier has 37 classes (0-9, a-z, and an EOS symbol). The error you describe suggests that your labels contain a character outside the predefined dictionary.

  2. For the 2nd question, we simply changed the max length and the number of classification categories to fit the TRW15 dataset. You may print these two lines (1st and 2nd) to make sure that you generate the right labels.
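A quick way to run both checks over a label file is sketched below. It assumes a dictionary of 0-9 and a-z (36 characters, which with one EOS class gives the 37 classes mentioned above) and a limit of 25 characters per label, matching N_max_character = 1 EOS + 25 characters. The function name and sample labels are hypothetical; for Chinese training, CHARSET and MAX_CHARS would be replaced with the values used for that dataset.

```python
import string

# Assumed dictionary: digits and lowercase letters, 36 characters.
CHARSET = string.digits + string.ascii_lowercase
# Assumed limit: 25 characters, leaving one position for EOS.
MAX_CHARS = 25


def check_label(text, charset=CHARSET, max_chars=MAX_CHARS):
    """Return a list of human-readable problems found in one label."""
    problems = []
    unknown = sorted({ch for ch in text if ch not in charset})
    if unknown:
        problems.append(f"characters outside dictionary: {unknown}")
    if len(text) > max_chars:
        problems.append(f"length {len(text)} exceeds {max_chars}")
    return problems


# Hypothetical labels: one valid, one with an accented letter, one Chinese.
for label in ["hello", "caf\u00e9", "\u4e2d\u6587"]:
    print(repr(label), check_label(label))
```

Any label that reports a problem here would map to an index outside the class range or overflow the maximum length, so it is worth fixing the dictionary or the label before changing the number of classes.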

Hope these answers will help you.

wangyuxin87 • Oct 26 '21 06:10