onmt_train an illegal memory access was encountered

Open zhangqianjin opened this issue 4 years ago • 1 comments

onmt_train -data demo/data -save_model demo-model -layers 6 -rnn_size 64 -word_vec_size 64 -transformer_ff 256 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 20000 -max_generator_batches 2 -batch_size 640 -dropout 0.1 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 1000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 50 -save_checkpoint_steps 500 -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7

When validation begins, the following error occurs:
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCReduceAll.cuh:327
what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fd98245c536 in /data/common_tool/anaconda3/envs/dnn/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7fd98269ffbe in /data/common_tool/anaconda3/envs/dnn/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)

Environment: PyTorch 1.5, CUDA 10.2
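
CUDA errors are reported asynchronously, so the file and line in a trace like the one above may not be where the illegal access actually happened. A possible way to narrow it down, sketched here under assumptions: rerun with `CUDA_LAUNCH_BLOCKING=1` (a standard CUDA environment variable that makes kernel launches synchronous) and with the command reduced to a single GPU. The helper names `debug_env` and `single_gpu_command` are hypothetical, and it is not confirmed that the error reproduces on one GPU.

```python
import os
import shlex


def debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def single_gpu_command(command):
    """Rewrite -world_size and -gpu_ranks for one GPU; keep all other flags as-is."""
    tokens = shlex.split(command)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "-world_size":
            out += ["-world_size", "1"]
            i += 2  # skip the flag and its original value
        elif tok == "-gpu_ranks":
            out += ["-gpu_ranks", "0"]
            i += 1
            # skip every original rank value up to the next flag
            while i < len(tokens) and not tokens[i].startswith("-"):
                i += 1
        else:
            out.append(tok)
            i += 1
    return out


# Shortened version of the command from this issue, for illustration only.
cmd = single_gpu_command(
    "onmt_train -data demo/data -save_model demo-model "
    "-valid_steps 50 -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7"
)
print(" ".join(cmd))
# To actually launch it (assumes onmt_train is on PATH):
# import subprocess; subprocess.run(cmd, env=debug_env(), check=True)
```

If the single-GPU run fails too, the synchronous trace should point closer to the operation that triggers the illegal access; if it only fails with 8 GPUs, that narrows the problem to the multi-GPU path.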

zhangqianjin, Jul 28 '20 10:07

same here

StephenKyung, May 14 '21 04:05