TurboTransformers
[TT_ERROR] CUDA runtime error: an illegal memory access was encountered TurboTransformers/turbo_transformers/core/cuda_device_context.cpp:33
Below is rough code to explain what I did.
import torch.multiprocessing as mp

def inference(config):
    data_loader = get_loader(config)  # current_pred, memory_bank, memory_lengths come from the loader / encoder
    while True:
        for step in range(128):
            dec_outs, _ = turbo_decoder(current_pred,
                                        memory_bank,
                                        step,
                                        memory_lengths=memory_lengths)

ctx = mp.get_context("spawn")
p = ctx.Process(target=inference, args=(config,))  # args must be a tuple
p.start()
p.join()
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
I got the above error when using turbo_decoder to generate data for training. The error can appear at any iteration, sometimes after calling inference hundreds of times, sometimes after calling it thousands of times. It looks similar to #174, but I have not found a real solution in two days.
Hopefully you could shed some light on this.
Did huggingface's decoder work for you?
I only tried the OpenNMT transformer decoder, because I have many customized operations and OpenNMT looks more flexible.
So your code works when using OpenNMT, but fails after switching to Turbo?
Yes. The OpenNMT decoder works fine. I used a transformer decoder similar to the one in "Attention Is All You Need". Basically, I used the transformer decoder in modeling_decoders.py and initialized it with the from_onmt method.
It looks like you are using Turbo correctly. It may be a bad case in the Turbo decoder. Can you run the OpenNMT decoder and Turbo simultaneously, and check the results of the two decoders at each step?
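A minimal sketch of such a side-by-side check (hypothetical: onmt_decoder is the original OpenNMT decoder and turbo_decoder is the one built from it with from_onmt, so both are assumed to accept the same arguments as in the snippet above):

import torch

with torch.no_grad():
    for step in range(128):
        ref_outs, _ = onmt_decoder(current_pred, memory_bank, step,
                                   memory_lengths=memory_lengths)
        turbo_outs, _ = turbo_decoder(current_pred, memory_bank, step,
                                      memory_lengths=memory_lengths)
        # Report the first step at which the two decoders diverge.
        max_diff = (ref_outs - turbo_outs).abs().max().item()
        if max_diff > 1e-3:
            print("step", step, "max abs diff", max_diff)
            break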
I have done that. The results match. However, when you call inference many times, it may break at some point.
Specifically, it looks like the inference breaks at the while loop level rather than the inner for loop level as shown in the example code above.
Did you check the device memory usage? The crash may result from excessive GPU memory consumption.
nvidia-smi --id=0 --query-compute-apps=used_memory --format=csv -lms 100
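If it is more convenient to log this from inside the Python process, a rough polling sketch (assuming nvidia-smi is on the PATH; device id and interval are placeholders):

import subprocess
import threading
import time

def log_gpu_memory(device_id=0, interval_s=0.1):
    # Print per-process GPU memory usage every interval_s seconds.
    cmd = ["nvidia-smi", "--id=%d" % device_id,
           "--query-compute-apps=pid,used_memory", "--format=csv,noheader"]
    while True:
        print(subprocess.check_output(cmd).decode().strip())
        time.sleep(interval_s)

threading.Thread(target=log_gpu_memory, daemon=True).start()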
I checked that as well. It uses less than 50%.
By the way, I found that after removing the "with torch.no_grad()" around the inference function, it is less likely to crash and runs longer. Not sure if this information is helpful.
I have some hints for you to debug.
- Use CPU only, to check whether the results always match ONMT.
- Try upgrading your PyTorch version, since I am not sure the error is thrown from Turbo.
I spent another two days debugging. I switched the gcc version, the PyTorch version, and the Docker image, and tried every other approach I could think of, but once the number of calls gets large the error still appears. Basically, the fewer operations I perform on the turbo decoder's inputs and outputs, the longer it runs; I have to deepcopy both the inputs and the outputs before touching them, otherwise it crashes very quickly. Even so, this does not fundamentally solve the problem: after at most ten thousand calls or so it still crashes. Could it be that turbo itself is unstable and tends to break when called repeatedly?
It may be that cub, the third-party memory-management library, is unstable. Try changing the naive allocator to allocate directly from device memory:
return allocate_impl(size, kDLGPU);
allocate_free(mem, kDLGPU);
https://github.com/Tencent/TurboTransformers/blob/master/turbo_transformers/core/allocator/naive_allocator.h#L48 https://github.com/Tencent/TurboTransformers/blob/master/turbo_transformers/core/allocator/naive_allocator.h#L73
Just to confirm: should I replace lines 49-63 and lines 74-80 with the two lines of code you suggested above, and then recompile?
Yes.
It fails to compile... please help!
[180/270] Building CXX object turbo_transformers...eFiles/tt_core.dir/allocator/allocator_api.cpp.o
FAILED: turbo_transformers/core/CMakeFiles/tt_core.dir/allocator/allocator_api.cpp.o
/usr/bin/c++ -DLOGURU_WITH_STREAMS=1 -DTT_BLAS_USE_MKL -DTT_WITH_CUDA -D__CLANG_SUPPORT_DYN_ANNOTATION__ -I/usr/local/cuda/include -I/mnt/TurboTransformers/3rd/cub -I/mnt/TurboTransformers/3rd/FP16/include -I/mnt/TurboTransformers -I/opt/miniconda3/include -I/mnt/TurboTransformers/3rd/abseil -I/mnt/TurboTransformers/3rd/dlpack/include -I/mnt/TurboTransformers/3rd/loguru -Wall -m64 -fopenmp -O3 -DNDEBUG -fPIC -std=gnu++14 -MD -MT turbo_transformers/core/CMakeFiles/tt_core.dir/allocator/allocator_api.cpp.o -MF turbo_transformers/core/CMakeFiles/tt_core.dir/allocator/allocator_api.cpp.o.d -o turbo_transformers/core/CMakeFiles/tt_core.dir/allocator/allocator_api.cpp.o -c /mnt/TurboTransformers/turbo_transformers/core/allocator/allocator_api.cpp
In file included from /mnt/TurboTransformers/turbo_transformers/core/allocator/allocator_api.cpp:21:0:
/mnt/TurboTransformers/turbo_transformers/core/allocator/model_aware_allocator.h: In constructor 'turbo_transformers::core::allocator::ModelAwareAllocator::ModelAwareAllocator(const string&)':
/mnt/TurboTransformers/turbo_transformers/core/allocator/model_aware_allocator.h:118:13: warning: 'turbo_transformers::core::allocator::ModelAwareAllocator::gpu_chunk_list_' will be initialized after [-Wreorder]
ChunkList gpu_chunk_list_;
^~~~~~~~~~~~~~~
/mnt/TurboTransformers/turbo_transformers/core/allocator/model_aware_allocator.h:114:13: warning: 'turbo_transformers::core::allocator::ChunkList turbo_transformers::core::allocator::ModelAwareAllocator::cpu_chunk_list_' [-Wreorder]
ChunkList cpu_chunk_list_;
^~~~~~~~~~~~~~~
/mnt/TurboTransformers/turbo_transformers/core/allocator/model_aware_allocator.h:34:12: warning: when initialized here [-Wreorder]
explicit ModelAwareAllocator(const std::string& model_name)
^~~~~~~~~~~~~~~~~~~
In file included from /mnt/TurboTransformers/turbo_transformers/core/allocator/allocator_api.cpp:22:0:
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h: In member function 'virtual void* turbo_transformers::core::allocator::NaiveAllocator::allocate(size_t, DLDeviceType, const string&)':
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:50:21: error: 'mem' was not declared in this scope
allocate_free(mem, kDLGPU);
^~~
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:50:7: error: 'allocate_free' was not declared in this scope
allocate_free(mem, kDLGPU);
^~~~~~~~~~~~~
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:50:7: note: suggested alternative: 'allocate_impl'
allocate_free(mem, kDLGPU);
^~~~~~~~~~~~~
allocate_impl
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h: In member function 'virtual void turbo_transformers::core::allocator::NaiveAllocator::free(void*, DLDeviceType, const string&)':
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:61:28: error: 'size' was not declared in this scope
return allocate_impl(size, kDLGPU);
^~~~
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:61:28: note: suggested alternative: 'dysize'
return allocate_impl(size, kDLGPU);
^~~~
dysize
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:61:40: error: return-statement with a value, in function returning 'void' [-fpermissive]
return allocate_impl(size, kDLGPU);
^
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:62:7: error: 'allocate_free' was not declared in this scope
allocate_free(mem, kDLGPU);
^~~~~~~~~~~~~
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:62:7: note: suggested alternative: 'allocate_impl'
allocate_free(mem, kDLGPU);
^~~~~~~~~~~~~
allocate_impl
[187/270] Building CXX object turbo_transformers...iles/catch2_test_main.dir/catch2_test_main.cpp.o
ninja: build stopped: subcommand failed.
Use free_impl(mem, kDLGPU); take a look at the CPU API and change the device type to kDLGPU accordingly.
It does look like the stability is related to GPU memory allocation: after modifying the naive allocator as you suggested, it survived more than forty thousand calls before crashing. I also observed two things. First, the data loader outputs 8 tensors, each with a tensor.to(gpu) call, but only two of them are actually used at the moment; removing the .to(gpu) calls for the unused tensors makes it more stable and it runs longer. Second, wrapping the encoder call in with torch.no_grad() makes it unstable, but additionally applying .data.clone() to the encoder outputs helps a lot. Overall the probability of the error has dropped considerably, but after enough calls it still crashes. Is there anything else I can do to improve the stability of repeated calls?
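For reference, a sketch of the workaround pattern described above (hypothetical names taken from the rough snippet at the top of the thread; only the tensors that are actually used are moved to the GPU, and everything handed to or received from the turbo decoder is copied first):

import copy
import torch

with torch.no_grad():
    for batch in data_loader:
        # Move only the two tensors that are actually needed to the GPU.
        src = batch[0].cuda()
        src_lengths = batch[1].cuda()

        # Detach and clone the encoder output so the turbo decoder never
        # shares storage with tensors that the Python side keeps mutating.
        memory_bank = encoder(src, src_lengths).data.clone()

        # Deep-copy anything that is modified around the turbo call.
        current_pred = copy.deepcopy(initial_pred)
        for step in range(128):
            dec_outs, _ = turbo_decoder(current_pred, memory_bank, step,
                                        memory_lengths=src_lengths)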
Check whether your GPU memory consumption is stable. Perhaps a lot of unreleased memory gradually accumulates until the program crashes.
I checked that earlier: even at the moment of the crash, GPU memory usage never exceeded 50%. I also tried enqueuing large batches to overflow the GPU memory; in that case it reports an out-of-memory error directly rather than this "an illegal memory access was encountered" error. That said, although memory never overflowed, it did grow slowly over time, probably because dequeuing could not keep up with enqueuing.
Try checking the memory usage with cuda-memcheck.
OK, thanks for the pointers!
These lines are printed every time before the program starts. May I ask what they mean? Could they be related to the earlier error?
date time ( uptime ) [ thread name/id ] file:line v|
2020-11-13 17:15:14.559 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-11-13 17:15:14.559 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /mnt/parallel
2020-11-13 17:15:14.559 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-11-13 17:15:14.559 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
date time ( uptime ) [ thread name/id ] file:line v|
2020-11-13 17:15:16.154 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-11-13 17:15:16.154 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /mnt/parallel
2020-11-13 17:15:16.154 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-11-13 17:15:16.154 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
date time ( uptime ) [ thread name/id ] file:line v|
2020-11-13 17:15:17.468 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-11-13 17:15:17.468 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /mnt/parallel
2020-11-13 17:15:17.468 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-11-13 17:15:17.468 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
date time ( uptime ) [ thread name/id ] file:line v|
2020-11-13 17:15:18.781 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-11-13 17:15:18.781 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /mnt/parallel
2020-11-13 17:15:18.781 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-11-13 17:15:18.781 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
date time ( uptime ) [ thread name/id ] file:line v|
2020-11-13 17:15:18.793 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-11-13 17:15:18.793 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /mnt/parallel
2020-11-13 17:15:18.793 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-11-13 17:15:18.793 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
date time ( uptime ) [ thread name/id ] file:line v|
2020-11-13 17:15:18.802 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx
2020-11-13 17:15:18.803 ( 0.000s) [main thread ] loguru.cpp:613 INFO| Current dir: /mnt/parallel
2020-11-13 17:15:18.803 ( 0.000s) [main thread ] loguru.cpp:615 INFO| stderr verbosity: 0
2020-11-13 17:15:18.803 ( 0.000s) [main thread ] loguru.cpp:616 INFO| -----------------------------------
2020-11-13 17:15:20.499 ( 1.705s) [main thread ] loguru.cpp:489 INFO| atexit
2020-11-13 17:15:20.956 ( 2.175s) [main thread ] loguru.cpp:489 INFO| atexit
2020-11-13 17:15:20.995 ( 2.192s) [main thread ] loguru.cpp:489 INFO| atexit
2020-11-13 17:15:22.328 ( 4.859s) [main thread ] loguru.cpp:489 INFO| atexit
Nothing to worry about. You can turn it off with: turbo_transformers.set_stderr_verbose_level(0)
Hello! Following your suggestion, I replaced lines 49-63 and lines 74-80 of naive_allocator.h with return allocate_impl(size, kDLGPU); and free_impl(mem, kDLGPU); respectively. After recompiling, the "an illegal memory access was encountered" error still appears in the same place at runtime, so the change had no effect. Is there anything else that needs to be modified? Thanks!
You also only hit this problem after running a lot of steps, right?
On my side the problem shows up after only a few hundred steps. I modified the code as you suggested, but it made no difference. Apart from those two changes, return allocate_impl(size, kDLGPU); and free_impl(mem, kDLGPU);, is there anything else that needs to be modified? Thanks!
My guess is that the CUDA implementation of multiheadedattention has a memory leak. Could you extract a simple unit test for me to debug?
Our company machines do not allow copying data out, so I may need to see whether I can reproduce the error on my own computer. Thanks!
You could randomly initialize an encoder-decoder model and force its decoder to run 10,000 steps. The decoders I have tested run at most a hundred or so steps, so a memory leak might not have shown up.
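A rough sketch of such a stress test (shapes, names, and the call pattern are placeholders copied from the rough snippet above; turbo_decoder stands for a decoder built with from_onmt from a randomly initialized OpenNMT model):

import torch

batch_size, src_len, hidden = 4, 64, 512
memory_bank = torch.rand(batch_size, src_len, hidden, device="cuda")
memory_lengths = torch.full((batch_size,), src_len,
                            dtype=torch.long, device="cuda")
current_pred = torch.zeros(batch_size, 1, dtype=torch.long, device="cuda")

with torch.no_grad():
    for step in range(10000):
        # If a memory leak or illegal access exists, it should surface
        # somewhere in these ten thousand consecutive decoder steps.
        dec_outs, _ = turbo_decoder(current_pred, memory_bank, step,
                                    memory_lengths=memory_lengths)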
Hello! The illegal memory access ("an illegal memory access was encountered") seems very hard to trigger when decoding a single random sentence; I increased the number of steps to several thousand and saw no problem. On my side it mainly shows up when looping over a set of sentences and translating them: after some number of iterations the illegal memory access appears. That can happen after a few dozen sentences, after a few hundred, or sometimes not at all before the run finishes.