Kaizhi Qian
By the way, I found that after removing the `with torch.no_grad()` wrapper around the inference function, it is less likely to crash and runs longer. Not sure if this information is helpful.
I spent another two days debugging. I swapped the gcc version, the PyTorch version, and the Docker image, and tried every other approach I could think of, but once the number of calls grows large enough, the error still occurs. The basic pattern is: the fewer operations I perform on the turbo decoder's inputs and outputs, the longer it runs. Both inputs and outputs have to be deep-copied before any operation on them, otherwise it crashes very quickly. Even so, this does not fundamentally solve the problem; after roughly ten thousand calls at most, it still errors out. My guess is that turbo itself may be unstable, and repeated calls tend to trigger the problem?
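The deep-copy workaround described above can be sketched as a small wrapper; `decoder_fn` here is a hypothetical stand-in for the actual TurboTransformers decoder call, not the library's real API:

```python
import torch

def call_decoder_safely(decoder_fn, inputs):
    """Call a decoder with defensive copies of inputs and outputs.

    Cloning detaches the tensors from any memory the library may still
    reference, mirroring the deep-copy workaround described above.
    """
    # Clone inputs so the library never aliases tensors we later mutate.
    safe_inputs = [t.detach().clone() for t in inputs]
    outputs = decoder_fn(*safe_inputs)
    # Clone outputs before any further ops on them.
    if isinstance(outputs, torch.Tensor):
        return outputs.detach().clone()
    return [o.detach().clone() for o in outputs]
```

This only avoids aliasing between our tensors and whatever the library holds internally; it does not fix the underlying allocator issue.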
May I confirm: should I replace lines 49 to 63 and lines 74 to 80 with the two lines of code you suggested above, and then recompile?
It errored out… help, please! ``` [180/270] Building CXX object turbo_transformers...eFiles/tt_core.dir/allocator/allocator_api.cpp.o FAILED: turbo_transformers/core/CMakeFiles/tt_core.dir/allocator/allocator_api.cpp.o /usr/bin/c++ -DLOGURU_WITH_STREAMS=1 -DTT_BLAS_USE_MKL -DTT_WITH_CUDA -D__CLANG_SUPPORT_DYN_ANNOTATION__ -I/usr/local/cuda/include -I/mnt/TurboTransformers/3rd/cub -I/mnt/TurboTransformers/3rd/FP16/include -I/mnt/TurboTransformers -I/opt/miniconda3/include -I/mnt/TurboTransformers/3rd/abseil -I/mnt/TurboTransformers/3rd/dlpack/include -I/mnt/TurboTransformers/3rd/loguru -Wall -m64 -fopenmp -O3 -DNDEBUG -fPIC -std=gnu++14...
It does look like the instability is related to GPU memory allocation: after modifying the naive allocator as you suggested, it ran for more than forty thousand calls before erroring out. I also observed two things. First, the dataloader outputs 8 tensors, each with a tensor.to(gpu) operation, but only two of them are currently used; removing the .to(gpu) calls for the unused tensors makes it more stable and it runs longer. Second, wrapping the encoder call in `with torch.no_grad()` is unstable, but if I also add `.data.clone()` to every encoder output, things improve a lot. Overall, the probability of the error has dropped substantially, but it still crashes after enough calls. Are there any other ways to improve the stability of repeated calls?
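The second observation above can be written as a small helper, a minimal sketch assuming a generic PyTorch-style `encoder` callable:

```python
import torch

def encode(encoder, x):
    """Run the encoder without autograd, then clone outputs immediately.

    Cloning detaches the result from any buffer the inference path may
    reuse, mirroring the `.data.clone()` workaround described above.
    """
    with torch.no_grad():
        out = encoder(x)
    return out.data.clone()
```

The clone costs one extra copy per call, but it guarantees downstream ops never touch memory the library might recycle.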
I have watched for this before: even at the moment of the crash, GPU memory usage never exceeded 50%. I also tried enqueuing large batches to deliberately overflow GPU memory; in that case it reports an out-of-memory error directly, rather than this "an illegal memory access was encountered". That said, although memory never overflowed, it did grow slowly over time; possibly dequeuing could not keep up with enqueuing.
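To track the slow growth mentioned above, one can log allocated memory every N calls; a minimal sketch using PyTorch's built-in counters (these only cover memory allocated through PyTorch, not allocations made internally by other libraries):

```python
import torch

def gpu_mem_mb():
    """Currently allocated CUDA memory in MB (0.0 when no GPU is visible)."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.memory_allocated() / 2**20
```

Printing `gpu_mem_mb()` every few thousand calls makes a slow leak visible long before any crash.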
This is printed every time before the program starts; may I ask what it means? Could it be related to the earlier errors? ``` date time ( uptime ) [ thread name/id ] file:line v| 2020-11-13 17:15:14.559 ( 0.000s) [main thread ] loguru.cpp:610 INFO| arguments: turbo_transformers_cxx 2020-11-13 17:15:14.559 ( 0.000s) [main thread...
It's a long story. I have many other packages installed in the same image, and I have used this for development since the very beginning. If I switch to conda,...
After hours of struggling with strange errors, I finally got the following output. It is compiled using gcc/g++ 6.5.0 and running on a V100 GPU. Do these numbers look reasonable to...