VoxFormer

preds are nan

Open zhangzaibin opened this issue 1 year ago • 6 comments

Thanks for your great work. I have an issue: in stage 2, my preds are NaN right at the start of training, and this then leads to an error. Have you ever encountered this problem? I am training VoxFormer-T.
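A minimal sketch of one way to pin down where the NaNs first appear, assuming a standard PyTorch training loop; the names below are illustrative and not part of VoxFormer:

```python
# Debugging-only sketch (not VoxFormer code): localize the first NaN/Inf.
import torch

# Make the backward pass raise at the op that produced NaN/Inf values,
# instead of failing later with an opaque error. This slows training,
# so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

def check_finite(name, tensor):
    """Raise early if a tensor already contains NaN/Inf on the forward side."""
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"non-finite values detected in {name}")
```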

zhangzaibin avatar May 26 '23 01:05 zhangzaibin

I have this problem too.

KSonPham avatar Jun 10 '23 10:06 KSonPham

Different machines behave differently here. Could you try running it a few more times?

RoboticsYimingLi avatar Jun 11 '23 03:06 RoboticsYimingLi

Yes, for me the problem goes away when I set the number of workers to 0 (though not always), or when I run in a Docker environment (no error whatsoever). Another problem is that a large worker count such as 4 (the default) fills up my 32 GB of memory.
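For reference, a hedged sketch of where the worker count usually lives in an mmdetection3d-style config such as VoxFormer's; the exact keys in your config file may differ:

```python
# Sketch of an mmdetection3d-style data config; field names follow the usual
# mmdet3d convention and may not match the actual VoxFormer config files.
data = dict(
    samples_per_gpu=1,   # per-GPU batch size: keep whatever your config already uses
    workers_per_gpu=0,   # 0 = load data in the main process; slower, but avoids
                         # worker subprocesses and the extra host RAM they consume
    # train=..., val=..., test=...  (unchanged)
)
```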

KSonPham avatar Jun 11 '23 06:06 KSonPham

Is it a CUDA memory error? what(): CUDA error: an illegal memory access was encountered

ziming-liu avatar Aug 27 '23 20:08 ziming-liu

I ran into a similar problem:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fbabb853a22 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x10aa3 (0x7fbac4010aa3 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7fbac4012147 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fbabb83d5a4 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0xa2822a (0x7fb952a2822a in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xa282c1 (0x7fb952a282c1 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #25: + 0x29d90 (0x7fbaeb029d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: __libc_start_main + 0x80 (0x7fbaeb029e40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Strangely, the error does not occur when I debug on the remote server, but it does appear as soon as I run from the remote server's terminal; only very occasionally does it run normally.
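As the error message itself suggests, a hedged sketch of enabling synchronous kernel launches so the stack trace points at the kernel that actually triggered the device-side assert (the variable must be set before any CUDA work, e.g. at the very top of the training entry script):

```python
# Debugging only: synchronous kernel launches slow training down considerably.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the CUDA context is created

import torch  # any CUDA initialization must happen after the env var is set
```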

willemeng avatar Nov 07 '23 08:11 willemeng


I also encountered this issue. Deleting the ./VoxFormer/deform_attn_3d directory and re-uploading it resolved the issue. I'm curious about the reason and hope the author can provide an explanation.

zzk785089755 avatar Jan 29 '24 12:01 zzk785089755