fastertransformer_backend
Failed to run on H100 GPU with tensor para=8
The same setup works fine on A100x8, but on H100x8 we see the errors below.
Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 30) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x000000000001677b uct_iface_mp_chunk_alloc_inner() /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/base/uct_mem.c:467
2 0x000000000001677b uct_iface_mp_chunk_alloc() /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/base/uct_mem.c:443
3 0x0000000000052c4b ucs_mpool_grow() /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/ucs/datastruct/mpool.c:266
4 0x0000000000052ec9 ucs_mpool_get_grow() /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/ucs/datastruct/mpool.c:316
5 0x000000000001b418 uct_mm_iface_t_init() /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/sm/mm/base/mm_iface.c:821
We have run into the same issue. Does anyone have any clue?
Hi @sfc-gh-zhwang , have you found a solution yet? I'm having the same issue here with running it on Kubernetes.
@Wenhan-Tan I just encountered the same issue. In my case it happened because huge pages were enabled on the physical machine, and UCX triggered a SIGBUS when it tried to allocate memory from them. Everything worked fine after I disabled huge pages.
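For anyone who wants to check their own hosts, a quick way to see whether static huge pages are reserved is to read `/proc/meminfo`. Below is a minimal sketch assuming a standard Linux host; it only uses the kernel's `/proc` and `/sys` interfaces and is not specific to FasterTransformer, Triton, or UCX.

```python
#!/usr/bin/env python3
# Minimal sketch: report whether static huge pages are reserved
# (HugePages_Total in /proc/meminfo) and the current transparent
# huge page mode. Standard Linux interfaces only.

def meminfo_field(name: str) -> int:
    """Return a numeric field from /proc/meminfo, or 0 if absent."""
    with open("/proc/meminfo") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key.strip() == name:
                return int(value.split()[0])
    return 0

def thp_setting() -> str:
    """Return the active transparent huge page mode, e.g. 'madvise'."""
    try:
        with open("/sys/kernel/mm/transparent_hugepage/enabled") as f:
            text = f.read()
        # The active mode is shown in brackets, e.g. "always [madvise] never".
        return text[text.index("[") + 1:text.index("]")]
    except (FileNotFoundError, ValueError):
        return "unknown"

if __name__ == "__main__":
    total = meminfo_field("HugePages_Total")
    print(f"HugePages_Total: {total}")
    print(f"transparent_hugepage: {thp_setting()}")
    if total > 0:
        print("Static huge pages are reserved; UCX may try to allocate from them.")
```

If `HugePages_Total` is non-zero and you want to reproduce the "disabled" state described above, releasing the reservation (for example by setting the `vm.nr_hugepages` sysctl back to 0) is one way to do it; whether that is appropriate depends on what else on the machine relies on huge pages.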
@sphish Thank you! I saw another similar issue here (https://github.com/NVIDIA/TensorRT-LLM/issues/674) which uses TRT-LLM instead of FT, but in that issue huge pages needed to be enabled. I'll try disabling huge pages first.
I think the key is that the container and the bare-metal host need to have the same huge page configuration.