Mooncake
[Bug]: Segfault encountered running Mooncake EP with SGLang.
Bug Report
Testbed: p5en, 2 nodes with 8 H200 GPUs each.
Command:
python -m sglang.launch_server \
--model-path "$MODEL_PATH" \
--tp-size 16 \
--ep-size 16 \
--dp-size 16 \
--chunked-prefill-size 65536 \
--nnodes "$NNODES" \
--node-rank "$NODE_RANK" \
--dist-init-addr "$DIST_ADDR" \
--trust-remote-code \
--elastic-ep-backend mooncake \
--mem-fraction-static 0.85 \
--attention-backend flashinfer \
--ep-num-redundant-experts 16 \
--ep-dispatch-algorithm dynamic \
--enable-dp-attention \
--enable-dp-lm-head \
--moe-a2a-backend mooncake
Error message:
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007599e9c5051f
File "<unknown>", line 0, in ibv_destroy_qp
File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
File "./nptl/pthread_create.c", line 442, in start_thread
File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
File "<unknown>", line 0, in 0xffffffffffffffff
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x000072597598051f
File "<unknown>", line 0, in ibv_destroy_qp
File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
File "./nptl/pthread_create.c", line 442, in start_thread
File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
File "<unknown>", line 0, in 0xffffffffffffffff
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007599e9c5051f
File "<unknown>", line 0, in ibv_destroy_qp
File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
File "./nptl/pthread_create.c", line 442, in start_thread
File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
File "<unknown>", line 0, in 0xffffffffffffffff
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x000072597598051f
File "<unknown>", line 0, in ibv_destroy_qp
File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
File "./nptl/pthread_create.c", line 442, in start_thread
File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
File "<unknown>", line 0, in 0xffffffffffffffff
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007b2adaa8051f
File "<unknown>", line 0, in ibv_destroy_qp
File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
File "./nptl/pthread_create.c", line 442, in start_thread
File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
File "<unknown>", line 0, in 0xffffffffffffffff
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007b2adaa8051f
File "<unknown>", line 0, in ibv_destroy_qp
File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
File "./nptl/pthread_create.c", line 442, in start_thread
File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
File "<unknown>", line 0, in 0xffffffffffffffff
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007ddf27c8251f
File "<unknown>", line 0, in ibv_destroy_qp
File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
File "./nptl/pthread_create.c", line 442, in start_thread
File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
File "<unknown>", line 0, in 0xffffffffffffffff
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007ddf27c8251f
File "<unknown>", line 0, in ibv_destroy_qp
File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
File "./nptl/pthread_create.c", line 442, in start_thread
File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
File "<unknown>", line 0, in 0xffffffffffffffff
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x0000700eb3b9c51f
File "<unknown>", line 0, in ibv_destroy_qp
File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
File "./nptl/pthread_create.c", line 442, in start_thread
File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
File "<unknown>", line 0, in 0xffffffffffffffff
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x000070865383751f
File "<unknown>", line 0, in ibv_destroy_qp
File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
File "./nptl/pthread_create.c", line 442, in start_thread
File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
File "<unknown>", line 0, in 0xffffffffffffffff
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x000070865383751f
File "<unknown>", line 0, in ibv_destroy_qp
File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
File "./nptl/pthread_create.c", line 442, in start_thread
File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
File "<unknown>", line 0, in 0xffffffffffffffff
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x0000700eb3b9c51f
File "<unknown>", line 0, in ibv_destroy_qp
File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
File "./nptl/pthread_create.c", line 442, in start_thread
File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
File "<unknown>", line 0, in 0xffffffffffffffff
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007acd0fbd851f
File "<unknown>", line 0, in ibv_destroy_qp
File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
File "./nptl/pthread_create.c", line 442, in start_thread
File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
File "<unknown>", line 0, in 0xffffffffffffffff
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007acd0fbd851f
File "<unknown>", line 0, in ibv_destroy_qp
File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
File "./nptl/pthread_create.c", line 442, in start_thread
File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
File "<unknown>", line 0, in 0xffffffffffffffff
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x000073d17602751f
File "<unknown>", line 0, in ibv_destroy_qp
File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
File "./nptl/pthread_create.c", line 442, in start_thread
File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
File "<unknown>", line 0, in 0xffffffffffffffff
Before submitting...
- [ ] Ensure you searched for relevant issues and read the [documentation]
Thank you for your report. Do you recall any error logs that appeared before the segmentation fault message, such as
Failed to create QP: xxx
or
Failed to allocate memory for work request depth list
This may help me investigate!
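If the server output was captured to a file, a small script can pull out candidate lines that precede the first crash. The following is only a minimal sketch: it assumes the two message prefixes quoted above appear verbatim in the log and that the crash banner contains the text "Segfault encountered", as in the report. The function name `errors_before_segfault` and the pattern list are illustrative and not part of Mooncake or SGLang.

```python
import re
from typing import Iterable, List

# Crash banner text as it appears in the report above.
SEGFAULT_MARKER = "Segfault encountered"

# Message prefixes quoted in the reply above (assumed to appear verbatim
# in the log); extend this list with other messages of interest.
SUSPECT_PATTERNS = [
    r"Failed to create QP",
    r"Failed to allocate memory for work request depth list",
]


def errors_before_segfault(
    lines: Iterable[str], patterns: List[str] = SUSPECT_PATTERNS
) -> List[str]:
    """Return lines matching any pattern that occur before the first crash banner."""
    compiled = [re.compile(p) for p in patterns]
    hits = []
    for line in lines:
        # Stop at the first segfault banner; later lines are not of interest here.
        if SEGFAULT_MARKER in line:
            break
        if any(c.search(line) for c in compiled):
            hits.append(line.rstrip("\n"))
    return hits
```

To use it, open the captured log (for example a hypothetical `server.log`) and pass the file object to `errors_before_segfault`. Because the function iterates lazily and stops at the first banner, large logs are not fully loaded into memory.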