[Bug]: Segfault encountered running Mooncake EP with SGLang.

Open • MaoZiming opened this issue 1 month ago • 1 comment

Bug Report

Testbed: p5en, 2 nodes with 8× H200 each.

Command:

python -m sglang.launch_server \
  --model-path "$MODEL_PATH" \
  --tp-size 16 \
  --ep-size 16 \
  --dp-size 16 \
  --chunked-prefill-size 65536 \
  --nnodes "$NNODES" \
  --node-rank "$NODE_RANK" \
  --dist-init-addr "$DIST_ADDR" \
  --trust-remote-code \
  --elastic-ep-backend mooncake \
  --mem-fraction-static 0.85 \
  --attention-backend flashinfer \
  --ep-num-redundant-experts 16 \
  --ep-dispatch-algorithm dynamic \
  --enable-dp-attention \
  --enable-dp-lm-head \
  --moe-a2a-backend mooncake

Error message:

!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007599e9c5051f
  File "<unknown>", line 0, in ibv_destroy_qp
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
  File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
  File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
  File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
  File "./nptl/pthread_create.c", line 442, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x000072597598051f
  File "<unknown>", line 0, in ibv_destroy_qp
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
  File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
  File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
  File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
  File "./nptl/pthread_create.c", line 442, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007599e9c5051f
  File "<unknown>", line 0, in ibv_destroy_qp
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
  File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
  File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
  File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
  File "./nptl/pthread_create.c", line 442, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x000072597598051f
  File "<unknown>", line 0, in ibv_destroy_qp
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
  File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
  File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
  File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
  File "./nptl/pthread_create.c", line 442, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007b2adaa8051f
  File "<unknown>", line 0, in ibv_destroy_qp
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
  File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
  File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
  File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
  File "./nptl/pthread_create.c", line 442, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007b2adaa8051f
  File "<unknown>", line 0, in ibv_destroy_qp
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
  File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
  File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
  File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
  File "./nptl/pthread_create.c", line 442, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007ddf27c8251f
  File "<unknown>", line 0, in ibv_destroy_qp
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
  File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
  File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
  File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
  File "./nptl/pthread_create.c", line 442, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007ddf27c8251f
  File "<unknown>", line 0, in ibv_destroy_qp
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
  File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
  File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
  File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
  File "./nptl/pthread_create.c", line 442, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x0000700eb3b9c51f
  File "<unknown>", line 0, in ibv_destroy_qp
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
  File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
  File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
  File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
  File "./nptl/pthread_create.c", line 442, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x000070865383751f
  File "<unknown>", line 0, in ibv_destroy_qp
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
  File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
  File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
  File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
  File "./nptl/pthread_create.c", line 442, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x000070865383751f
  File "<unknown>", line 0, in ibv_destroy_qp
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
  File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
  File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
  File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
  File "./nptl/pthread_create.c", line 442, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x0000700eb3b9c51f
  File "<unknown>", line 0, in ibv_destroy_qp
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
  File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
  File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
  File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
  File "./nptl/pthread_create.c", line 442, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007acd0fbd851f
  File "<unknown>", line 0, in ibv_destroy_qp
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
  File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
  File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
  File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
  File "./nptl/pthread_create.c", line 442, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x00007acd0fbd851f
  File "<unknown>", line 0, in ibv_destroy_qp
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
  File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
  File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
  File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
  File "./nptl/pthread_create.c", line 442, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

!!!!!!! Segfault encountered !!!!!!!
  File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x000073d17602751f
  File "<unknown>", line 0, in ibv_destroy_qp
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::deconstruct()
  File "<unknown>", line 0, in mooncake::RdmaEndPoint::~RdmaEndPoint()
  File "<unknown>", line 0, in mooncake::SIEVEEndpointStore::insertEndpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mooncake::RdmaContext*)
  File "<unknown>", line 0, in mooncake::RdmaContext::endpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  File "<unknown>", line 0, in mooncake::WorkerPool::performPostSend(int)
  File "<unknown>", line 0, in mooncake::WorkerPool::transferWorker(int)
  File "./nptl/pthread_create.c", line 442, in start_thread
  File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __GI___clone3
  File "<unknown>", line 0, in 0xffffffffffffffff

Before submitting...

  • [ ] Ensure you searched for relevant issues and read the [documentation]

MaoZiming • Nov 02 '25 19:11

Thank you for the report. Do you recall any error logs that appeared before the segmentation fault message, such as

Failed to create QP: xxx

or

Failed to allocate memory for work request depth list

This may help me investigate!

UNIDY2002 • Nov 06 '25 03:11
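
For anyone triaging a similar crash, the check requested above can be sketched as a small log filter. This is only a minimal sketch, assuming the server output was captured to a text file. The function names and the log path are hypothetical. Only the two message prefixes and the segfault banner text are taken from this thread, and the "xxx" detail after "Failed to create QP:" is left unmatched because it is elided above.

```python
from typing import Iterable, List

# Message prefixes quoted in the comment above. The text after
# "Failed to create QP:" is elided in the thread, so only the prefix is matched.
PATTERNS = (
    "Failed to create QP",
    "Failed to allocate memory for work request depth list",
)

# Banner text that opens each trace in the error message above.
SEGFAULT_MARKER = "Segfault encountered"


def errors_before_segfault(lines: Iterable[str]) -> List[str]:
    """Return lines containing a known pattern that occur before the first segfault banner."""
    hits: List[str] = []
    for line in lines:
        if SEGFAULT_MARKER in line:
            break  # everything after the first banner is crash output
        if any(pattern in line for pattern in PATTERNS):
            hits.append(line.rstrip("\n"))
    return hits


def scan_log_file(path: str) -> List[str]:
    """Hypothetical helper: apply the filter to a saved log file."""
    # errors="replace" keeps the scan going if the log contains stray bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return errors_before_segfault(handle)
```

Calling `scan_log_file("server.log")` (the file name is an assumption) returns the matching lines in order. An empty list means neither message was logged before the first crash banner in that file.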