[Bug] maximum recursion depth exceeded
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 5. Please use English, otherwise it will be closed.
Describe the bug
Maximum recursion depth triggered on exception exit.
    self._send_signal(sig)
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 1266, in _send_signal
    os.kill(self.pid, sig)
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 333, in sigquit_handler
    kill_process_tree(os.getpid())
  File "/sgl-workspace/sglang/python/sglang/srt/utils.py", line 492, in kill_process_tree
    children = itself.children(recursive=True)
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 971, in children
    self._raise_if_pid_reused()
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 461, in _raise_if_pid_reused
    if self._pid_reused or (not self.is_running() and self._pid_reused):
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 636, in is_running
    self._pid_reused = self != Process(self.pid)
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 319, in __init__
    self._init(pid)
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 355, in _init
    self._ident = self._get_ident()
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 396, in _get_ident
    return (self.pid, self.create_time())
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 778, in create_time
    self._create_time = self._proc.create_time()
  File "/usr/local/lib/python3.10/dist-packages/psutil/_pslinux.py", line 1716, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/psutil/_pslinux.py", line 1957, in create_time
    ctime = float(self._parse_stat_file()['create_time'])
  File "/usr/local/lib/python3.10/dist-packages/psutil/_pslinux.py", line 1716, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/psutil/_common.py", line 508, in wrapper
    raise raise_from(err, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.10/dist-packages/psutil/_common.py", line 506, in wrapper
    return fun(self)
  File "/usr/local/lib/python3.10/dist-packages/psutil/_pslinux.py", line 1784, in _parse_stat_file
    data = bcat("%s/%s/stat" % (self._procfs_path, self.pid))
  File "/usr/local/lib/python3.10/dist-packages/psutil/_common.py", line 851, in bcat
    return cat(fname, fallback=fallback, _open=open_binary)
  File "/usr/local/lib/python3.10/dist-packages/psutil/_common.py", line 839, in cat
    with _open(fname) as f:
  File "/usr/local/lib/python3.10/dist-packages/psutil/_common.py", line 799, in open_binary
    return open(fname, "rb", buffering=FILE_READ_BUFFER_SIZE)
RecursionError: maximum recursion depth exceeded while calling a Python object
Reproduction
N/A
Environment
root@g1805:/sgl-workspace# python3 -m sglang.check_env
INFO 02-12 08:49:13 init.py:190] Automatically detected platform cuda.
Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 4090
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.78
PyTorch: 2.5.1+cu124
sgl_kernel: 0.0.3.post3
flashinfer: 0.2.0.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.61.1
tiktoken: 0.8.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity    NUMA Affinity  GPU NUMA ID
GPU0    X     PHB   SYS   SYS   SYS   SYS   SYS   SYS   0-27,56-83      0              N/A
GPU1    PHB   X     SYS   SYS   SYS   SYS   SYS   SYS   0-27,56-83      0              N/A
GPU2    SYS   SYS   X     PHB   SYS   SYS   SYS   SYS   0-27,56-83      0              N/A
GPU3    SYS   SYS   PHB   X     SYS   SYS   SYS   SYS   0-27,56-83      0              N/A
GPU4    SYS   SYS   SYS   SYS   X     PHB   SYS   SYS   28-55,84-111    1              N/A
GPU5    SYS   SYS   SYS   SYS   PHB   X     SYS   SYS   28-55,84-111    1              N/A
GPU6    SYS   SYS   SYS   SYS   SYS   SYS   X     PHB   28-55,84-111    1              N/A
GPU7    SYS   SYS   SYS   SYS   SYS   SYS   PHB   X     28-55,84-111    1              N/A
Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
ulimit soft: 65535
This issue occurs in the latest Docker build.
This issue is a duplicate of https://github.com/sgl-project/sglang/issues/3525
Thank you for pointing that out, @kebe7jun. We will review your PR quickly and let you know. cc @zhaochenyang20
Do you have a PR that fixes this? @kebe7jun @sk2011-ship-it @jhinpan
@zhaochenyang20 I believe @kebe7jun's PR to fix this issue is https://github.com/sgl-project/sglang/pull/3519, which is waiting for review.
@jhinpan I will take a look. Thanks!
I have installed datasets and the issue still exists, so it does not seem to be a dependency problem.
How can I enable subprocess logging or watch the subprocess log? Any tips are welcome.
I encountered the same problem. @zhaochenyang20
https://github.com/sgl-project/sglang/pull/3519
I will merge this PR today. @zwdgit @jhinpan @kebe7jun Ping me if not finished today.
@zhaochenyang20 The merge failed.
@zwdgit I have too many PRs to merge. Please remind me. Thanks!
Is there any new progress? @zhaochenyang20
@issaccv I am trying to get CI to pass so I can merge it.
Just in case this issue persists: the fix for us was to reinstall datasets with pip. After that we had no more crashes caused by recursion depth.
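Since reports in this thread disagree on whether `datasets` is involved, one way to check is to import it directly in the same environment and look at the full error, if there is one. A minimal sketch, where `try_import` is a hypothetical helper and not part of sglang:

```python
import importlib
import traceback


def try_import(module_name):
    """Import a module and return (ok, details) instead of raising."""
    try:
        importlib.import_module(module_name)
    except Exception:
        # Return the full traceback so the underlying failure is visible.
        return False, traceback.format_exc()
    return True, ""


ok, details = try_import("datasets")
print("datasets import ok" if ok else details)
```

If the import fails here, the printed traceback points at the missing or broken package. If it succeeds, the recursion error likely has another trigger.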