[Bug] maximum recursion depth exceeded

Open kebe7jun opened this issue 10 months ago • 12 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • [ ] 5. Please use English, otherwise it will be closed.

Describe the bug

Maximum recursion depth triggered on exception exit.

    self._send_signal(sig)
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 1266, in _send_signal
    os.kill(self.pid, sig)
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 333, in sigquit_handler
    kill_process_tree(os.getpid())
  File "/sgl-workspace/sglang/python/sglang/srt/utils.py", line 492, in kill_process_tree
    children = itself.children(recursive=True)
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 971, in children
    self._raise_if_pid_reused()
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 461, in _raise_if_pid_reused
    if self._pid_reused or (not self.is_running() and self._pid_reused):
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 636, in is_running
    self._pid_reused = self != Process(self.pid)
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 319, in __init__
    self._init(pid)
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 355, in _init
    self._ident = self._get_ident()
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 396, in _get_ident
    return (self.pid, self.create_time())
  File "/usr/local/lib/python3.10/dist-packages/psutil/__init__.py", line 778, in create_time
    self._create_time = self._proc.create_time()
  File "/usr/local/lib/python3.10/dist-packages/psutil/_pslinux.py", line 1716, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/psutil/_pslinux.py", line 1957, in create_time
    ctime = float(self._parse_stat_file()['create_time'])
  File "/usr/local/lib/python3.10/dist-packages/psutil/_pslinux.py", line 1716, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/psutil/_common.py", line 508, in wrapper
    raise raise_from(err, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.10/dist-packages/psutil/_common.py", line 506, in wrapper
    return fun(self)
  File "/usr/local/lib/python3.10/dist-packages/psutil/_pslinux.py", line 1784, in _parse_stat_file
    data = bcat("%s/%s/stat" % (self._procfs_path, self.pid))
  File "/usr/local/lib/python3.10/dist-packages/psutil/_common.py", line 851, in bcat
    return cat(fname, fallback=fallback, _open=open_binary)
  File "/usr/local/lib/python3.10/dist-packages/psutil/_common.py", line 839, in cat
    with _open(fname) as f:
  File "/usr/local/lib/python3.10/dist-packages/psutil/_common.py", line 799, in open_binary
    return open(fname, "rb", buffering=FILE_READ_BUFFER_SIZE)
RecursionError: maximum recursion depth exceeded while calling a Python object
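
One plausible reading of this trace, offered as an assumption rather than a confirmed diagnosis: `os.kill(self.pid, sig)` at the top is immediately followed by `sigquit_handler`, which calls `kill_process_tree(os.getpid())` again, so the handler may be re-entering itself each time the process signals its own PID. A minimal sketch of a re-entrancy guard for that pattern is below. `make_guarded_handler`, `fake_cleanup`, and `demo_handler` are hypothetical names for illustration and are not part of sglang or psutil; the re-entry is simulated with a direct call instead of a real signal.

```python
import signal


def make_guarded_handler(cleanup):
    """Return a signal handler that runs `cleanup` at most once.

    `cleanup` is a hypothetical stand-in for whatever tears down the
    process tree. If it ends up re-triggering this handler, the nested
    call returns immediately instead of recursing.
    """
    state = {"running": False, "calls": 0}

    def handler(signum, frame):
        if state["running"]:
            return  # re-entered while cleanup is in progress: do nothing
        state["running"] = True
        state["calls"] += 1
        cleanup()

    handler.state = state  # exposed only so the sketch can be inspected
    return handler


# Simulate the suspected loop: cleanup re-triggers the same handler.
def fake_cleanup():
    demo_handler(signal.SIGTERM, None)


demo_handler = make_guarded_handler(fake_cleanup)
demo_handler(signal.SIGTERM, None)
print(demo_handler.state["calls"])  # prints 1
```

Without the `running` check, `fake_cleanup` and the handler would call each other until Python raised `RecursionError`, which matches the shape of the trace above. Whether the actual fix in the linked PR takes this approach is not stated in this thread.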

Reproduction

N/A

Environment

root@g1805:/sgl-workspace# python3 -m sglang.check_env
INFO 02-12 08:49:13 init.py:190] Automatically detected platform cuda.
Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 4090
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.78
PyTorch: 2.5.1+cu124
sgl_kernel: 0.0.3.post3
flashinfer: 0.2.0.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.61.1
tiktoken: 0.8.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0  X     PHB   SYS   SYS   SYS   SYS   SYS   SYS   0-27,56-83     0              N/A
GPU1  PHB   X     SYS   SYS   SYS   SYS   SYS   SYS   0-27,56-83     0              N/A
GPU2  SYS   SYS   X     PHB   SYS   SYS   SYS   SYS   0-27,56-83     0              N/A
GPU3  SYS   SYS   PHB   X     SYS   SYS   SYS   SYS   0-27,56-83     0              N/A
GPU4  SYS   SYS   SYS   SYS   X     PHB   SYS   SYS   28-55,84-111   1              N/A
GPU5  SYS   SYS   SYS   SYS   PHB   X     SYS   SYS   28-55,84-111   1              N/A
GPU6  SYS   SYS   SYS   SYS   SYS   SYS   X     PHB   28-55,84-111   1              N/A
GPU7  SYS   SYS   SYS   SYS   SYS   SYS   PHB   X     28-55,84-111   1              N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 65535

kebe7jun avatar Feb 12 '25 08:02 kebe7jun

[Image] This is the issue in the latest Docker build.

sk2011-ship-it avatar Feb 12 '25 13:02 sk2011-ship-it

[Image] This is the issue in the latest Docker build.

This issue is a duplicate of https://github.com/sgl-project/sglang/issues/3525

jhinpan avatar Feb 12 '25 18:02 jhinpan

Thank you for pointing that out, @kebe7jun. We will quickly review your PR and let you know. cc @zhaochenyang20

jhinpan avatar Feb 12 '25 18:02 jhinpan

Do you have a PR to fix this? @kebe7jun @sk2011-ship-it @jhinpan

zhaochenyang20 avatar Feb 13 '25 06:02 zhaochenyang20

@zhaochenyang20 I believe @kebe7jun's PR to fix this issue is here: https://github.com/sgl-project/sglang/pull/3519. It is waiting for review.

jhinpan avatar Feb 13 '25 19:02 jhinpan

@jhinpan I will take a look. Thanks!

zhaochenyang20 avatar Feb 14 '25 00:02 zhaochenyang20

I have installed datasets and the issue still exists, so it does not seem to be a dependency problem.

robscc avatar Feb 17 '25 01:02 robscc

[Image] This is the issue in the latest Docker build.

How can I enable subprocess logging or watch the subprocess log? Any tips are welcome.

robscc avatar Feb 17 '25 01:02 robscc

I encountered the same problem. @zhaochenyang20

zwdgit avatar Feb 17 '25 04:02 zwdgit

https://github.com/sgl-project/sglang/pull/3519

I will merge this PR today. @zwdgit @jhinpan @kebe7jun Ping me if not finished today.

zhaochenyang20 avatar Feb 17 '25 17:02 zhaochenyang20

#3519

I will merge this PR today. @zwdgit @jhinpan @kebe7jun Ping me if not finished today.

@zhaochenyang20 The merge failed.

zwdgit avatar Feb 19 '25 05:02 zwdgit

@zwdgit Too many to merge. Please remind me. Thanks!

zhaochenyang20 avatar Feb 19 '25 08:02 zhaochenyang20

Is there any new progress? @zhaochenyang20

issaccv avatar Feb 25 '25 07:02 issaccv

@issaccv I am trying to get it through CI and merge it.

zhaochenyang20 avatar Feb 25 '25 08:02 zhaochenyang20

Just in case this issue persists ... The fix for us was to pip reinstall datasets. From there on we had no more crashes caused by recursion depth.

Cachet23 avatar Feb 26 '25 08:02 Cachet23