
Running xinference-local --host 0.0.0.0 --port 9997 after installation reports an error

Open pan-common opened this issue 1 year ago • 7 comments

System Info

Ubuntu 20.04, NVIDIA-SMI 535.104.05
Driver Version: 535.104.05, CUDA Version: 12.2

Running Xinference with Docker?

  • [ ] docker
  • [X] pip install
  • [ ] installation from source

Version info

Name: xinference
Version: 0.13.0
Summary: Model Serving Made Easy
Home-page: https://github.com/xorbitsai/inference
Author: Qin Xuye
Author-email: [email protected]
License: Apache License 2.0
Location: /root/anaconda3/envs/py311/lib/python3.11/site-packages
Requires: aioprometheus, async-timeout, click, fastapi, fsspec, gradio, huggingface-hub, modelscope, openai, opencv-contrib-python, passlib, peft, pillow, pydantic, pynvml, python-jose, requests, s3fs, sse-starlette, tabulate, timm, torch, tqdm, typer, typing-extensions, uvicorn, xoscar
Required-by:

The command used to start Xinference

xinference-local --host 0.0.0.0 --port 9997

Reproduction

(py311) root@b721c068038e:/opt/xinference# xinference-local --host 0.0.0.0 --port 9997
2024-07-10 12:28:08,395 xinference.core.supervisor 83095 INFO Xinference supervisor 0.0.0.0:44062 started
2024-07-10 12:28:08,425 xinference.core.worker 83095 INFO Starting metrics export server at 0.0.0.0:None
2024-07-10 12:28:08,431 xinference.core.worker 83095 INFO Checking metrics export server...
2024-07-10 12:28:09,600 xinference.core.worker 83095 INFO Metrics server is started at: http://0.0.0.0:41815
2024-07-10 12:28:09,601 xinference.core.worker 83095 INFO Xinference worker 0.0.0.0:44062 started
2024-07-10 12:28:09,602 xinference.core.worker 83095 INFO Purge cache directory: /root/.xinference/cache
2024-07-10 12:28:11,604 xinference.core.worker 83095 ERROR Report status got error.
Traceback (most recent call last):
  File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/xinference/core/worker.py", line 800, in report_status
    status = await asyncio.to_thread(gather_node_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/py311/lib/python3.11/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/xinference/core/worker.py", line 799, in report_status
    async with timeout(2):
  File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/async_timeout/__init__.py", line 141, in __aexit__
    self._do_exit(exc_type)
  File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/async_timeout/__init__.py", line 228, in _do_exit
    raise asyncio.TimeoutError
TimeoutError
2024-07-10 12:28:14,296 xinference.api.restful_api 82961 INFO Starting Xinference at endpoint: http://0.0.0.0:9997
2024-07-10 12:28:14,648 uvicorn.error 82961 INFO Uvicorn running on http://0.0.0.0:9997 (Press CTRL+C to quit)
2024-07-10 12:28:18,618 xinference.core.worker 83095 ERROR Report status got error.
Traceback (most recent call last):
  File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/xinference/core/worker.py", line 800, in report_status
    status = await asyncio.to_thread(gather_node_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/py311/lib/python3.11/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/xinference/core/worker.py", line 799, in report_status
    async with timeout(2):
  File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/async_timeout/__init__.py", line 141, in __aexit__
    self._do_exit(exc_type)
  File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/async_timeout/__init__.py", line 228, in _do_exit
    raise asyncio.TimeoutError
TimeoutError
2024-07-10 12:28:25,628 xinference.core.worker 83095 ERROR Report status got error.
Traceback (most recent call last):
  File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/xinference/core/worker.py", line 800, in report_status
    status = await asyncio.to_thread(gather_node_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/py311/lib/python3.11/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Expected behavior

Xinference should run normally using the GPU.

pan-common avatar Jul 10 '24 12:07 pan-common
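The traceback pattern in the log above (a CancelledError inside `asyncio.to_thread`, followed by a TimeoutError from the `timeout(2)` block) is what a blocking call looks like when it runs longer than its time limit. The sketch below reproduces that pattern with the standard library only. It is not Xinference's code: `gather_node_info_stub` is a hypothetical stand-in for the real status probe, and `asyncio.wait_for` is used here in place of the `async_timeout.timeout` context manager seen in the traceback.

```python
import asyncio
import time


def gather_node_info_stub(delay: float) -> dict:
    # Hypothetical stand-in for a blocking status probe (not Xinference's function).
    time.sleep(delay)
    return {"status": "collected"}


async def report_status(delay: float, limit: float = 2.0) -> str:
    """Run the blocking probe in a worker thread and give up after `limit` seconds."""
    try:
        status = await asyncio.wait_for(
            asyncio.to_thread(gather_node_info_stub, delay), timeout=limit
        )
    except asyncio.TimeoutError:
        # The awaiting task is cancelled; the thread itself keeps running to completion.
        return "timeout"
    return status["status"]
```

If the probe is slow on every cycle, for example because a hardware query hangs, each periodic report would time out in the same way, which would match the repeated errors in the log.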

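@pan-common An error occurred while the worker was reporting its status to the supervisor. First, try enabling debug logging to see whether a more specific error shows up. (Also, the error you posted is incomplete: please paste the full output, including everything after "During handling of the above exception, another exception occurred:".) Then you can bypass the status-report step like this and check whether it starts: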

XINFERENCE_DISABLE_HEALTH_CHECK=1 xinference-local --host 0.0.0.0 --port 9997

ChengjieLi28 avatar Jul 11 '24 02:07 ChengjieLi28
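For anyone starting the server from a script rather than a shell, the same workaround can be sketched in Python. Only the `XINFERENCE_DISABLE_HEALTH_CHECK=1` variable and the `xinference-local --host ... --port ...` command come from the comment above; the helper names `build_launch` and `launch` are hypothetical.

```python
import os
import subprocess


def build_launch(host: str = "0.0.0.0", port: int = 9997,
                 disable_health_check: bool = True):
    """Return (argv, env) for starting xinference-local with the workaround applied."""
    env = dict(os.environ)  # copy, so the current process environment is untouched
    if disable_health_check:
        env["XINFERENCE_DISABLE_HEALTH_CHECK"] = "1"
    argv = ["xinference-local", "--host", host, "--port", str(port)]
    return argv, env


def launch() -> None:
    # Blocks until the server exits; requires xinference-local on PATH.
    argv, env = build_launch()
    subprocess.run(argv, env=env, check=True)
```

Calling `launch()` is equivalent to running the one-line shell command above.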

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar Jul 19 '24 19:07 github-actions[bot]

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar Aug 06 '24 06:08 github-actions[bot]

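> @pan-common An error occurred while the worker was reporting its status to the supervisor. First, try enabling debug logging to see whether a more specific error shows up. (Also, the error you posted is incomplete: please paste the full output, including everything after "During handling of the above exception, another exception occurred:".) Then you can bypass the status-report step like this and check whether it starts:
>
> XINFERENCE_DISABLE_HEALTH_CHECK=1 xinference-local --host 0.0.0.0 --port 9997

I ran into this problem too. After adding XINFERENCE_DISABLE_HEALTH_CHECK=1 as you suggested, it starts. The full error output is as follows: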

`
WARNING 09-26 17:01:26 _custom_ops.py:18] Failed to import from vllm._C with ImportError('/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/vllm/_C.abi3.so: undefined symbol: cuTensorMapEncodeTiled')
2024-09-26 17:01:32,290 xinference.core.supervisor 667146 INFO Xinference supervisor 127.0.0.1:22599 started
/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
2024-09-26 17:01:32,316 xinference.core.worker 667146 INFO Starting metrics export server at 127.0.0.1:None
2024-09-26 17:01:32,322 xinference.core.worker 667146 INFO Checking metrics export server...
2024-09-26 17:01:34,445 xinference.core.worker 667146 INFO Metrics server is started at: http://127.0.0.1:34503
2024-09-26 17:01:34,446 xinference.core.worker 667146 INFO Purge cache directory: /home/hum/.xinference/cache
2024-09-26 17:01:34,449 xinference.core.supervisor 667146 DEBUG [request ee1ead84-7be5-11ef-9d4d-208810cdd0e8] Enter add_worker, args: <xinference.core.supervisor.SupervisorActor object at 0x7f7fa559aff0>,127.0.0.1:22599, kwargs:
2024-09-26 17:01:34,449 xinference.core.supervisor 667146 DEBUG Worker 127.0.0.1:22599 has been added successfully
2024-09-26 17:01:34,449 xinference.core.supervisor 667146 DEBUG [request ee1ead84-7be5-11ef-9d4d-208810cdd0e8] Leave add_worker, elapsed time: 0 s
2024-09-26 17:01:34,449 xinference.core.worker 667146 INFO Connected to supervisor as a fresh worker
2024-09-26 17:01:34,463 xinference.core.worker 667146 INFO Xinference worker 127.0.0.1:22599 started
2024-09-26 17:01:36,466 xinference.core.worker 667146 ERROR Report status got error.
Traceback (most recent call last):
  File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/xinference/core/worker.py", line 1026, in report_status
    status = await asyncio.to_thread(gather_node_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hum/anaconda3/envs/xinf/lib/python3.11/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/xinference/core/worker.py", line 1025, in report_status
    async with timeout(2):
  File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/async_timeout/__init__.py", line 141, in __aexit__
    self._do_exit(exc_type)
  File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/async_timeout/__init__.py", line 228, in _do_exit
    raise asyncio.TimeoutError
TimeoutError
2024-09-26 17:01:36,477 xinference.core.supervisor 667146 DEBUG Worker 127.0.0.1:22599 resources: {}
2024-09-26 17:01:37,274 xinference.core.supervisor 667146 DEBUG Enter get_status, args: <xinference.core.supervisor.SupervisorActor object at 0x7f7fa559aff0>, kwargs:
2024-09-26 17:01:37,275 xinference.core.supervisor 667146 DEBUG Leave get_status, elapsed time: 0 s
2024-09-26 17:01:39,377 xinference.api.restful_api 666994 INFO Starting Xinference at endpoint: http://127.0.0.1:9997
2024-09-26 17:01:39,543 uvicorn.error 666994 INFO Uvicorn running on http://127.0.0.1:9997 (Press CTRL+C to quit)
2024-09-26 17:01:43,485 xinference.core.worker 667146 ERROR Report status got error.
Traceback (most recent call last):
  File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/xinference/core/worker.py", line 1026, in report_status
    status = await asyncio.to_thread(gather_node_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hum/anaconda3/envs/xinf/lib/python3.11/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/xinference/core/worker.py", line 1025, in report_status
    async with timeout(2):
  File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/async_timeout/__init__.py", line 141, in __aexit__
    self._do_exit(exc_type)
  File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/async_timeout/__init__.py", line 228, in _do_exit
    raise asyncio.TimeoutError
TimeoutError
2024-09-26 17:01:50,493 xinference.core.worker 667146 ERROR Report status got error.
Traceback (most recent call last):
  File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/xinference/core/worker.py", line 1026, in report_status
    status = await asyncio.to_thread(gather_node_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hum/anaconda3/envs/xinf/lib/python3.11/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/xinference/core/worker.py", line 1025, in report_status
    async with timeout(2):
  File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/async_timeout/__init__.py", line 141, in __aexit__
    self._do_exit(exc_type)
  File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/async_timeout/__init__.py", line 228, in _do_exit
    raise asyncio.TimeoutError
TimeoutError
`

gs80140 avatar Sep 26 '24 09:09 gs80140

There is a CUDA version requirement, isn't there?

gs80140 avatar Sep 26 '24 09:09 gs80140
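On the CUDA question: the PyTorch warning in the log above reports "found version 11040". CUDA encodes versions as 1000 × major + 10 × minor, so 11040 means the installed driver supports up to CUDA 11.4. The sketch below decodes that number and compares it with the CUDA version a given PyTorch build needs. The 12.1 requirement in the example is an assumption for illustration, since the log does not say which CUDA version that PyTorch build targets, and the helper names are hypothetical.

```python
def decode_cuda_version(encoded: int) -> tuple:
    """Split a CUDA version integer (1000 * major + 10 * minor) into (major, minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def driver_supports(encoded_driver: int, required: tuple) -> bool:
    """True if the driver's CUDA version is at least the required (major, minor)."""
    return decode_cuda_version(encoded_driver) >= tuple(required)


# The value from the log: driver side is CUDA 11.4.
driver = decode_cuda_version(11040)

# Assumed requirement of CUDA 12.1 for the installed PyTorch build (illustrative only).
ok = driver_supports(11040, (12, 1))
```

Under that assumption the check fails, which is consistent with the "driver on your system is too old" warning. The original report lists a driver with CUDA 12.2, so its timeouts may have a different cause.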

I get the same error on the latest version, and startup is very slow. I don't know what is causing it.

jiajunly avatar Nov 24 '24 08:11 jiajunly

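> @pan-common An error occurred while the worker was reporting its status to the supervisor. First, try enabling debug logging to see whether a more specific error shows up. (Also, the error you posted is incomplete: please paste the full output, including everything after "During handling of the above exception, another exception occurred:".) Then you can bypass the status-report step like this and check whether it starts:
>
> XINFERENCE_DISABLE_HEALTH_CHECK=1 xinference-local --host 0.0.0.0 --port 9997

The subsequent errors are just the same TimeoutError repeating, so it appears to be retrying over and over. With the status-report step bypassed, it starts quickly.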

jiajunly avatar Nov 24 '24 08:11 jiajunly