
[Bug] llmserver.py throws an error after reaching "Loading checkpoint shards: 100%"

Open gantao21 opened this issue 1 year ago • 21 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

Operating system information

Linux

Python version information

3.10

DB-GPT version

main

Related scenes

  • [X] Chat Data
  • [ ] Chat Excel
  • [ ] Chat DB
  • [ ] Chat Knowledge
  • [ ] Model Management
  • [ ] Dashboard
  • [ ] Plugins

Installation Information

Device information

T4 GPU; GPU count: 1; VRAM: 15 GB

Models information

vicuna-13b-v1.5 ("load_in_4bit": true), text2vec-large-chinese

What happened

Loading checkpoint shards: 100%|██████████| 3/3 [02:30<00:00, 50.26s/it]
/home/miniconda3/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: do_sample is set to False. However, temperature is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/home/miniconda3/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: do_sample is set to False. However, top_p is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
2023-10-13 08:57:01 k161ae pilot.model.loader[4436] INFO Current model is type of: LlamaForCausalLM, load tokenizer by LlamaTokenizer
2023-10-13 08:57:01 k161ae pilot.model.cluster.worker.manager[4436] ERROR Error starting worker manager: expected str, bytes or os.PathLike object, not NoneType
2023-10-13 08:57:01 k161ae asyncio[4436] ERROR Task exception was never retrieved
future: <Task finished name='Task-3' coro=<_setup_fastapi..startup_event..start_worker_manager() done, defined at /home/DB-GPT/pilot/model/cluster/worker/manager.py:657> exception=SystemExit(1)>
Traceback (most recent call last):
  File "/home/DB-GPT/pilot/model/cluster/worker/manager.py", line 659, in start_worker_manager
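For reference, the TypeError in the log above ("expected str, bytes or os.PathLike object, not NoneType") is the standard CPython message raised when a None value reaches a filesystem/path call, i.e. the typical symptom of a model path that was never resolved from the config. A minimal reproduction, unrelated to DB-GPT itself:

```python
import os

# Passing None where a path is expected reproduces the exact message
# seen in the worker-manager log (a symptom of an unresolved model_path).
try:
    os.fspath(None)
except TypeError as exc:
    print(exc)  # expected str, bytes or os.PathLike object, not NoneType
```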

What you expected to happen

1. Could the T4 (15 GB VRAM) simply be running out of memory?

How to reproduce

Run: python /home/DB-GPT/pilot/server/llmserver.py

Additional context

No response

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

gantao21 avatar Oct 13 '23 01:10 gantao21

@gantao21 Hi, I suggest first trying a 7B model, either without quantization or with 8-bit quantization. Also pull the latest main branch code and deploy following the latest docs.

If problems remain, you can run the command dbgpt trace chat to export the relevant information, which makes it easier to troubleshoot together.
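The advice above maps onto DB-GPT's .env model settings; a sketch using the quantization variable names that appear later in this thread (the LLM_MODEL value is an assumption about your local model directory name):

```shell
# .env -- try a 7B model first; on a 15 GB T4, 8-bit quantization is a
# safer middle ground than the 4-bit + 13B combination that failed above.
LLM_MODEL=vicuna-7b-v1.5
QUANTIZE_8bit=True
QUANTIZE_4bit=False
```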

fangyinc avatar Oct 13 '23 11:10 fangyinc

Has this problem been solved? I hit the same issue running on a V100.

lv-stupidboy avatar Oct 27 '23 01:10 lv-stupidboy

Has this problem been solved? I hit the same issue running on a V100.

It's an out-of-VRAM problem; try the suggestions in my note above.

Hi, I suggest first trying a 7B model, either without quantization or with 8-bit quantization. Also pull the latest main branch code and deploy following the latest docs.

If problems remain, you can run the command dbgpt trace chat to export the relevant information, which makes it easier to troubleshoot together.

fangyinc avatar Oct 27 '23 01:10 fangyinc

vicuna-7b works; Baichuan-7B does not.

gantao21 avatar Oct 27 '23 02:10 gantao21

OK, I'll try other models. What puzzles me: the V100 has 32 GB of VRAM and the model only used about 40% of it at startup. I'm deploying version 0.4.0 now; version 0.3.4 could run 13b-v1.5 on the same machine.

lv-stupidboy avatar Oct 27 '23 02:10 lv-stupidboy

OK, I'll try other models. What puzzles me: the V100 has 32 GB of VRAM and the model only used about 40% of it at startup. I'm deploying version 0.4.0 now; version 0.3.4 could run 13b-v1.5 on the same machine.

That shouldn't happen. Could you send the details? I'd like to look at the information exported by dbgpt trace chat.

fangyinc avatar Oct 27 '23 02:10 fangyinc

| Config Key (Webserver) | Config Value (Webserver) | Config Key (EmbeddingModel) | Config Value (EmbeddingModel) |
|---|---|---|---|
| host | 0.0.0.0 | model_name | text2vec |
| port | 5000 | model_path | DBGPT_v0.4.0/DB-GPT-0.4.0/models/text2vec-large-chinese |
| daemon | False | device | cuda |
| controller_addr | None | normalize_embeddings | None |
| model_name | None | | |
| share | False | | |
| remote_embedding | False | | |
| log_level | None | | |
| light | False | | |
| log_file | dbgpt_webserver.log | | |
| tracer_file | dbgpt_webserver_tracer.jsonl | | |

| Config Key (ModelWorker) | Config Value (ModelWorker) | Config Key (WorkerManager) | Config Value (WorkerManager) |
|---|---|---|---|
| model_name | vicuna-13b-v1.5 | model_name | vicuna-13b-v1.5 |
| model_path | /DB-GPT-0.4.0/models/vicuna-13b-v1.5 | model_path | /DBGPT_v0.4.0/DB-GPT-0.4.0/models/vicuna-13b-v1.5 |
| device | cuda | worker_type | None |
| model_type | huggingface | worker_class | None |
| prompt_template | None | model_type | huggingface |
| max_context_size | 4096 | host | 0.0.0.0 |
| num_gpus | None | port | 5000 |
| max_gpu_memory | None | daemon | False |
| cpu_offloading | False | limit_model_concurrency | 5 |
| load_8bit | True | standalone | True |
| load_4bit | False | register | True |
| quant_type | nf4 | worker_register_host | None |
| use_double_quant | True | controller_addr | http://127.0.0.1:5000 |
| compute_dtype | None | send_heartbeat | True |
| trust_remote_code | True | heartbeat_interval | 20 |
| verbose | False | log_level | None |
| | | log_file | dbgpt_model_worker_manager.log |
| | | tracer_file | dbgpt_model_worker_manager_tracer.jsonl |

ModelWorker System information:

| System Config Key | System Config Value |
|---|---|
| platform | linux |
| python_version | 3.10.13 |
| cpu | Intel(R) Xeon(R) Gold 6278C CPU @ 2.60GHz |
| cpu_avx | AVX512 |
| memory | 263601180 kB |
| torch_version | 2.0.1+cu117 |
| device | cuda |
| device_version | 11.7 |
| device_count | 4 |
| device_other | name, driver_version, memory.total [MiB], memory.free [MiB], memory.used [MiB] |
| | Tesla V100S-PCIE-32GB, 495.29.05, 32510 MiB, 32506 MiB, 4 MiB |
| | Tesla V100S-PCIE-32GB, 495.29.05, 32510 MiB, 32506 MiB, 4 MiB |
| | Tesla V100S-PCIE-32GB, 495.29.05, 32510 MiB, 32506 MiB, 4 MiB |
| | Tesla V100S-PCIE-32GB, 495.29.05, 32510 MiB, 32506 MiB, 4 MiB |
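The device_other rows in the trace above follow nvidia-smi's CSV query format (as produced by `nvidia-smi --query-gpu=name,driver_version,memory.total,memory.free,memory.used --format=csv,noheader`). A small parsing sketch, using sample rows copied from the trace:

```python
import csv
import io

# Sample rows copied verbatim from the trace output above.
SAMPLE = """\
Tesla V100S-PCIE-32GB, 495.29.05, 32510 MiB, 32506 MiB, 4 MiB
Tesla V100S-PCIE-32GB, 495.29.05, 32510 MiB, 32506 MiB, 4 MiB
"""

def parse_gpu_rows(text: str) -> list[dict]:
    """Parse nvidia-smi CSV query rows into dicts with numeric MiB fields."""
    rows = []
    for name, driver, total, free, used in csv.reader(
        io.StringIO(text), skipinitialspace=True
    ):
        rows.append({
            "name": name,
            "driver": driver,
            "total_mib": int(total.split()[0]),  # "32510 MiB" -> 32510
            "free_mib": int(free.split()[0]),
            "used_mib": int(used.split()[0]),
        })
    return rows

gpus = parse_gpu_rows(SAMPLE)
print(len(gpus), gpus[0]["free_mib"])  # 2 32506
```

Note that all four cards report ~32506 MiB free, so the model had not been placed on the GPUs at the time of the trace.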

lv-stupidboy avatar Oct 27 '23 02:10 lv-stupidboy

@lv-stupidboy So vicuna-13b-v1.5 starts normally, but you hit out-of-VRAM exceptions in certain usage scenarios? I see you have already enabled 8-bit quantization; with 32 GB of VRAM, a single user should not be able to exhaust it.

fangyinc avatar Oct 27 '23 03:10 fangyinc

But right now, at service startup I see exactly the same symptom as the user above: startup fails. Exception: model vicuna-13b-v1.5@huggingface(xx.xx.xx.xx:7860) start failed, All connection attempts failed

lv-stupidboy avatar Oct 27 '23 03:10 lv-stupidboy

But right now, at service startup I see exactly the same symptom as the user above: startup fails. Exception: model vicuna-13b-v1.5@huggingface(xx.xx.xx.xx:7860) start failed, All connection attempts failed

I can't easily reproduce your problem at the moment. I see you have four cards; consider using all of them: try turning quantization off and forcing a maximum GPU memory per card.

QUANTIZE_8bit=False
QUANTIZE_4bit=False
MAX_GPU_MEMORY=8Gib
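For intuition, a hypothetical sketch of what that cap amounts to: Hugging Face's device_map loading accepts a per-device max_memory mapping, so a single MAX_GPU_MEMORY value conceptually fans out across the visible GPUs (build_max_memory is an illustrative helper, not DB-GPT code):

```python
# Illustrative only: fan a single per-GPU cap out into the max_memory
# mapping shape that transformers' device_map loading understands.
def build_max_memory(num_gpus: int, per_gpu: str) -> dict:
    return {i: per_gpu for i in range(num_gpus)}

print(build_max_memory(4, "8GiB"))
# {0: '8GiB', 1: '8GiB', 2: '8GiB', 3: '8GiB'}
```

With four V100s capped at 8 GiB each, a 13B model in fp16 (~26 GB) can still be sharded across the cards.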

fangyinc avatar Oct 27 '23 03:10 fangyinc

QUANTIZE_8bit=False
QUANTIZE_4bit=False
CUDA_VISIBLE_DEVICES=0,1,2,3
MAX_GPU_MEMORY=8Gib

@fangyinc After adjusting the config, the symptom is exactly the same:

envs/dbgpt040/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: do_sample is set to False. However, temperature is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
envs/dbgpt040/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: do_sample is set to False. However, top_p is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
ERROR [pilot.model.cluster.worker.manager] Error starting worker manager: model vicuna-13b-v1.5@huggingface(xx.xx.xx.xx:7860) start failed, All connection attempts failed
ERROR [asyncio] Task exception was never retrieved
future: <Task finished name='Task-3' coro=<_setup_fastapi..startup_event..start_worker_manager() done, defined at DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py:758> exception=SystemExit(1)>
Traceback (most recent call last):
  File "DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 760, in start_worker_manager
    await worker_manager.start()
  File "DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 578, in start
    return await self.worker_manager.start()
  File "DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 116, in start
    raise Exception(out.message)
Exception: model vicuna-13b-v1.5@huggingface(xx.xx.xx.xx:7860) start failed, All connection attempts failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "uvloop/loop.pyx", line 474, in uvloop.loop.Loop._on_idle
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 763, in start_worker_manager
    sys.exit(1)
SystemExit: 1
INFO [pilot.model.cluster.worker.manager] Stop all workers
INFO [pilot.model.cluster.worker.manager] Apply req: None, apply_func: <function LocalWorkerManager._stop_all_worker.._stop_worker at 0x7f93b03ffe20>
INFO [pilot.model.cluster.worker.manager] Apply to all workers
WARNI [pilot.model.cluster.worker.manager] Stop worker, ignored exception from deregister_func: All connection attempts failed
INFO [pilot.utils.model_utils] Clear torch cache of device: cuda:0
INFO [pilot.utils.model_utils] Clear torch cache of device: cuda:1
INFO [pilot.utils.model_utils] Clear torch cache of device: cuda:2
INFO [pilot.utils.model_utils] Clear torch cache of device: cuda:3
WARNI [pilot.model.cluster.worker.manager] Stop worker, ignored exception from deregister_func: All connection attempts failed
ERROR: Traceback (most recent call last):
  File "DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 760, in start_worker_manager
    await worker_manager.start()
  File "DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 578, in start
    return await self.worker_manager.start()
  File "DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 116, in start
    raise Exception(out.message)
Exception: model vicuna-13b-v1.5@huggingface(xx.xx.xx.xx:7860) start failed, All connection attempts failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "envs/dbgpt040/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1511, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1504, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1377, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 555, in uvloop.loop.Loop._run
  File "uvloop/loop.pyx", line 474, in uvloop.loop.Loop._on_idle
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 763, in start_worker_manager
    sys.exit(1)
SystemExit: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "envs/dbgpt040/lib/python3.10/site-packages/starlette/routing.py", line 686, in lifespan
    await receive()
  File "envs/dbgpt040/lib/python3.10/site-packages/uvicorn/lifespan/on.py", line 137, in receive
    return await self.receive_queue.get()
  File "envs/dbgpt040/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError

lv-stupidboy avatar Oct 27 '23 03:10 lv-stupidboy

QUANTIZE_8bit=False
QUANTIZE_4bit=False
CUDA_VISIBLE_DEVICES=0,1,2,3
MAX_GPU_MEMORY=8Gib

@fangyinc After adjusting the config, the symptom is exactly the same: Exception: model vicuna-13b-v1.5@huggingface(xx.xx.xx.xx:7860) start failed, All connection attempts failed (full log in the previous comment)

Could you share the other errors as well? There should be a more specific error cause before these generic ones.

fangyinc avatar Oct 27 '23 03:10 fangyinc

All the logs printed at service startup are below; please take a look.

# python pilot/server/dbgpt_server.py --host xx.xx.xx.xx --port 7860

=========================== WebWerverParameters ===========================

host: xx.xx.xx.xx
port: 7860
daemon: False
controller_addr: None
model_name: None
share: False
remote_embedding: False
log_level: INFO
light: False
log_file: dbgpt_webserver.log
tracer_file: dbgpt_webserver_tracer.jsonl

======================================================================

4e05d94b5799 (head) heads:None
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
Generating DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/meta_data/alembic/versions/91c18e894c6e_dbgpt_ddl_upate.py ... done
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
INFO [alembic.runtime.migration] Running upgrade 4e05d94b5799 -> 91c18e894c6e, dbgpt ddl upate
INFO [pilot.model.cluster.worker.embedding_worker] [EmbeddingsModelWorker] Parameters of device is None, use cuda
WARNI [sentence_transformers.SentenceTransformer] No sentence-transformers model found with name DBGPT_v0.4.0/DB-GPT-0.4.0/models/text2vec-large-chinese. Creating a new one with MEAN pooling.
Model Unified Deployment Mode!
INFO: Started server process [1872205]
INFO: Waiting for application startup.
INFO [pilot.model.cluster.worker.manager] Begin start all worker, apply_req: None
INFO [pilot.model.cluster.worker.manager] Apply req: None, apply_func: <function LocalWorkerManager._start_all_worker.._start_worker at 0x7f93d00a7370>
INFO [pilot.model.cluster.worker.manager] Apply to all workers
INFO: Application startup complete.
INFO [pilot.model.cluster.worker.default_worker] Begin load model, model params:

=========================== ModelParameters ===========================

model_name: vicuna-13b-v1.5
model_path: DBGPT_v0.4.0/DB-GPT-0.4.0/models/vicuna-13b-v1.5
device: cuda
model_type: huggingface
prompt_template: None
max_context_size: 4096
num_gpus: None
max_gpu_memory: 8Gib
cpu_offloading: False
load_8bit: False
load_4bit: False
quant_type: nf4
use_double_quant: True
compute_dtype: None
trust_remote_code: True
verbose: False

======================================================================

INFO: Uvicorn running on http://xx.xx.xx.xx:7860 (Press CTRL+C to quit)
INFO [pilot.model.loader] There has max_gpu_memory from config: 8Gib
Loading checkpoint shards: 100%|██████████| 3/3 [00:18<00:00, 6.27s/it]
/envs/dbgpt040/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: do_sample is set to False. However, temperature is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/envs/dbgpt040/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: do_sample is set to False. However, top_p is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
ERROR [pilot.model.cluster.worker.manager] Error starting worker manager: model vicuna-13b-v1.5@huggingface(xx.xx.xx.xx:7860) start failed, All connection attempts failed
ERROR [asyncio] Task exception was never retrieved
future: <Task finished name='Task-3' coro=<_setup_fastapi..startup_event..start_worker_manager() done, defined at DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py:758> exception=SystemExit(1)>
Traceback (most recent call last):
  File "/DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 760, in start_worker_manager
    await worker_manager.start()
  File "/DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 578, in start
    return await self.worker_manager.start()
  File "/DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 116, in start
    raise Exception(out.message)
Exception: model vicuna-13b-v1.5@huggingface(xx.xx.xx.xx:7860) start failed, All connection attempts failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "uvloop/loop.pyx", line 474, in uvloop.loop.Loop._on_idle
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 763, in start_worker_manager
    sys.exit(1)
SystemExit: 1
INFO [pilot.model.cluster.worker.manager] Stop all workers
INFO [pilot.model.cluster.worker.manager] Apply req: None, apply_func: <function LocalWorkerManager._stop_all_worker.._stop_worker at 0x7f93b03ffe20>
INFO [pilot.model.cluster.worker.manager] Apply to all workers
WARNI [pilot.model.cluster.worker.manager] Stop worker, ignored exception from deregister_func: All connection attempts failed
INFO [pilot.utils.model_utils] Clear torch cache of device: cuda:0
INFO [pilot.utils.model_utils] Clear torch cache of device: cuda:1
INFO [pilot.utils.model_utils] Clear torch cache of device: cuda:2
INFO [pilot.utils.model_utils] Clear torch cache of device: cuda:3
WARNI [pilot.model.cluster.worker.manager] Stop worker, ignored exception from deregister_func: All connection attempts failed
ERROR: Traceback (most recent call last):
  File "/DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 760, in start_worker_manager
    await worker_manager.start()
  File "/DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 578, in start
    return await self.worker_manager.start()
  File "/DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 116, in start
    raise Exception(out.message)
Exception: model vicuna-13b-v1.5@huggingface(xx.xx.xx.xx:7860) start failed, All connection attempts failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "envs/dbgpt040/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1511, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1504, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1377, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 555, in uvloop.loop.Loop._run
  File "uvloop/loop.pyx", line 474, in uvloop.loop.Loop._on_idle
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 763, in start_worker_manager
    sys.exit(1)
SystemExit: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/envs/dbgpt040/lib/python3.10/site-packages/starlette/routing.py", line 686, in lifespan
    await receive()
  File "/envs/dbgpt040/lib/python3.10/site-packages/uvicorn/lifespan/on.py", line 137, in receive
    return await self.receive_queue.get()
  File "/envs/dbgpt040/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError

lv-stupidboy avatar Oct 27 '23 06:10 lv-stupidboy

@lv-stupidboy Hi, in the startup command python pilot/server/dbgpt_server.py --host xx.xx.xx.xx --port 7860, is the xx.xx.xx.xx deliberately masked? I see you started in standalone mode; in your scenario this parameter can be left unset. The default is 0.0.0.0, which listens on all of the machine's IP addresses.

fangyinc avatar Oct 27 '23 06:10 fangyinc

@fangyinc Nothing special; I masked the IP before posting. I added the host and port because the service runs on a Linux machine and I need to reach the web UI via ip:port; with the default parameters I couldn't access it from the browser.

lv-stupidboy avatar Oct 27 '23 06:10 lv-stupidboy

@fangyinc Nothing special; I masked the IP before posting. I added the host and port because the service runs on a Linux machine and I need to reach the web UI via ip:port; with the default parameters I couldn't access it from the browser.

So does it start normally when you omit the --host parameter? This problem is most likely related to your --host IP address. In unified deployment mode, DB-GPT starts several components by default, and they need a http://127.0.0.1:port address to communicate with each other. Because you started with a specific host, http://127.0.0.1:port can no longer be used for inter-service communication.

In theory, --host 0.0.0.0 listens on all of the machine's addresses. Once the service is up, you can definitely reach it in a browser via http://ip:port (just don't browse to http://0.0.0.0:port).
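The binding behaviour described above can be demonstrated with plain sockets, independent of DB-GPT: a server bound to 0.0.0.0 accepts loopback connections, while one bound only to a specific external interface would refuse the same 127.0.0.1 connection, which is exactly what the internal components hit. A minimal sketch:

```python
import socket

def reachable_via_loopback(bind_host: str) -> bool:
    """Bind a listener to bind_host, then try to connect to it via 127.0.0.1."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        srv.bind((bind_host, 0))          # port 0: let the OS pick a free port
        srv.listen(1)
        port = srv.getsockname()[1]
        try:
            with socket.create_connection(("127.0.0.1", port), timeout=0.5):
                return True
        except OSError:
            return False
    finally:
        srv.close()

# 0.0.0.0 listens on every interface, so loopback clients get through:
print(reachable_via_loopback("0.0.0.0"))  # True
# Binding to a single external IP (as --host xx.xx.xx.xx does) would make
# the same 127.0.0.1 connection fail -- the inter-component failure above.
```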

fangyinc avatar Oct 27 '23 07:10 fangyinc

Is the default port 7860? After starting without any parameters there is no error and the process stays up, but port 7860 is not in a listening state, so I can't reach the web UI from a browser.

lv-stupidboy avatar Oct 27 '23 08:10 lv-stupidboy

Is the default port 7860? After starting without any parameters there is no error and the process stays up, but port 7860 is not in a listening state, so I can't reach the web UI from a browser.

Is the earlier problem solved? The default port is 5000; see the installation docs for details.
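The "port not listening" symptom can be checked directly from Python before reaching for a browser; a small sketch (the default port 5000 comes from the reply above):

```python
import socket

def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Quick check against DB-GPT's default webserver port:
print(port_is_listening("127.0.0.1", 5000))
```

If this prints False on the server itself, the process never bound the port; if it prints True locally but the browser still can't connect, suspect a firewall or allowlist, as discussed below.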

fangyinc avatar Oct 27 '23 10:10 fangyinc

The service seems to start normally now, but the web UI is still unreachable. It's probably a firewall or a port allowlist restriction.

lv-stupidboy avatar Oct 27 '23 10:10 lv-stupidboy

@lv-stupidboy Did you resolve the web-access problem?

csunny avatar Nov 27 '23 15:11 csunny

After a few more attempts the service now starts normally. The current startup command is python dbgpt_server.py --port 7860.

lv-stupidboy avatar Nov 28 '23 14:11 lv-stupidboy

This issue has been marked as stale, because it has been over 30 days without any activity.

github-actions[bot] avatar Feb 19 '24 21:02 github-actions[bot]

This issue has been closed, because it has been marked as stale and there has been no activity for over 7 days.

github-actions[bot] avatar Feb 26 '24 21:02 github-actions[bot]