DB-GPT
[Bug] llmserver.py throws an error after reaching Loading checkpoint shards: 100%
Search before asking
- [X] I had searched in the issues and found no similar issues.
Operating system information
Linux
Python version information
3.10
DB-GPT version
main
Related scenes
- [X] Chat Data
- [ ] Chat Excel
- [ ] Chat DB
- [ ] Chat Knowledge
- [ ] Model Management
- [ ] Dashboard
- [ ] Plugins
Installation Information
- [ ] AutoDL Image
- [ ] Other
Device information
T4 GPU; GPU count: 1; VRAM: 15 GB
Models information
vicuna-13b-v1.5 (`"load_in_4bit": true`), text2vec-large-chinese
What happened
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [02:30<00:00, 50.26s/it]
/home/miniconda3/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/home/miniconda3/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
2023-10-13 08:57:01 k161ae pilot.model.loader[4436] INFO Current model is type of: LlamaForCausalLM, load tokenizer by LlamaTokenizer
2023-10-13 08:57:01 k161ae pilot.model.cluster.worker.manager[4436] ERROR Error starting worker manager: expected str, bytes or os.PathLike object, not NoneType
2023-10-13 08:57:01 k161ae asyncio[4436] ERROR Task exception was never retrieved
future: <Task finished name='Task-3' coro=<_setup_fastapi.
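A note on the two `UserWarning`s in the log: they are benign on their own. They only mean that `temperature` and `top_p` in the model's `generation_config.json` are ignored while `do_sample` is `False`. A minimal pure-Python sketch of the consistency check transformers performs (illustrative, not DB-GPT or transformers code):

```python
def check_generation_config(cfg: dict) -> list[str]:
    """Flag sampling-only knobs that are ignored when do_sample is False,
    mirroring the transformers warning quoted in the log above."""
    problems = []
    if not cfg.get("do_sample", False):
        for knob in ("temperature", "top_p"):
            if knob in cfg:
                problems.append(
                    f"`{knob}` is set to {cfg[knob]} but only used when do_sample=True"
                )
    return problems

# The config vicuna-13b-v1.5 ships with triggers both warnings:
print(check_generation_config({"do_sample": False, "temperature": 0.9, "top_p": 0.6}))
```

So these warnings are unrelated to the startup failure; the real error is the `NoneType` path error on the next line.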
What you expected to happen
1. Is the T4 GPU running out of VRAM?
How to reproduce
Run: `python /home/DB-GPT/pilot/server/llmserver.py`
Additional context
No response
Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
@gantao21 Hi, I'd suggest first trying a 7B model, either with quantization off or with 8-bit quantization on. Also pull the latest main branch code and deploy according to the latest documentation.
If you still have problems, you can run the command `dbgpt trace chat` to export the relevant information, which will make it easier for us to troubleshoot together.
Has this issue been resolved? I hit the same problem running on a V100.
That is a VRAM shortage issue; try the suggestions from my comment above: start with a 7B model, with quantization off or 8-bit quantization on, pull the latest main branch code, and deploy per the latest documentation.
If problems remain, run `dbgpt trace chat` to export the relevant information so we can troubleshoot together.
vicuna-7b works; Baichuan-7B does not.
OK, I'll try other models. What puzzles me is: the V100 has 32 GB of VRAM, and the model only occupied about 40% of it at startup. I'm deploying version 0.4.0 now; version 0.3.4 could run 13b-v1.5 on the same machine.
That doesn't sound right. Could you send the details? I'd like to look at the information exported by `dbgpt trace chat`.
| Config Key (Webserver) | Config Value (Webserver) |
|---|---|
| host | 0.0.0.0 |
| port | 5000 |
| daemon | False |
| controller_addr | None |
| model_name | None |
| share | False |
| remote_embedding | False |
| log_level | None |
| light | False |
| log_file | dbgpt_webserver.log |
| tracer_file | dbgpt_webserver_tracer.jsonl |

| Config Key (EmbeddingModel) | Config Value (EmbeddingModel) |
|---|---|
| model_name | text2vec |
| model_path | DBGPT_v0.4.0/DB-GPT-0.4.0/models/text2vec-large-chinese |
| device | cuda |
| normalize_embeddings | None |

| Config Key (ModelWorker) | Config Value (ModelWorker) |
|---|---|
| model_name | vicuna-13b-v1.5 |
| model_path | /DB-GPT-0.4.0/models/vicuna-13b-v1.5 |
| device | cuda |
| model_type | huggingface |
| prompt_template | None |
| max_context_size | 4096 |
| num_gpus | None |
| max_gpu_memory | None |
| cpu_offloading | False |
| load_8bit | True |
| load_4bit | False |
| quant_type | nf4 |
| use_double_quant | True |
| compute_dtype | None |
| trust_remote_code | True |
| verbose | False |

| Config Key (WorkerManager) | Config Value (WorkerManager) |
|---|---|
| model_name | vicuna-13b-v1.5 |
| model_path | /DBGPT_v0.4.0/DB-GPT-0.4.0/models/vicuna-13b-v1.5 |
| worker_type | None |
| worker_class | None |
| model_type | huggingface |
| host | 0.0.0.0 |
| port | 5000 |
| daemon | False |
| limit_model_concurrency | 5 |
| standalone | True |
| register | True |
| worker_register_host | None |
| controller_addr | http://127.0.0.1:5000 |
| send_heartbeat | True |
| heartbeat_interval | 20 |
| log_level | None |
| log_file | dbgpt_model_worker_manager.log |
| tracer_file | dbgpt_model_worker_manager_tracer.jsonl |

ModelWorker System information:

| System Config Key | System Config Value |
|---|---|
| platform | linux |
| python_version | 3.10.13 |
| cpu | Intel(R) Xeon(R) Gold 6278C CPU @ 2.60GHz |
| cpu_avx | AVX512 |
| memory | 263601180 kB |
| torch_version | 2.0.1+cu117 |
| device | cuda |
| device_version | 11.7 |
| device_count | 4 |
| device_other | 4x Tesla V100S-PCIE-32GB, driver 495.29.05, 32510 MiB total / 32506 MiB free / 4 MiB used each |
@lv-stupidboy So vicuna-13b-v1.5 starts normally, but you hit out-of-memory exceptions in some usage scenarios? I see you have already enabled 8-bit quantization; with 32 GB of VRAM, a single user shouldn't be able to exhaust it.
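Rough back-of-envelope arithmetic for the weights alone (KV cache and activations come on top) supports that view; these are approximations, not measured numbers:

```python
# Approximate VRAM needed just to hold 13B parameters at different precisions.
params = 13e9
GiB = 2**30

fp16_gib = params * 2 / GiB    # ~24.2 GiB: tight on a single 32 GB V100
int8_gib = params * 1 / GiB    # ~12.1 GiB: comfortable with load_8bit=True
nf4_gib  = params * 0.5 / GiB  # ~6.1 GiB: why 4-bit was tried on the 15 GB T4
print(f"fp16~{fp16_gib:.1f} GiB, int8~{int8_gib:.1f} GiB, nf4~{nf4_gib:.1f} GiB")
```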
But now, at service startup, I see exactly the same symptom as the user above: the service fails to start. `Exception: model vicuna-13b-v1.5@huggingface(xx.xx.xx.xx:7860) start failed, All connection attempts failed`
It's hard to reproduce your problem at the moment. I see you have four GPUs; consider using all of them. Try turning quantization off and forcing a maximum GPU memory limit per card:

```
QUANTIZE_8bit=False
QUANTIZE_4bit=False
CUDA_VISIBLE_DEVICES=0,1,2,3
MAX_GPU_MEMORY=8Gib
```
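For context, a per-card cap like `MAX_GPU_MEMORY=8Gib` typically ends up as a `max_memory` dict handed to Hugging Face `from_pretrained(device_map="auto", ...)`, so accelerate shards layers without exceeding any card's budget. A hedged sketch (the helper name is mine, not DB-GPT's):

```python
def build_max_memory(num_gpus: int, cap: str = "8GiB") -> dict:
    # One entry per visible GPU index, each capped at the same budget;
    # the device_map="auto" planner then keeps every card under its cap.
    return {i: cap for i in range(num_gpus)}

# With CUDA_VISIBLE_DEVICES=0,1,2,3 this would yield {0: "8GiB", ..., 3: "8GiB"}:
mm = build_max_memory(4)
# e.g. AutoModelForCausalLM.from_pretrained(path, device_map="auto", max_memory=mm)
```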
@fangyinc I adjusted the configuration, but the behavior is still the same:

```
envs/dbgpt040/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
envs/dbgpt040/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
ERROR [pilot.model.cluster.worker.manager] Error starting worker manager: model vicuna-13b-v1.5@huggingface(xx.xx.xx.xx:7860) start failed, All connection attempts failed
ERROR [asyncio] Task exception was never retrieved
future: <Task finished name='Task-3' coro=<_setup_fastapi.<locals>.startup_event.<locals>.start_worker_manager() done, defined at DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py:758> exception=SystemExit(1)>
Traceback (most recent call last):
  File "DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 760, in start_worker_manager
    await worker_manager.start()
  File "DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 578, in start
    return await self.worker_manager.start()
  File "DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 116, in start
    raise Exception(out.message)
Exception: model vicuna-13b-v1.5@huggingface(xx.xx.xx.xx:7860) start failed, All connection attempts failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "uvloop/loop.pyx", line 474, in uvloop.loop.Loop._on_idle
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 763, in start_worker_manager
    sys.exit(1)
SystemExit: 1
INFO [pilot.model.cluster.worker.manager] Stop all workers
INFO [pilot.model.cluster.worker.manager] Apply req: None, apply_func: <function LocalWorkerManager._stop_all_worker.<locals>._stop_worker at 0x7f93b03ffe20>
INFO [pilot.model.cluster.worker.manager] Apply to all workers
WARNI [pilot.model.cluster.worker.manager] Stop worker, ignored exception from deregister_func: All connection attempts failed
INFO [pilot.utils.model_utils] Clear torch cache of device: cuda:0
INFO [pilot.utils.model_utils] Clear torch cache of device: cuda:1
INFO [pilot.utils.model_utils] Clear torch cache of device: cuda:2
INFO [pilot.utils.model_utils] Clear torch cache of device: cuda:3
WARNI [pilot.model.cluster.worker.manager] Stop worker, ignored exception from deregister_func: All connection attempts failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "envs/dbgpt040/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1511, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1504, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1377, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 555, in uvloop.loop.Loop._run
  File "uvloop/loop.pyx", line 474, in uvloop.loop.Loop._on_idle
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 763, in start_worker_manager
    sys.exit(1)
SystemExit: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "envs/dbgpt040/lib/python3.10/site-packages/starlette/routing.py", line 686, in lifespan
    await receive()
  File "envs/dbgpt040/lib/python3.10/site-packages/uvicorn/lifespan/on.py", line 137, in receive
    return await self.receive_queue.get()
  File "envs/dbgpt040/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError
```
Could you also share the other errors? There should be a more specific cause logged before these generic errors.
All of the logs printed at service startup are below; please take a look: `# python pilot/server/dbgpt_server.py --host xx.xx.xx.xx --port 7860`
=========================== WebWerverParameters ===========================
host: xx.xx.xx.xx
port: 7860
daemon: False
controller_addr: None
model_name: None
share: False
remote_embedding: False
log_level: INFO
light: False
log_file: dbgpt_webserver.log
tracer_file: dbgpt_webserver_tracer.jsonl
======================================================================
4e05d94b5799 (head)
heads:None
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
Generating DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/meta_data/alembic/versions/91c18e894c6e_dbgpt_ddl_upate.py ... done
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
INFO [alembic.runtime.migration] Running upgrade 4e05d94b5799 -> 91c18e894c6e, dbgpt ddl upate
INFO [pilot.model.cluster.worker.embedding_worker] [EmbeddingsModelWorker] Parameters of device is None, use cuda
WARNI [sentence_transformers.SentenceTransformer] No sentence-transformers model found with name DBGPT_v0.4.0/DB-GPT-0.4.0/models/text2vec-large-chinese. Creating a new one with MEAN pooling.
Model Unified Deployment Mode!
INFO: Started server process [1872205]
INFO: Waiting for application startup.
INFO [pilot.model.cluster.worker.manager] Begin start all worker, apply_req: None
INFO [pilot.model.cluster.worker.manager] Apply req: None, apply_func: <function LocalWorkerManager._start_all_worker.
=========================== ModelParameters ===========================
model_name: vicuna-13b-v1.5
model_path: DBGPT_v0.4.0/DB-GPT-0.4.0/models/vicuna-13b-v1.5
device: cuda
model_type: huggingface
prompt_template: None
max_context_size: 4096
num_gpus: None
max_gpu_memory: 8Gib
cpu_offloading: False
load_8bit: False
load_4bit: False
quant_type: nf4
use_double_quant: True
compute_dtype: None
trust_remote_code: True
verbose: False
======================================================================
INFO: Uvicorn running on http://xx.xx.xx.xx:7860 (Press CTRL+C to quit)
INFO [pilot.model.loader] There has max_gpu_memory from config: 8Gib
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:18<00:00, 6.27s/it]
/envs/dbgpt040/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/envs/dbgpt040/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
ERROR [pilot.model.cluster.worker.manager] Error starting worker manager: model vicuna-13b-v1.5@huggingface(xx.xx.xx.xx:7860) start failed, All connection attempts failed
ERROR [asyncio] Task exception was never retrieved
future: <Task finished name='Task-3' coro=<_setup_fastapi.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "uvloop/loop.pyx", line 474, in uvloop.loop.Loop._on_idle
File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/DBGPT_v0.4.0/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 763, in start_worker_manager
sys.exit(1)
SystemExit: 1
INFO [pilot.model.cluster.worker.manager] Stop all workers
INFO [pilot.model.cluster.worker.manager] Apply req: None, apply_func: <function LocalWorkerManager._stop_all_worker.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/envs/dbgpt040/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1511, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1504, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1377, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 555, in uvloop.loop.Loop._run
  File "uvloop/loop.pyx", line 474, in uvloop.loop.Loop._on_idle
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/DB-GPT-0.4.0/pilot/model/cluster/worker/manager.py", line 763, in start_worker_manager
    sys.exit(1)
SystemExit: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/envs/dbgpt040/lib/python3.10/site-packages/starlette/routing.py", line 686, in lifespan
    await receive()
  File "/envs/dbgpt040/lib/python3.10/site-packages/uvicorn/lifespan/on.py", line 137, in receive
    return await self.receive_queue.get()
  File "/envs/dbgpt040/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError
@lv-stupidboy Hi, in the startup command `python pilot/server/dbgpt_server.py --host xx.xx.xx.xx --port 7860`, is `xx.xx.xx.xx` something you substituted in? I see you are starting in single-machine mode; in your scenario this parameter can be left out. The default is `0.0.0.0`, which means listening on all of the machine's IP addresses.
@fangyinc Nothing special, I masked the IP before posting. I added the host and IP because the service runs on a Linux machine and I need to access the web service via ip:port; with the default parameters I couldn't reach it from the web.
So does it start normally when you don't pass the `--host` parameter?
This problem is most likely related to the IP address you pass via `--host`. In unified deployment mode, DB-GPT starts several components by default, and one of them needs an `http://127.0.0.1:port` address for communication. Because you started with an explicit host, `http://127.0.0.1:port` can no longer be used for normal inter-service communication.
In principle, `--host 0.0.0.0` already listens on all local addresses; once the service is up, you can definitely reach it in a browser at `http://ip:port` (just be careful not to browse to `http://0.0.0.0:port`).
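The bind-address point can be demonstrated with a tiny stdlib sketch: a server bound to `0.0.0.0` accepts connections on any local address, including `127.0.0.1`, whereas a server bound to one external IP would not be reachable over loopback, which is why forcing `--host` breaks the inter-component calls:

```python
import socket

# Bind to the wildcard address; port 0 lets the OS pick a free port.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("0.0.0.0", 0))
srv.listen(1)
port = srv.getsockname()[1]

# A loopback client can still connect, because 0.0.0.0 covers 127.0.0.1.
cli = socket.create_connection(("127.0.0.1", port))
conn, _ = srv.accept()
cli.close(); conn.close(); srv.close()
```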
Is the default port 7860? When I start without the parameter, there is no error message and the process stays alive, but port 7860 is not in a listening state, and I can't access the web UI from a browser.
The service probably did start normally, but the web UI can't be reached; there may be a firewall or a port whitelist restricting access.
@lv-stupidboy Has the web access problem been resolved?
After a few more tries, the service started normally. The startup command is now `python dbgpt_server.py --port 7860`.
This issue has been marked as stale, because it has been over 30 days without any activity.
This issue has been closed, because it has been marked as stale and there has been no activity for over 7 days.