[BUG] docker logs keeps reporting that Triton is starting

Open misslxs opened this issue 1 year ago • 5 comments

Is there an existing issue / discussion for this?

  • [X] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • [X] I have searched FAQ

Current Behavior

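Judging from the logs, every service started successfully, but the readiness check curl -s -w "%{http_code}" http://localhost:10000/v2/health/ready -o /dev/null never passes. After the timeout, once the container has stopped, the log file /model_repos/QAEnsemble_base/QAEnsemble_base.log does not exist either.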
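The readiness check described above can be reproduced outside the startup script. The following is a minimal sketch, assuming the endpoint quoted in the report; the function names, the retry count, and the delay are hypothetical and are not taken from the QAnything scripts.

```python
import time
import urllib.error
import urllib.request

# Endpoint quoted in the report; adjust if your port mapping differs.
READY_URL = "http://localhost:10000/v2/health/ready"


def http_status(url: str, timeout_s: float = 2.0) -> int:
    """Return the HTTP status code for url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 400 or 503).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return 0


def wait_until_ready(probe=http_status, url=READY_URL, attempts=30,
                     delay_s=2.0, sleep=time.sleep) -> bool:
    """Poll until probe(url) returns 200; give up after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        status = probe(url)
        print(f"attempt {attempt}: HTTP {status}")
        if status == 200:
            return True
        sleep(delay_s)
    return False
```

Calling wait_until_ready() with no arguments polls the URL above. A status of 0 on every attempt suggests that nothing is listening on the port at all, which points at the server process having exited rather than at slow model loading.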

[Screenshot: iShot_2024-01-19_09 30 13]

Expected Behavior

No response

Environment

- OS: Ubuntu 22.04 x86
- NVIDIA Driver: 535.146.02
- CUDA: 12.2
- Docker Compose: v2.24.0-birthday.10
- NVIDIA GPU Memory: 16GB

QAnything logs

root@f1376869a3c5:/workspace/qanything_local# cat api.log
UPLOAD_ROOT_PATH: /workspace/qanything_local/QANY_DB/content
rerank_port: 10001
embed_port: 10001
[2024-01-19 09:56:17 +0800] [91] [INFO] Sanic v23.6.0
[2024-01-19 09:56:17 +0800] [91] [INFO] Goin' Fast @ http://0.0.0.0:8777
[2024-01-19 09:56:17 +0800] [91] [INFO] mode: production, w/ 4 workers
[2024-01-19 09:56:17 +0800] [91] [INFO] server: sanic, HTTP/1.1
[2024-01-19 09:56:17 +0800] [91] [INFO] python: 3.10.12
[2024-01-19 09:56:17 +0800] [91] [INFO] platform: Linux-6.5.0-14-generic-x86_64-with-glibc2.35
[2024-01-19 09:56:17 +0800] [91] [INFO] packages: sanic-routing==23.12.0, sanic-ext==23.6.0
UPLOAD_ROOT_PATH: /workspace/qanything_local/QANY_DB/content
rerank_port: 10001
embed_port: 10001
[2024-01-19 09:56:27 +0800] [658] [INFO] Sanic Extensions:
[2024-01-19 09:56:27 +0800] [658] [INFO] > injection [0 dependencies; 0 constants]
[2024-01-19 09:56:27 +0800] [658] [INFO] > openapi [http://0.0.0.0:8777/docs]
[2024-01-19 09:56:27 +0800] [658] [INFO] > http
[2024-01-19 09:56:27 +0800] [658] [INFO] > templating [jinja2==3.1.3]
UPLOAD_ROOT_PATH: /workspace/qanything_local/QANY_DB/content
rerank_port: 10001
embed_port: 10001
[2024-01-19 09:56:27 +0800] [657] [INFO] Sanic Extensions:
[2024-01-19 09:56:27 +0800] [657] [INFO] > injection [0 dependencies; 0 constants]
[2024-01-19 09:56:27 +0800] [657] [INFO] > openapi [http://0.0.0.0:8777/docs]
[2024-01-19 09:56:27 +0800] [657] [INFO] > http
[2024-01-19 09:56:27 +0800] [657] [INFO] > templating [jinja2==3.1.3]
UPLOAD_ROOT_PATH: /workspace/qanything_local/QANY_DB/content
rerank_port: 10001
embed_port: 10001
[2024-01-19 09:56:27 +0800] [659] [INFO] Sanic Extensions:
[2024-01-19 09:56:27 +0800] [659] [INFO] > injection [0 dependencies; 0 constants]
[2024-01-19 09:56:27 +0800] [659] [INFO] > openapi [http://0.0.0.0:8777/docs]
[2024-01-19 09:56:27 +0800] [659] [INFO] > http
[2024-01-19 09:56:27 +0800] [659] [INFO] > templating [jinja2==3.1.3]
init local_doc_qa in local
init local_doc_qa in local
UPLOAD_ROOT_PATH: /workspace/qanything_local/QANY_DB/content
rerank_port: 10001
embed_port: 10001
[2024-01-19 09:56:27 +0800] [660] [INFO] Sanic Extensions:
[2024-01-19 09:56:27 +0800] [660] [INFO] > injection [0 dependencies; 0 constants]
[2024-01-19 09:56:27 +0800] [660] [INFO] > openapi [http://0.0.0.0:8777/docs]
[2024-01-19 09:56:27 +0800] [660] [INFO] > http
[2024-01-19 09:56:27 +0800] [660] [INFO] > templating [jinja2==3.1.3]
init local_doc_qa in local
init local_doc_qa in local
[2024-01-19 09:56:27 +0800] [658] [INFO] Starting worker [658]
[2024-01-19 09:56:27 +0800] [657] [INFO] Starting worker [657]
[2024-01-19 09:56:27 +0800] [659] [INFO] Starting worker [659]
[2024-01-19 09:56:27 +0800] [660] [INFO] Starting worker [660]

Steps To Reproduce

No response

Anything else?

No response

misslxs • Jan 19 '24 02:01

To add to this: I have the same problem as the original poster. Here is my QAEnsemble.log.

I0119 02:05:18.197207 86 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f9e5c000000' with size 268435456
I0119 02:05:18.201188 86 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0119 02:05:18.208520 86 model_lifecycle.cc:462] loading: rerank:1
I0119 02:05:18.208561 86 model_lifecycle.cc:462] loading: embed:1
I0119 02:05:18.208588 86 model_lifecycle.cc:462] loading: base:1
I0119 02:05:18.211636 86 onnxruntime.cc:2504] TRITONBACKEND_Initialize: onnxruntime
I0119 02:05:18.211702 86 onnxruntime.cc:2514] Triton TRITONBACKEND API version: 1.12
I0119 02:05:18.211721 86 onnxruntime.cc:2520] 'onnxruntime' TRITONBACKEND API version: 1.12
I0119 02:05:18.211736 86 onnxruntime.cc:2550] backend configuration: {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0119 02:05:18.277019 86 onnxruntime.cc:2608] TRITONBACKEND_ModelInitialize: rerank (version 1)
I0119 02:05:18.277589 86 onnxruntime.cc:2608] TRITONBACKEND_ModelInitialize: embed (version 1)
I0119 02:05:18.277767 86 onnxruntime.cc:666] skipping model configuration auto-complete for 'rerank': inputs and outputs already specified
I0119 02:05:18.278371 86 onnxruntime.cc:2651] TRITONBACKEND_ModelInstanceInitialize: rerank (GPU device 0)
I0119 02:05:18.278735 86 onnxruntime.cc:666] skipping model configuration auto-complete for 'embed': inputs and outputs already specified
I0119 02:05:18.280363 86 onnxruntime.cc:2651] TRITONBACKEND_ModelInstanceInitialize: embed (GPU device 0)
I0119 02:05:18.758885 86 libfastertransformer.cc:459] Before Loading Weights:
terminate called after throwing an instance of 'std::length_error'
what(): basic_string::_M_create
[d46a4f8365f8:00086] *** Process received signal ***
[d46a4f8365f8:00086] Signal: Aborted (6)
[d46a4f8365f8:00086] Signal code: (-6)
[d46a4f8365f8:00086] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f9eab095520]
[d46a4f8365f8:00086] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f9eab0e99fc]
[d46a4f8365f8:00086] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f9eab095476]
[d46a4f8365f8:00086] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f9eab07b7f3]
[d46a4f8365f8:00086] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f9eab31db9e]
[d46a4f8365f8:00086] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f9eab32920c]
[d46a4f8365f8:00086] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277)[0x7f9eab329277]
[d46a4f8365f8:00086] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8)[0x7f9eab3294d8]
[d46a4f8365f8:00086] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x40)[0x7f9eab320449]
[d46a4f8365f8:00086] [ 9] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x14bc69)[0x7f9eab3c6c69]
[d46a4f8365f8:00086] [10] /opt/tritonserver/backends/qa_ensemble/libqa-shared.so(+0xa6ba3c)[0x7f9e1dbf2a3c]
[d46a4f8365f8:00086] [11] /opt/tritonserver/backends/qa_ensemble/libqa-shared.so(_ZN17fastertransformer21loadWeightFromBinFuncI6__halfS1_EEiPT_St6vectorImSaImEENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x187)[0x7f9e1dc0b227]
[d46a4f8365f8:00086] [12] /opt/tritonserver/backends/qa_ensemble/libqa-shared.so(_ZN17fastertransformer17loadWeightFromBinI6__halfEEiPT_St6vectorImSaImEENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_14FtCudaDataTypeE+0x282)[0x7f9e1dc0ed12]
[d46a4f8365f8:00086] [13] /opt/tritonserver/backends/qa_ensemble/libqa-shared.so(_ZN17fastertransformer11LlamaWeightI6__halfE16loadEncryptModelENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x184)[0x7f9e1d7cb0b4]
[d46a4f8365f8:00086] [14] /opt/tritonserver/backends/qa_ensemble/libqa-shared.so(_ZN16LlamaTritonModelI6__halfE19createSharedWeightsEii+0x2ad)[0x7f9e1d7b219d]
[d46a4f8365f8:00086] [15] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f9eab357253]
[d46a4f8365f8:00086] [16] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f9eab0e7ac3]
[d46a4f8365f8:00086] [17] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660)[0x7f9eab179660]
[d46a4f8365f8:00086] *** End of error message ***
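Two distinct failure signatures appear in this thread: an uncaught C++ exception while loading weights, and a missing tritonserver binary. As a rough sketch (the function name and the labels are hypothetical, and the patterns are copied from the log excerpts quoted here rather than from any official list of Triton errors), a log can be sorted into one of the two cases like this:

```python
import re

# Patterns taken from the two log excerpts quoted in this thread.
# The labels are informal descriptions, not official error names.
SIGNATURES = [
    (re.compile(r"failed to run command '([^']+)': No such file or directory"),
     "command not found in container"),
    (re.compile(r"terminate called after throwing an instance of '([^']+)'"),
     "uncaught C++ exception"),
]


def classify_log(text: str):
    """Return (label, detail) for the first known signature in text, else None."""
    for pattern, label in SIGNATURES:
        match = pattern.search(text)
        if match:
            return label, match.group(1)
    return None
```

For the log above, classify_log would report an uncaught C++ exception with the detail std::length_error. The stack frames that follow it (loadWeightFromBin, loadEncryptModel) place the abort in model weight loading.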

YinSonglin1997 • Jan 19 '24 02:01

The Triton service also reports a startup failure for me. Inside the container, /model_repos/QAEnsemble_base/QAEnsemble_base.log contains: nohup: failed to run command '/opt/tritonserver/bin/tritonserver': No such file or directory

yydxlv • Jan 19 '24 03:01

Could you post the complete log file? That would make troubleshooting easier. Please also take a look at FAQ_zh.md; it may help.

xixihahaliu • Jan 19 '24 10:01

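  • Cause 2: If GPU memory turns out to be sufficient, the cause is that the new model version is incompatible with some GPU models.
  • Solution: switch to a compatible model and image. Manually download the model files, unzip them, replace the models directory, and then restart the service.
    • In docker-compose-xxx.yaml, change freeren/qanyxxx:v1.0.9 to freeren/qanyxxx:v1.0.8
    • git clone https://www.wisemodel.cn/Netease_Youdao/qanything.git
    • cd qanything
    • git reset --hard 79b3da3bbb35406f0b2da3acfcdb4c96c2837faf
    • unzip models.zip
    • Replace the existing models directory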
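The first step of the workaround above, pinning the image tag back from v1.0.9 to v1.0.8, can be scripted. This is a minimal sketch with a hypothetical helper name. It assumes the image references in the compose file have the form freeren/<name>:<tag> as written in the FAQ excerpt, and it leaves the elided parts (the xxx in the file and image names) for the reader to fill in.

```python
import re


def pin_image_tag(compose_text: str, old_tag: str = "v1.0.9",
                  new_tag: str = "v1.0.8") -> str:
    """Replace old_tag with new_tag on every freeren/* image reference.

    Images from other namespaces and longer tags that merely start with
    old_tag (for example v1.0.90) are left untouched.
    """
    pattern = re.compile(
        r"(freeren/[\w.\-]+:)" + re.escape(old_tag) + r"(?![\w.\-])"
    )
    # A function replacement avoids any backslash interpretation in new_tag.
    return pattern.sub(lambda m: m.group(1) + new_tag, compose_text)
```

A typical use would be to read the compose file, pass its text through pin_image_tag, and write the result back before restarting the service. The remaining steps (git clone, git reset, unzip, replacing the models directory) are plain shell commands and are best run as listed.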

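You can try the solution above. Also, some GPU models do not support the current model, so please check in advance. Provided there is enough GPU memory, the GPUs confirmed to be supported so far are the Nvidia 2080Ti, 30 series, 40 series, A30, A40, and A100.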

xixihahaliu • Jan 19 '24 11:01

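At the moment the log files are in different locations for single-GPU and dual-GPU launches, because in a single-GPU launch the multiple tritonserver services are started together to save GPU memory. From what I can see, you appear to be running a single-GPU launch, so please post the full contents of /model_repos/QAEnsemble/QAEnsemble.log; it should contain more information.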
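Because the log location depends on the launch mode, it can help to check both paths mentioned in this thread and print the end of whichever file exists. The sketch below uses hypothetical helper names. Treating the QAEnsemble_base path as the dual-GPU location is an inference from the reply above, not something the thread states outright.

```python
from pathlib import Path

# Log paths mentioned in this thread; which one exists depends on the launch mode.
CANDIDATE_LOGS = [
    Path("/model_repos/QAEnsemble/QAEnsemble.log"),            # single-GPU launch
    Path("/model_repos/QAEnsemble_base/QAEnsemble_base.log"),  # presumably dual-GPU
]


def find_logs(candidates=CANDIDATE_LOGS):
    """Return the candidate log files that actually exist, in the order given."""
    return [path for path in candidates if path.is_file()]


def tail(path: Path, lines: int = 40) -> str:
    """Return the last `lines` lines of a text file, tolerating undecodable bytes."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return "\n".join(text.splitlines()[-lines:])
```

Run inside the container, looping over find_logs() and printing tail(path) for each result gives the part of the log that usually matters for a crash at startup. An empty list from find_logs() means neither file was ever written.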

xixihahaliu • Jan 19 '24 11:01