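Deployment fails, but neither the page nor the log files show any log output
Describe the bug
A deployment can fail without any log message appearing on the page or in the log files.
Environment
CSGHub Version: v0.12.0
OS: Linux (openEuler 22.03)
Hardware: 16c32g
Launch: csghub deployed with docker compose + runner service deployed with helm
On the page: the deployment failed and the log is empty.
Runner log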
10.42.0.1 - - [21/Nov/2025:04:14:19 +0000] "GET /api/v1/cluster/f90e1c6c-799d-4c24-a342-ee900e0b4950 HTTP/1.1" 200 441 "-" "Go-http-client/1.1" 343 0.014 [csghub-runner-runner-8082] [] 10.42.0.240:8082 441 0.014 200 b0ba5ff56d8dce7c7771c2a5e6758497
10.42.0.1 - - [21/Nov/2025:04:14:19 +0000] "POST /api/v1/service/f4ragh1harr4/run HTTP/1.1" 200 42 "-" "Go-http-client/1.1" 1358 0.038 [csghub-runner-runner-8082] [] 10.42.0.240:8082 42 0.038 200 801560dee4f3172d0cbfdc872ef66e39
10.42.0.1 - - [21/Nov/2025:04:14:19 +0000] "GET /api/v1/cluster/f90e1c6c-799d-4c24-a342-ee900e0b4950 HTTP/1.1" 200 441 "-" "Go-http-client/1.1" 343 0.011 [csghub-runner-runner-8082] [] 10.42.0.240:8082 441 0.011 200 d26197c3d7e931cd5f79b0cd1a41dd14
10.42.0.1 - - [21/Nov/2025:04:14:19 +0000] "GET /api/v1/service/f4ragh1harr4/replica HTTP/1.1" 200 104 "-" "Go-http-client/1.1" 525 0.002 [csghub-runner-runner-8082] [] 10.42.0.240:8082 104 0.002 200 e538222e7f6b8de205fa9b6c32749db1
10.42.0.1 - - [21/Nov/2025:04:14:19 +0000] "GET /api/v1/service/f4ragh1harr4/get HTTP/1.1" 200 221 "-" "Go-http-client/1.1" 459 0.001 [csghub-runner-runner-8082] [] 10.42.0.240:8082 221 0.001 200 4c2e3b041070890089d18e41f7abffed
10.42.0.1 - - [21/Nov/2025:04:14:25 +0000] "GET /api/v1/service/f4ragh1harr4/get HTTP/1.1" 200 221 "-" "Go-http-client/1.1" 459 0.002 [csghub-runner-runner-8082] [] 10.42.0.240:8082 221 0.002 200 dca2c6ba2a511dd8aa3d932138f0308c
10.42.0.1 - - [21/Nov/2025:04:14:30 +0000] "GET /api/v1/service/f4ragh1harr4/get HTTP/1.1" 200 260 "-" "Go-http-client/1.1" 459 0.002 [csghub-runner-runner-8082] [] 10.42.0.240:8082 260 0.002 200 7a886b7e25a6d64b81baa848de18e289
10.42.0.1 - - [21/Nov/2025:04:14:30 +0000] "GET /api/v1/service/f4ragh1harr4/replica HTTP/1.1" 200 104 "-" "Go-http-client/1.1" 525 0.001 [csghub-runner-runner-8082] [] 10.42.0.240:8082 104 0.001 200 fb4f6b5d95ac1ba1b818da26e08bc126
10.42.0.1 - - [21/Nov/2025:04:14:30 +0000] "GET /api/v1/service/f4ragh1harr4/get HTTP/1.1" 200 260 "-" "Go-http-client/1.1" 459 0.002 [csghub-runner-runner-8082] [] 10.42.0.240:8082 260 0.001 200 25d6228776d5908717a369cbf97f3ec7
10.42.0.1 - - [21/Nov/2025:04:14:35 +0000] "GET /api/v1/service/f4ragh1harr4/get HTTP/1.1" 200 260 "-" "Go-http-client/1.1" 459 0.002 [csghub-runner-runner-8082] [] 10.42.0.240:8082 260 0.002 200 1e7ed06f23aadd06630326466b704499
10.42.0.1 - - [21/Nov/2025:04:14:40 +0000] "GET /api/v1/service/f4ragh1harr4/get HTTP/1.1" 200 260 "-" "Go-http-client/1.1" 459 0.001 [csghub-runner-runner-8082] [] 10.42.0.240:8082 260 0.001 200 8ac1353e82a7a9db980d5ff012ebcf6a
Server log
2025-11-21_04:14:19.10267 {"time":"2025-11-21T04:14:19.102607249Z","level":"WARN","msg":"Log entry dropped because log collector is not ready"}
2025-11-21_04:14:19.10278 {"time":"2025-11-21T04:14:19.102735877Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"POST","latency(ms)":31,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/models/root/Qwen2.5-0.5B-Instruct/run","full_path":"/api/v1/models/:namespace/:name/run","trace_id":"47f3d81e-162d-4c12-aaee-199a350a60d0"}
2025-11-21_04:14:19.14949 {"time":"2025-11-21T04:14:19.149432572Z","level":"INFO","msg":"http request","ip":"172.23.0.1","method":"GET","latency(ms)":3,"status":200,"current_user":"","auth_type":"ApiKey","url":"/api/v1/user/63b7c9d1-b177-4116-852a-3ecd9467372d?type=uuid","full_path":"/api/v1/user/:username","trace_id":"35839a94-9b45-4fe2-a4b5-d7da7d6734f7"}
2025-11-21_04:14:19.24657 {"time":"2025-11-21T04:14:19.246470323Z","level":"INFO","msg":"http request","ip":"172.23.0.1","method":"POST","latency(ms)":1,"status":200,"current_user":"","auth_type":"ApiKey","url":"/api/v1/webhook/runner","full_path":"/api/v1/webhook/runner","trace_id":"77e472f40e9949228166aa59daafa340"}
2025-11-21_04:14:19.24660 {"time":"2025-11-21T04:14:19.246527499Z","level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.create","event_time":1763698459,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"service_name":"f4ragh1harr4","status":20,"endpoint":"","message":"","reason":"create","task_id":6}}}
2025-11-21_04:14:19.26470 {"time":"2025-11-21T04:14:19.264563141Z","level":"INFO","msg":"http request","ip":"172.23.0.1","method":"POST","latency(ms)":0,"status":200,"current_user":"","auth_type":"ApiKey","url":"/api/v1/webhook/runner","full_path":"/api/v1/webhook/runner","trace_id":"65283d1082f24c98b9859d761884f275"}
2025-11-21_04:14:19.26507 {"time":"2025-11-21T04:14:19.264911781Z","level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.create","event_time":1763698459,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"service_name":"f4ragh1harr4","status":20,"endpoint":"","message":"","reason":"create","task_id":6}}}
2025-11-21_04:14:19.75873 {"time":"2025-11-21T04:14:19.758653862Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":0,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/broadcasts/active","full_path":"/api/v1/broadcasts/active","trace_id":"2a40d51c-c7d8-4810-8207-7a0a95902fb4"}
2025-11-21_04:14:19.76079 {"time":"2025-11-21T04:14:19.760746585Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":2,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/user/63b7c9d1-b177-4116-852a-3ecd9467372d?type=uuid","full_path":"/api/v1/user/:username","trace_id":"7b3f5854-7ff3-470c-bc2b-a6561eb2d06b"}
2025-11-21_04:14:19.77138 {"time":"2025-11-21T04:14:19.771327313Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":5,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/tags","full_path":"/api/v1/tags","trace_id":"9ef0a2ea-b06b-4fac-b192-e2cd21c3aab9"}
2025-11-21_04:14:19.77182 {"time":"2025-11-21T04:14:19.771782566Z","level":"INFO","msg":"Get space resources successfully"}
2025-11-21_04:14:19.77189 {"time":"2025-11-21T04:14:19.771853984Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":13,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/space_resources?cluster_id=","full_path":"/api/v1/space_resources","trace_id":"662904e7-e91b-45d1-8d72-3ded96caf12e"}
2025-11-21_04:14:19.77551 {"time":"2025-11-21T04:14:19.775471155Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":17,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/models/root/Qwen2.5-0.5B-Instruct/run/2","full_path":"/api/v1/models/:namespace/:name/run/:id","trace_id":"8846928a-70c9-4683-92e8-2083776f9972"}
2025-11-21_04:14:20.06438 {"time":"2025-11-21T04:14:20.064213382Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":1,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/notifications/message-types","full_path":"/api/v1/notifications/message-types","trace_id":"417f0b2c-780e-471b-988d-9fb0f133d485"}
2025-11-21_04:14:20.06535 {"time":"2025-11-21T04:14:20.065313124Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":2,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/notifications/count","full_path":"/api/v1/notifications/count","trace_id":"0fc8555b-a5b8-45e2-a318-779fec3f1904"}
2025-11-21_04:14:20.07858 {"time":"2025-11-21T04:14:20.078390409Z","level":"INFO","msg":"http request","trace_id":"22b6070c9d244bd29021ba533504607e","method":"GET","url":"http://127.0.0.1:8088/api/v1/namespace/root","status":200,"latency(ms)":1}
2025-11-21_04:14:20.08307 {"time":"2025-11-21T04:14:20.082954499Z","level":"INFO","msg":"Get model succeed","model":"Qwen2.5-0.5B-Instruct"}
2025-11-21_04:14:20.08315 {"time":"2025-11-21T04:14:20.083048404Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":0,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/version","full_path":"/api/v1/version","trace_id":"f4a7d78a-f772-41f3-8181-e3063b67a060"}
2025-11-21_04:14:20.08360 {"time":"2025-11-21T04:14:20.083404138Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":15,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/models/root/Qwen2.5-0.5B-Instruct","full_path":"/api/v1/models/:namespace/:name","trace_id":"8c23ffa0-e3ec-43e0-8b6f-3b216ba37663"}
2025-11-21_04:14:20.08620 {"time":"2025-11-21T04:14:20.086037329Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":3,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/notifications/poll/1?timezone=Asia/Shanghai","full_path":"/api/v1/notifications/poll/:limit","trace_id":"1acc6d74-3155-4f62-8e0c-0e84b8c8bc5d"}
2025-11-21_04:14:29.38229 {"time":"2025-11-21T04:14:29.382229348Z","level":"INFO","msg":"http request","ip":"172.23.0.1","method":"POST","latency(ms)":1,"status":200,"current_user":"","auth_type":"ApiKey","url":"/api/v1/webhook/runner","full_path":"/api/v1/webhook/runner","trace_id":"97872e6f1b2b4261b0d3e67e3291a2e2"}
2025-11-21_04:14:29.38260 {"time":"2025-11-21T04:14:29.382391536Z","level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.change","event_time":1763698469,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"service_name":"f4ragh1harr4","status":21,"endpoint":"http://f4ragh1harr4.spaces.app.internal","message":"","reason":"","task_id":6}}}
2025-11-21_04:14:30.09887 {"time":"2025-11-21T04:14:30.098812143Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":7,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/models/root/Qwen2.5-0.5B-Instruct/run/2","full_path":"/api/v1/models/:namespace/:name/run/:id","trace_id":"3750b1d5-ffd7-4714-8fbf-21cb0069d6e7"}
2025-11-21_04:14:33.70744 {"time":"2025-11-21T04:14:33.707384037Z","level":"INFO","msg":"http request","ip":"172.23.0.1","method":"POST","latency(ms)":0,"status":200,"current_user":"","auth_type":"ApiKey","url":"/api/v1/webhook/runner","full_path":"/api/v1/webhook/runner","trace_id":"848b33fda49e44409b2158b972d7e262"}
2025-11-21_04:14:33.70801 {"time":"2025-11-21T04:14:33.707968979Z","level":"INFO","msg":"cluster_event_received","event":{"event_type":"runner.cluster.update","event_time":1763698473,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","cluster_config":"config","region":"cn-north-1","zone":"","provider":"","enable":false,"storage_class":"","status":"Running","endpoint":"http://runner.trainpla.local:30080","network_interface":"","mode":"incluster","app_endpoint":"http://10.43.169.116"}}}
2025-11-21_04:14:33.70806 {"time":"2025-11-21T04:14:33.708018464Z","level":"INFO","msg":"processing cluster event","event":{"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","cluster_config":"config","region":"cn-north-1","zone":"","provider":"","enable":false,"storage_class":"","status":"Running","endpoint":"http://runner.trainpla.local:30080","network_interface":"","mode":"incluster","app_endpoint":"http://10.43.169.116"}}
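- Is this a dedicated instance deployment?
- The logs show no error indicating that creation failed.
- In k8s, run kubectl -n spaces get ksvc to check the status of the service.
- Run kubectl -n spaces get po to check whether the pod was created, then run kubectl -n spaces logs pod_name to read the pod's startup log.
- If no log shows up on the page, check the status of the loki service, then check the server logs for loki-related errors.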
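The last check in the list above (looking through the server logs for loki-related errors) can be partly automated. The following is a minimal sketch, assuming every server log line has the shape seen in this thread: a timestamp prefix followed by one JSON object. The function name `find_suspect_entries` and the keyword list are illustrative and not part of CSGHub.

```python
import json

def find_suspect_entries(lines, keywords=("loki", "log collector")):
    """Return (level, msg) pairs for WARN/ERROR entries or entries mentioning a keyword.

    Assumes each line looks like '<timestamp> {json...}', as in the server log above.
    """
    hits = []
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue  # not a JSON log line (for example a stack trace line)
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # truncated or non-JSON content
        level = entry.get("level", "")
        msg = entry.get("msg", "")
        text = line.lower()
        if level in ("WARN", "ERROR") or any(k in text for k in keywords):
            hits.append((level, msg))
    return hits


# Two lines copied from the server log in this thread.
sample = [
    '2025-11-21_04:14:19.10267 {"time":"2025-11-21T04:14:19.102607249Z","level":"WARN","msg":"Log entry dropped because log collector is not ready"}',
    '2025-11-21_04:14:19.77182 {"time":"2025-11-21T04:14:19.771782566Z","level":"INFO","msg":"Get space resources successfully"}',
]
for level, msg in find_suspect_entries(sample):
    print(level, msg)
```

Run against the server log posted above, a filter like this would surface the WARN entry at 04:14:19 ("Log entry dropped because log collector is not ready"), which is easy to miss among the INFO request lines.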
1. Yes, it is a dedicated instance deployment.
3. See below. The service exists, but its status is failed:
$ kubectl -n spaces get ksvc
NAME           URL                                       LATESTCREATED        LATESTREADY   READY   REASON
f4ragh1harr4   http://f4ragh1harr4.spaces.app.internal   f4ragh1harr4-00001                 False   RevisionMissing
4. See below:
$ kubectl -n spaces get po
No resources found in spaces namespace.
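With READY at False, reason RevisionMissing, and no pods in the namespace, the status conditions of the Knative Service usually carry more detail than the table view. The following is a minimal sketch, assuming the output of `kubectl -n spaces get ksvc f4ragh1harr4 -o json` is available as parsed JSON. `status.conditions` is the standard Knative status field; the helper name `unready_conditions` and the sample payload (including its message text) are made up for illustration.

```python
import json

def unready_conditions(ksvc):
    """List (type, reason, message) for every condition whose status is not 'True'."""
    conditions = ksvc.get("status", {}).get("conditions", [])
    return [
        (c.get("type", ""), c.get("reason", ""), c.get("message", ""))
        for c in conditions
        if c.get("status") != "True"
    ]


# Hypothetical payload shaped like `kubectl get ksvc <name> -o json`;
# the condition values and message text are invented for this example.
sample = json.loads("""
{
  "metadata": {"name": "f4ragh1harr4", "namespace": "spaces"},
  "status": {
    "conditions": [
      {"type": "ConfigurationsReady", "status": "True"},
      {"type": "Ready", "status": "False", "reason": "RevisionMissing",
       "message": "example message"},
      {"type": "RoutesReady", "status": "False", "reason": "RevisionMissing",
       "message": "example message"}
    ]
  }
}
""")

for ctype, reason, message in unready_conditions(sample):
    print(f"{ctype}: {reason} ({message})")
```

Against a real cluster, the input could come from `json.load(sys.stdin)` with the kubectl output piped in. The same helper should also work on the JSON of the revision or configuration objects, since they expose conditions in the same shape.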
5. The loki status looks normal:
$ csghub-ctl status
run: accounting: (pid 1688) 10182s; run: log: (pid 1132) 10183s
run: casdoor: (pid 1593) 10182s; run: log: (pid 1117) 10183s
run: dataviewer: (pid 2279) 10181s; run: log: (pid 1119) 10183s
run: gitaly: (pid 1135) 10183s; run: log: (pid 1116) 10183s
run: gitlab_shell: (pid 1154) 10183s; run: log: (pid 1122) 10183s
run: loki: (pid 317904) 0s; run: log: (pid 1137) 10183s
I searched the entire server current log file and found no loki-related errors.
The loki service itself, however, keeps logging the following in a loop:
2025-11-21_06:29:51.30929 open /var/opt/csghub/loki/tsdb-shipper-active/uploader/name: permission denied
2025-11-21_06:29:51.30936 error initialising module: store
2025-11-21_06:29:51.30937 github.com/grafana/dskit/modules.(*Manager).initModule
2025-11-21_06:29:51.30937 /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:138
2025-11-21_06:29:51.30938 github.com/grafana/dskit/modules.(*Manager).InitModuleServices
2025-11-21_06:29:51.30938 /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108
2025-11-21_06:29:51.30939 github.com/grafana/loki/v3/pkg/loki.(*Loki).Run
2025-11-21_06:29:51.30939 /src/loki/pkg/loki/loki.go:531
2025-11-21_06:29:51.30939 main.main
2025-11-21_06:29:51.30940 /src/loki/cmd/loki/main.go:129
2025-11-21_06:29:51.30940 runtime.main
2025-11-21_06:29:51.30941 /usr/local/go/src/runtime/proc.go:283
2025-11-21_06:29:51.30941 runtime.goexit
2025-11-21_06:29:51.30941 /usr/local/go/src/runtime/asm_amd64.s:1700
2025-11-21_06:29:51.30945 level=info ts=2025-11-21T06:29:51.290986327Z caller=main.go:126 msg="Starting Loki" version="(version=3.5.7, branch=release-3.5.x, revision=d5b382b9)"
2025-11-21_06:29:51.30945 level=info ts=2025-11-21T06:29:51.291155749Z caller=main.go:127 msg="Loading configuration file" filename=/var/opt/csghub/loki/loki-config.yaml
2025-11-21_06:29:51.30945 level=info ts=2025-11-21T06:29:51.302365222Z caller=server.go:368 msg="server listening on addresses" http=[::]:3100 grpc=[::]:9095
2025-11-21_06:29:51.30945 level=info ts=2025-11-21T06:29:51.308900688Z caller=table_manager.go:136 index-store=tsdb-2025-10-24 msg="uploading tables"
2025-11-21_06:29:51.30946 level=info ts=2025-11-21T06:29:51.30889935Z caller=table_manager.go:300 index-store=tsdb-2025-10-24 msg="query readiness setup completed" duration=3.702µs distinct_users_len=0 distinct_users=
2025-11-21_06:29:51.30948 level=info ts=2025-11-21T06:29:51.309001191Z caller=shipper.go:165 index-store=tsdb-2025-10-24 msg="starting index shipper in RW mode"
2025-11-21_06:29:51.30948 level=error ts=2025-11-21T06:29:51.309254046Z caller=log.go:223 msg="error running loki" err="open /var/opt/csghub/loki/tsdb-shipper-active/uploader/name: permission denied\nerror initialising module: store\ngithub.com/grafana/dskit/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:138\ngithub.com/grafana/dskit/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108\ngithub.com/grafana/loki/v3/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:531\nmain.main\n\t/src/loki/cmd/loki/main.go:129\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:283\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1700"
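- Check whether the logcollector component was installed and is running properly when the remote runner was installed.
- Restart the service with the start and stop buttons in the portal, and watch the pod status and logs while doing so.
- The loki service reports permission denied, which looks like a directory from the installation or configuration that lacks the required permissions.
@MasonXon please take a look at the loki deployment permission problem.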
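The permission denied error names a path under /var/opt/csghub/loki. One way to narrow it down is to list every entry under that directory that the current user cannot write, together with its mode and owner. The following is a minimal sketch, meant to be run inside the csghub container as the user the loki process runs as (which user that is remains an assumption to verify). The helper name `report_unwritable` is illustrative.

```python
import os
import stat

def report_unwritable(root):
    """Return (path, detail) pairs for entries under `root` the current user cannot write."""
    problems = []

    def on_error(err):
        # os.walk could not list this directory (missing or no read permission)
        problems.append((err.filename, f"cannot list: {err.strerror}"))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_error):
        candidates = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        for path in candidates:
            if os.access(path, os.W_OK):
                continue  # writable for the current user, nothing to report
            try:
                st = os.lstat(path)
                detail = f"{stat.filemode(st.st_mode)} uid={st.st_uid} gid={st.st_gid}"
            except OSError as err:
                detail = f"cannot stat: {err.strerror}"
            problems.append((path, detail))
    return problems
```

A call such as `report_unwritable("/var/opt/csghub/loki")` returns an empty list when everything is writable. Note that a check run as root reports nothing, because root passes the access test regardless of file modes, so the result is only meaningful when run as the service user.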
logcollector is disabled by default because this component needs to connect to loki. It can be enabled in the runner chart:
logcollector:
enabled: true
loki:
address: "<your csghub external>/-/loki"
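1. logcollector was not enabled before. I added - '3100:3100' # Loki to the csghub docker-compose file to expose the port, then updated the runner service with the following configuration and upgraded it: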
logcollector:
enabled: true
loki:
address: "http://10.1.110.47:3100/-/loki"
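loki still shows the error posted earlier, and its status often changes to down:
down: loki: 0s, normally up, want up; run: log: (pid 1121) 5162s
Checking the pod status shows that the log service (logcollector) cannot start:
$ kubectl get pod -n csghub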
NAME READY STATUS RESTARTS AGE
runner-ingress-nginx-controller-79c4fc4f6f-rqh56 1/1 Running 0 6h38m
runner-logcollector-7cf9bdc7-8rb69 0/1 Init:0/1 0 84m
runner-logcollector-849ccffdcf-s6lzs 0/1 Init:0/1 0 84m
runner-reloader-55d677d5c5-gzjg6 1/1 Running 0 6h38m
runner-runner-bb57d6d96-ws9jd 1/1 Running 0 84m
The error log shows that it cannot connect to the loki service:
Connecting to 10.1.110.47:3100 (10.1.110.47:3100)
wget: can't connect to remote host (10.1.110.47): Connection refused
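Connection refused means nothing accepted the TCP connection on that host and port, which is consistent with loki failing to start. The same check can be repeated from any machine without wget. The following is a minimal sketch using only the Python standard library; the helper name `can_connect` is illustrative, and it tests TCP reachability only, not whether loki answers correctly on the /-/loki path.

```python
import socket
from urllib.parse import urlsplit

def can_connect(address, timeout=3.0):
    """Return True if a TCP connection to the host and port in `address` succeeds."""
    parts = urlsplit(address)
    host = parts.hostname
    if host is None:
        raise ValueError(f"no host found in address: {address!r}")
    # Fall back to the scheme's default port when the URL does not name one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

For the configuration in this thread, the call would be `can_connect("http://10.1.110.47:3100/-/loki")`. Running it once from the docker host and once from inside the cluster helps separate a loki that is not listening from a port mapping or network problem.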
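2. After stopping and restarting the service, the pod still does not exist and the svc status is still False. Because loki is also broken, no log directly related to the model is visible.
The server log during the stop and restart is as follows: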
2025-11-21_10:14:12.44929 {"time":"2025-11-21T10:14:12.44919019Z","level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.stop","event_time":1763720052,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"service_name":"f4s4smpr56o0","status":26,"endpoint":"","message":"","reason":"","task_id":0}}}
2025-11-21_10:14:12.44936 {"time":"2025-11-21T10:14:12.449275975Z","level":"INFO","msg":"http request","ip":"172.23.0.1","method":"POST","latency(ms)":2,"status":200,"current_user":"","auth_type":"ApiKey","url":"/api/v1/webhook/runner","full_path":"/api/v1/webhook/runner","trace_id":"a86143a511014ff781d4d0b38040ee0a"}
2025-11-21_10:14:12.44938 {"time":"2025-11-21T10:14:12.449276128Z","level":"INFO","msg":"http request","ip":"172.23.0.1","method":"POST","latency(ms)":1,"status":200,"current_user":"","auth_type":"ApiKey","url":"/api/v1/webhook/runner","full_path":"/api/v1/webhook/runner","trace_id":"b11e01ae24b4497c9bc2b524ff057e33"}
2025-11-21_10:14:12.45519 {"time":"2025-11-21T10:14:12.455142023Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"PUT","latency(ms)":30,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/models/root/Qwen2.5-0.5B-Instruct/run/4/stop","full_path":"/api/v1/models/:namespace/:name/run/:id/stop","trace_id":"747982cc-e78a-4cd1-bea0-0aff7cf3d0ba"}
2025-11-21_10:14:13.45053 {"time":"2025-11-21T10:14:13.450415817Z","level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.stop","event_time":1763720052,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"service_name":"f4s4smpr56o0","status":26,"endpoint":"","message":"","reason":"","task_id":0}}}
2025-11-21_10:14:14.45251 {"time":"2025-11-21T10:14:14.452393925Z","level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.stop","event_time":1763720052,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"service_name":"f4s4smpr56o0","status":26,"endpoint":"","message":"","reason":"","task_id":0}}}
2025-11-21_10:14:15.45441 {"time":"2025-11-21T10:14:15.454313401Z","level":"ERROR","msg":"webhook dispatch a single msg with 3 retries","subject":"webhook.event.runner","msg.data":"{\"event_type\":\"runner.service.stop\",\"event_time\":1763720052,\"cluster_id\":\"f90e1c6c-799d-4c24-a342-ee900e0b4950\",\"runner_name\":\"\",\"data_type\":\"object\",\"data\":{\"service_name\":\"f4s4smpr56o0\",\"status\":26,\"endpoint\":\"\",\"message\":\"\",\"reason\":\"\",\"task_id\":0}}","error":"failed to process webhook event by *executors.kserviceExecutorImpl error: failed to update deploy status in webhook error: failed to get deploy task by task id 0 in webhook error: SYS-ERR-3: sql: no rows in result set"}
2025-11-21_10:14:15.45459 {"time":"2025-11-21T10:14:15.45454208Z","level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.stop","event_time":1763720052,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"service_name":"f4s4smpr56o0","status":26,"endpoint":"","message":"","reason":"","task_id":0}}}
2025-11-21_10:14:16.45601 {"time":"2025-11-21T10:14:16.455876989Z","level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.stop","event_time":1763720052,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"service_name":"f4s4smpr56o0","status":26,"endpoint":"","message":"","reason":"","task_id":0}}}
2025-11-21_10:14:16.86269 {"time":"2025-11-21T10:14:16.862622869Z","level":"WARN","msg":"fail to get deploy replica with error","req":{"id":4,"org_name":"root","repo_name":"Qwen2.5-0.5B-Instruct","cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","svc_name":"f4s4smpr56o0","need_details":false,"deploy_type":0},"error":"SYS-ERR-1: unexpected http status: 404, error: map[error:service not exist]"}
2025-11-21_10:14:16.86277 {"time":"2025-11-21T10:14:16.862676877Z","level":"WARN","msg":"fail to get deploy replica","repotype":"model","req":{"deploy_id":4,"namespace":"root","name":"Qwen2.5-0.5B-Instruct","status":"","repo_model_id":1,"svc_name":"f4s4smpr56o0","created_at":"0001-01-01T00:00:00Z","updated_at":"0001-01-01T00:00:00Z","cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","private":false},"error":"SYS-ERR-1: unexpected http status: 404, error: map[error:service not exist]"}
2025-11-21_10:14:16.86872 {"time":"2025-11-21T10:14:16.86864478Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":15,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/models/root/Qwen2.5-0.5B-Instruct/run/4","full_path":"/api/v1/models/:namespace/:name/run/:id","trace_id":"4e06e938-1c5b-4344-8e1e-327ed7b34b06"}
2025-11-21_10:14:17.45818 {"time":"2025-11-21T10:14:17.458104938Z","level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.stop","event_time":1763720052,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"service_name":"f4s4smpr56o0","status":26,"endpoint":"","message":"","reason":"","task_id":0}}}
2025-11-21_10:14:18.45974 {"time":"2025-11-21T10:14:18.459627684Z","level":"ERROR","msg":"webhook dispatch a single msg with 3 retries","subject":"webhook.event.runner","msg.data":"{\"event_type\":\"runner.service.stop\",\"event_time\":1763720052,\"cluster_id\":\"f90e1c6c-799d-4c24-a342-ee900e0b4950\",\"runner_name\":\"\",\"data_type\":\"object\",\"data\":{\"service_name\":\"f4s4smpr56o0\",\"status\":26,\"endpoint\":\"\",\"message\":\"\",\"reason\":\"\",\"task_id\":0}}","error":"failed to process webhook event by *executors.kserviceExecutorImpl error: failed to update deploy status in webhook error: failed to get deploy task by task id 0 in webhook error: SYS-ERR-3: sql: no rows in result set"}
2025-11-21_10:14:19.91945 {"time":"2025-11-21T10:14:19.919389251Z","level":"INFO","msg":"http request","ip":"172.23.0.1","method":"POST","latency(ms)":0,"status":200,"current_user":"","auth_type":"ApiKey","url":"/api/v1/webhook/runner","full_path":"/api/v1/webhook/runner","trace_id":"7359f891752d4b2b834a2b8ea0ca26fb"}
2025-11-21_10:14:19.92012 {"time":"2025-11-21T10:14:19.920035978Z","level":"INFO","msg":"cluster_event_received","event":{"event_type":"runner.cluster.update","event_time":1763720059,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","cluster_config":"config","region":"cn-north-1","zone":"","provider":"","enable":false,"storage_class":"","status":"Running","endpoint":"http://runner.trainpla.local:30080","network_interface":"","mode":"incluster","app_endpoint":"http://10.43.169.116"}}}
2025-11-21_10:14:19.92016 {"time":"2025-11-21T10:14:19.92011065Z","level":"INFO","msg":"processing cluster event","event":{"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","cluster_config":"config","region":"cn-north-1","zone":"","provider":"","enable":false,"storage_class":"","status":"Running","endpoint":"http://runner.trainpla.local:30080","network_interface":"","mode":"incluster","app_endpoint":"http://10.43.169.116"}}
2025-11-21_10:14:20.09469 {"time":"2025-11-21T10:14:20.094493294Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"PUT","latency(ms)":25,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/models/root/Qwen2.5-0.5B-Instruct/run/4/start","full_path":"/api/v1/models/:namespace/:name/run/:id/start","trace_id":"62b457a0-639e-40ef-8428-870e35a7a578"}
2025-11-21_10:14:20.16496 {"time":"2025-11-21T10:14:20.164906808Z","level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.create","event_time":1763720060,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"service_name":"f4s4smpr56o0","status":20,"endpoint":"","message":"","reason":"create","task_id":13}}}
2025-11-21_10:14:20.16500 {"time":"2025-11-21T10:14:20.164948261Z","level":"INFO","msg":"http request","ip":"172.23.0.1","method":"POST","latency(ms)":0,"status":200,"current_user":"","auth_type":"ApiKey","url":"/api/v1/webhook/runner","full_path":"/api/v1/webhook/runner","trace_id":"9098575baff14e64814efe68281dfbea"}
2025-11-21_10:14:20.18915 {"time":"2025-11-21T10:14:20.189077125Z","level":"INFO","msg":"http request","ip":"172.23.0.1","method":"POST","latency(ms)":0,"status":200,"current_user":"","auth_type":"ApiKey","url":"/api/v1/webhook/runner","full_path":"/api/v1/webhook/runner","trace_id":"b15d030f649847a1802184b8c2688d7e"}
2025-11-21_10:14:20.18922 {"time":"2025-11-21T10:14:20.18919598Z","level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.create","event_time":1763720060,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"service_name":"f4s4smpr56o0","status":20,"endpoint":"","message":"","reason":"create","task_id":13}}}
2025-11-21_10:14:21.86676 {"time":"2025-11-21T10:14:21.866704428Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":8,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/models/root/Qwen2.5-0.5B-Instruct/run/4","full_path":"/api/v1/models/:namespace/:name/run/:id","trace_id":"fc4c5347-4e10-48f4-81d2-5c20b82ed8a2"}
2025-11-21_10:14:25.43615 {"time":"2025-11-21T10:14:25.436072864Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":1,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/notifications/count","full_path":"/api/v1/notifications/count","trace_id":"bd3bab60-35d0-4f01-80a4-f034379eeee2"}
2025-11-21_10:14:25.43728 {"time":"2025-11-21T10:14:25.437224237Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":2,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/notifications/poll/1?timezone=Asia/Shanghai","full_path":"/api/v1/notifications/poll/:limit","trace_id":"21816ae6-6123-4eee-b90d-97bf9e918c23"}
2025-11-21_10:14:30.26625 {"time":"2025-11-21T10:14:30.266161677Z","level":"INFO","msg":"http request","ip":"172.23.0.1","method":"POST","latency(ms)":2,"status":200,"current_user":"","auth_type":"ApiKey","url":"/api/v1/webhook/runner","full_path":"/api/v1/webhook/runner","trace_id":"b2c4575efa5c402eaa9907df292fbf03"}
2025-11-21_10:14:30.26641 {"time":"2025-11-21T10:14:30.266311263Z","level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.change","event_time":1763720070,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"service_name":"f4s4smpr56o0","status":21,"endpoint":"http://f4s4smpr56o0.spaces.app.internal","message":"","reason":"","task_id":13}}}
2025-11-21_10:14:31.87839 {"time":"2025-11-21T10:14:31.878309826Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":9,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/models/root/Qwen2.5-0.5B-Instruct/run/4","full_path":"/api/v1/models/:namespace/:name/run/:id","trace_id":"e08dc476-ac60-407d-bad2-32bc8015f1c5"}
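The deploy events are easier to follow when grouped per service. The following is a minimal sketch, assuming the same line shape as the server log above (timestamp prefix plus one JSON object). The function name `deploy_event_timeline` is illustrative, and the sample lines are abbreviated versions of entries in the log above, keeping only the fields the function reads.

```python
import json
from collections import defaultdict

def deploy_event_timeline(lines):
    """Group deploy_event_received entries by service name.

    Returns {service_name: [(event_type, status, task_id), ...]} in log order,
    skipping consecutive duplicates (the log repeats retried events).
    """
    timeline = defaultdict(list)
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue  # no JSON payload on this line
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if entry.get("msg") != "deploy_event_received":
            continue
        event = entry.get("event", {})
        data = event.get("data", {})
        item = (event.get("event_type"), data.get("status"), data.get("task_id"))
        events = timeline[data.get("service_name")]
        if not events or events[-1] != item:
            events.append(item)
    return dict(timeline)


# Abbreviated from the server log above; only the fields used here are kept.
sample = [
    '2025-11-21_10:14:12.44929 {"level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.stop","data":{"service_name":"f4s4smpr56o0","status":26,"task_id":0}}}',
    '2025-11-21_10:14:13.45053 {"level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.stop","data":{"service_name":"f4s4smpr56o0","status":26,"task_id":0}}}',
    '2025-11-21_10:14:20.16496 {"level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.create","data":{"service_name":"f4s4smpr56o0","status":20,"task_id":13}}}',
    '2025-11-21_10:14:30.26641 {"level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.change","data":{"service_name":"f4s4smpr56o0","status":21,"task_id":13}}}',
]
for service, events in deploy_event_timeline(sample).items():
    print(service, events)
```

In this log the stop events carry task_id 0, which matches the ERROR entries reporting "failed to get deploy task by task id 0", while the create and change events after the restart carry task_id 13.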
The runner-runner service log is as follows:
{"time":"2025-11-21T10:14:12.446629757Z","level":"WARN","msg":"Log entry dropped because log collector is not ready"}
{"time":"2025-11-21T10:14:12.447715955Z","level":"WARN","msg":"Log entry dropped because log collector is not ready"}
{"time":"2025-11-21T10:14:12.447744091Z","level":"INFO","msg":"service deleted by request","req":{"id":4,"org_name":"root","repo_name":"Qwen2.5-0.5B-Instruct","cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","svc_name":"f4s4smpr56o0"}}
{"time":"2025-11-21T10:14:12.447799614Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"POST","latency(ms)":18,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f4s4smpr56o0/stop","full_path":"/api/v1/service/:service/stop","trace_id":"8509859971599aa2eb66cfc0bf06776e"}
{"time":"2025-11-21T10:14:12.449769908Z","level":"INFO","msg":"http request","trace_id":"a86143a511014ff781d4d0b38040ee0a","method":"POST","url":"http://10.1.110.47/api/v1/webhook/runner","status":200,"latency(ms)":3}
{"time":"2025-11-21T10:14:12.449791027Z","level":"INFO","msg":"http request","trace_id":"b11e01ae24b4497c9bc2b524ff057e33","method":"POST","url":"http://10.1.110.47/api/v1/webhook/runner","status":200,"latency(ms)":2}
{"time":"2025-11-21T10:14:12.453274146Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":3,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f4s4smpr56o0/get","full_path":"/api/v1/service/:service/get","trace_id":"52fcaeb3e3f8cdac59b79d969a6225a5"}
{"time":"2025-11-21T10:14:16.84329755Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":5,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f4s4smpr56o0/get","full_path":"/api/v1/service/:service/get","trace_id":"41179393d8edd1316ea680c4ca8acdb1"}
{"time":"2025-11-21T10:14:16.862222732Z","level":"ERROR","msg":"service not exist"}
{"time":"2025-11-21T10:14:16.862284574Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":4,"status":404,"current_user":"","auth_type":"","url":"/api/v1/service/f4s4smpr56o0/replica","full_path":"/api/v1/service/:service/replica","trace_id":"743c4bcd9f0cff5c3201fa5f12d7fff8"}
{"time":"2025-11-21T10:14:16.867647659Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":3,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f4s4smpr56o0/get","full_path":"/api/v1/service/:service/get","trace_id":"e20cd1762fa2f1cbc93f1aa61cc26382"}
{"time":"2025-11-21T10:14:19.913232637Z","level":"INFO","msg":"webhook endpoint is updated","cluster":"config","endpoint":"http://10.1.110.47"}
{"time":"2025-11-21T10:14:19.917477397Z","level":"WARN","msg":"kourier-system/kourier service does not have external IP and try to read clusterIP","ingress":null}
{"time":"2025-11-21T10:14:19.917514343Z","level":"INFO","msg":"kourier-system/kourier service does not have external IP and use clusterIP","clusterIP":"10.43.169.116"}
{"time":"2025-11-21T10:14:19.917529606Z","level":"INFO","msg":"report_event_configmap_update","event":{"event_type":"runner.cluster.update","event_time":1763720059,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","cluster_config":"config","region":"cn-north-1","zone":"","provider":"","enable":false,"storage_class":"","status":"Running","endpoint":"http://runner.trainpla.local:30080","network_interface":"","mode":"incluster","app_endpoint":"http://10.43.169.116"}}}
{"time":"2025-11-21T10:14:19.919748893Z","level":"INFO","msg":"http request","trace_id":"7359f891752d4b2b834a2b8ea0ca26fb","method":"POST","url":"http://10.1.110.47/api/v1/webhook/runner","status":200,"latency(ms)":2}
{"time":"2025-11-21T10:14:20.085330081Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":10,"status":200,"current_user":"","auth_type":"","url":"/api/v1/cluster/f90e1c6c-799d-4c24-a342-ee900e0b4950","full_path":"/api/v1/cluster/:id","trace_id":"6bfcf5b758fa9e74724ed9edfba1fed3"}
{"time":"2025-11-21T10:14:20.090637777Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":3,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f4s4smpr56o0/get","full_path":"/api/v1/service/:service/get","trace_id":"d36009de02b7a96d0d0723b64d1b67a0"}
W1121 10:14:20.162506 1 warnings.go:70] Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
{"time":"2025-11-21T10:14:20.163617818Z","level":"INFO","msg":"service created successfully","svc_name":"f4s4smpr56o0","deploy_id":4}
{"time":"2025-11-21T10:14:20.163685903Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"POST","latency(ms)":21,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f4s4smpr56o0/run","full_path":"/api/v1/service/:service/run","trace_id":"cbeb0c0facee587ef527e12293fe5005"}
{"time":"2025-11-21T10:14:20.165344472Z","level":"INFO","msg":"http request","trace_id":"9098575baff14e64814efe68281dfbea","method":"POST","url":"http://10.1.110.47/api/v1/webhook/runner","status":200,"latency(ms)":2}
{"time":"2025-11-21T10:14:20.187049801Z","level":"ERROR","msg":"failed to get deployment by svc name","service":"f4s4smpr56o0","error":"fail to get deployment list by selector serving.knative.dev/service=f4s4smpr56o0, error: %!w(<nil>)"}
{"time":"2025-11-21T10:14:20.187680058Z","level":"WARN","msg":"Log entry dropped because log collector is not ready"}
{"time":"2025-11-21T10:14:20.189521487Z","level":"INFO","msg":"http request","trace_id":"b15d030f649847a1802184b8c2688d7e","method":"POST","url":"http://10.1.110.47/api/v1/webhook/runner","status":200,"latency(ms)":2}
{"time":"2025-11-21T10:14:21.849864031Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":1,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f4s4smpr56o0/get","full_path":"/api/v1/service/:service/get","trace_id":"c34d565c3934db22c7b1f6717b96efe3"}
{"time":"2025-11-21T10:14:21.86296852Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":0,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f4s4smpr56o0/replica","full_path":"/api/v1/service/:service/replica","trace_id":"bfdeba3c42264a51aaa183f8d1f0897f"}
{"time":"2025-11-21T10:14:21.865671743Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":0,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f4s4smpr56o0/get","full_path":"/api/v1/service/:service/get","trace_id":"62844ef2db93474fa691ec8a64a84788"}
{"time":"2025-11-21T10:14:26.854545649Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":0,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f4s4smpr56o0/get","full_path":"/api/v1/service/:service/get","trace_id":"0da78ad23931bf750dd5019dee3a7e61"}
{"time":"2025-11-21T10:14:30.261769556Z","level":"ERROR","msg":"failed to get deployment ","service":"f4s4smpr56o0","error":"fail to get deployment list by selector serving.knative.dev/service=f4s4smpr56o0, error: %!w(<nil>)"}
{"time":"2025-11-21T10:14:30.262390037Z","level":"WARN","msg":"Log entry dropped because log collector is not ready"}
{"time":"2025-11-21T10:14:30.266739225Z","level":"INFO","msg":"http request","trace_id":"b2c4575efa5c402eaa9907df292fbf03","method":"POST","url":"http://10.1.110.47/api/v1/webhook/runner","status":200,"latency(ms)":4}
{"time":"2025-11-21T10:14:31.859760025Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":0,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f4s4smpr56o0/get","full_path":"/api/v1/service/:service/get","trace_id":"136adb941b1c997be5ec063ddca23e46"}
{"time":"2025-11-21T10:14:31.874009424Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":0,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f4s4smpr56o0/replica","full_path":"/api/v1/service/:service/replica","trace_id":"595c6245bcaebb645f547d064230a40e"}
{"time":"2025-11-21T10:14:31.877233608Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":1,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f4s4smpr56o0/get","full_path":"/api/v1/service/:service/get","trace_id":"f3264b02985621c0560536445be8d672"}
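I don't see anything useful in these logs. I think two questions matter most right now:
1. Why does Loki hit a permission problem, and how do I get it to start and record logs normally?
2. Deploying a service presumably means downloading a Docker image, building it, pushing it to the local Docker registry, and then deploying. Does that happen in csghub or in the runner-runner service?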
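To pick the few entries that matter out of a runner log like the one above, a short filter helps. The following is a minimal Python sketch, not part of CSGHub. It assumes one JSON object per line with `level`, `msg` and optionally `error` fields, as in the log above, and it skips non-JSON lines such as the klog warning. The function name `notable_entries` is made up for illustration, and the klog sample line is truncated. On this log it surfaces the `failed to get deployment` errors. Their `%!w(<nil>)` suffix is what Go's fmt package prints when a nil error is passed to the `%w` verb, so the lookup probably found no Deployment rather than hitting an API error (an inference, not confirmed by the maintainers).

```python
import json


def notable_entries(lines, levels=("WARN", "ERROR")):
    """Return (time, level, msg, error) for JSON log lines at the given levels."""
    found = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON entry, e.g. a klog line from the Kubernetes client
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or wrapped line
        if entry.get("level") in levels:
            found.append(
                (entry.get("time"), entry["level"], entry.get("msg"), entry.get("error"))
            )
    return found


# Sample lines copied from the runner log above (the klog line is truncated).
sample = [
    '{"time":"2025-11-21T10:14:20.163617818Z","level":"INFO","msg":"service created successfully","svc_name":"f4s4smpr56o0","deploy_id":4}',
    'W1121 10:14:20.162506 1 warnings.go:70] Kubernetes default value is insecure',
    '{"time":"2025-11-21T10:14:20.187049801Z","level":"ERROR","msg":"failed to get deployment by svc name","service":"f4s4smpr56o0","error":"fail to get deployment list by selector serving.knative.dev/service=f4s4smpr56o0, error: %!w(<nil>)"}',
    '{"time":"2025-11-21T10:14:20.187680058Z","level":"WARN","msg":"Log entry dropped because log collector is not ready"}',
]

for when, level, msg, err in notable_entries(sample):
    print(level, when, msg, err or "")
```

Feeding it the full runner log instead of `sample` (for example, the lines of a saved log file) keeps only the WARN and ERROR entries.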
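- This ERROR line in the server log shows task_id=0. After the ksvc is created, check whether its annotations contain a task_id value.
  2025-11-21_10:14:15.45441 {"time":"2025-11-21T10:14:15.454313401Z","level":"ERROR","msg":"webhook dispatch a single msg with 3 retries","subject":"webhook.event.runner","msg.data":"{\"event_type\":\"runner.service.stop\",\"event_time\":1763720052,\"cluster_id\":\"f90e1c6c-799d-4c24-a342-ee900e0b4950\",\"runner_name\":\"\",\"data_type\":\"object\",\"data\":{\"service_name\":\"f4s4smpr56o0\",\"status\":26,\"endpoint\":\"\",\"message\":\"\",\"reason\":\"\",\"task_id\":0}}","error":"failed to process webhook event by *executors.kserviceExecutorImpl error: failed to update deploy status in webhook error: failed to get deploy task by task id 0 in webhook error: SYS-ERR-3: sql: no rows in result set"}
- The WARN entries in the runner log show that the log collection component is not working.
  {"time":"2025-11-21T10:14:30.262390037Z","level":"WARN","msg":"Log entry dropped because log collector is not ready"}
- We still need to find out why Loki did not start, so check its logs. Why does Loki have a permission problem? Could it be a directory permission issue?
- The build step of a deployment runs inside k8s. Check whether the Argo component is installed; the cluster needs both Knative and Argo.
  a. Right after deploying the service, run kubectl -n spaces get po. A pod whose name starts with sib should be running the build; check its logs.
  b. After the ksvc is created, run kubectl -n spaces get po again, then check whether the corresponding pod is running and what its logs say.
- Check whether the logcollector component was installed and is running properly when the remote runner was installed.
- Restart the service with the start and stop buttons on the portal while watching the pod status and logs.
- Loki reported a permission denied error, which looks like an installation or configuration directory without the right permissions.
@MasonXon please look into the Loki deployment permission problem.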
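The task_id=0 check can also be run over the whole server log instead of a single line. Below is a minimal Python sketch under stated assumptions: each server log line is an optional timestamp prefix followed by one JSON object, and the runner event sits either in an `event` object or as a JSON string in `msg.data`, as in the ERROR line quoted above. The function name `zero_task_events` is hypothetical, and the sample lines are rebuilt from field values in the server log and shortened to the fields that matter here.

```python
import json


def zero_task_events(lines):
    """Return (service_name, event_type) for runner events whose task_id is 0."""
    hits = []
    for line in lines:
        start = line.find("{")  # skip the "2025-11-21_10:14:15.45441 " style prefix
        if start < 0:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        event = entry.get("event")
        if event is None and "msg.data" in entry:
            # The webhook ERROR entry carries the event as an escaped JSON string.
            try:
                event = json.loads(entry["msg.data"])
            except (TypeError, json.JSONDecodeError):
                continue
        if not isinstance(event, dict):
            continue
        data = event.get("data")
        if isinstance(data, dict) and data.get("task_id") == 0:
            hits.append((data.get("service_name"), event.get("event_type")))
    return hits


# Events rebuilt from values in the server log (only the relevant fields are kept).
stop_event = {
    "event_type": "runner.service.stop",
    "event_time": 1763720052,
    "data": {"service_name": "f4s4smpr56o0", "status": 26, "task_id": 0},
}
create_event = {
    "event_type": "runner.service.create",
    "event_time": 1763720060,
    "data": {"service_name": "f4s4smpr56o0", "status": 20, "task_id": 13},
}
sample = [
    "2025-11-21_10:14:15.45441 "
    + json.dumps({"level": "ERROR", "msg.data": json.dumps(stop_event)}),
    "2025-11-21_10:14:20.16496 "
    + json.dumps({"level": "INFO", "msg": "deploy_event_received", "event": create_event}),
]

print(zero_task_events(sample))
```

In the log from this thread only the `runner.service.stop` events carry task_id 0, while the later `runner.service.create` events carry a real task id.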
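I fixed the Loki permission problem yesterday. It was caused by a compatibility difference between operating systems.
As I understand it, the main problems in your deployment are probably these two:
- The runner cannot connect to the container registry when it builds the image (caused by a bug that is now fixed: creation of the registry storage bucket could occasionally fail).
- Pulling the image fails when the ksvc is deployed. The csghub registry is insecure, so when k3s tries to pull from it there is an HTTPS trust problem. For this, see the quick_install.sh script, which uses k3s for a quick deployment, and look at the part that configures the k3s insecure registry. As I recall, it generates a yaml file for k3s to load.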
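For reference, k3s reads private registry settings from /etc/rancher/k3s/registries.yaml and needs a restart to pick up changes. The sketch below renders such a file in the format the k3s documentation describes. It is an illustration only: whether quick_install.sh writes exactly this content has not been checked here, the helper name `render_k3s_registries_yaml` is made up, and `registry.example.internal:5000` is a placeholder host.

```python
def render_k3s_registries_yaml(registry, scheme="http", skip_tls_verify=False):
    """Render a k3s registries.yaml for a single private registry.

    scheme="http" is for a registry served without TLS; skip_tls_verify=True
    is for an HTTPS registry whose certificate k3s does not trust.
    """
    lines = [
        "mirrors:",
        f'  "{registry}":',
        "    endpoint:",
        f'      - "{scheme}://{registry}"',
    ]
    if skip_tls_verify:
        lines += [
            "configs:",
            f'  "{registry}":',
            "    tls:",
            "      insecure_skip_verify: true",
        ]
    return "\n".join(lines) + "\n"


# Placeholder registry address; replace it with the registry the runner pushes to.
print(render_k3s_registries_yaml("registry.example.internal:5000"))
```

A registry that already serves a certificate trusted by the nodes should not need either setting.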
@MasonXon 1. About Loki: I deployed csghub 0.12.0-ce with docker-compose. Did you update that same image version? Can I just delete my local image, pull it again, and redeploy?
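2. On your second point: I use my company's image registry with insecure: "false", and its domain supports HTTPS. Do I still need to follow the steps you described?
3. Deploying a model presumably pulls a base image from the public internet, packages the model into it, pushes the result to the local registry, and then deploys it to k8s. Does this require the csghub service or the runner service to reach the internet? Our company network is locked down by default, so I need to know which service needs outbound access in order to open it up.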
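Based on the retrieved content, I cannot directly answer your questions about the image update for a specific version, your company registry configuration, or the internet access requirements.
For the specific technical questions you raised:
- Regarding the Loki problem and the image update, please contact technical support to confirm which version contains the fix.
- Regarding HTTPS configuration for your company registry, this has to be verified in your specific environment.
- Regarding internet access, model deployment may indeed pull base images from the public internet. Which services need outbound access requires a detailed technical assessment.
You can visit https://opencsg.com/docs for more documentation. The technical support team will step in to help with your deployment problem.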
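Replies to the points above:
1. The task_id is as follows:
$ kubectl -n spaces get ksvc f4s4smpr56o0 -o yaml|grep task
task_id: "14"
csghub_deploy_task_id: "14"
2 and 3. The Loki startup failure: another colleague of yours has already replied that it was a script problem and is fixed, so I won't analyze it further here.
4. I checked Knative and Argo with the commands from your deployment docs and their status is normal. As for details a and b, kubectl -n spaces get po shows no pods at all, even when run immediately after deploying, so there is nothing further to report.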
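- You can switch the image directly to v1.12.1-ce.
- For a trusted registry, set insecure: false.
- When a model is deployed, the image of the serving framework (for example, the llama.cpp image) is pulled, and this requires internet access.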
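OK, I understand the first two items. On the third: does pulling the llama.cpp image from the internet happen in the csghub service or in the runner service?
The runner service.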
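@HaiHui886 @MasonXon After I upgraded the docker-compose image version and gave the runner service internet access, Loki runs without errors, and I confirmed that the runner can reach the internet from inside. But a model instance still cannot be deployed (the ksvc exists but READY is False, and there are no pods), and the log tab on the deployment page is still empty. Where might the problem be? Do I need to clean up the leftover files from 0.12.0-ce and redeploy from scratch for this to work?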
$ kubectl -n spaces get ksvc
NAME           URL                                       LATESTCREATED        LATESTREADY   READY   REASON
f565cny9iio0   http://f565cny9iio0.spaces.app.internal   f565cny9iio0-00001                 False   RevisionMissing
$ kubectl -n spaces get pods
No resources found in spaces namespace.
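When a ksvc stays at READY False, its status conditions usually say more than the one-word REASON column. The following Python sketch lists every condition that is not True from the JSON form of the object (as produced by kubectl with -o json). It is a sketch under assumptions: `failing_conditions` is a made-up helper name, and the sample status block is hypothetical except for the Ready/RevisionMissing pair, which comes from the kubectl output above.

```python
def failing_conditions(ksvc):
    """Return (type, status, reason, message) for each condition that is not True."""
    conditions = ksvc.get("status", {}).get("conditions", [])
    return [
        (c.get("type"), c.get("status"), c.get("reason", ""), c.get("message", ""))
        for c in conditions
        if c.get("status") != "True"
    ]


# Hypothetical status block; only Ready=False with reason RevisionMissing is
# taken from the kubectl output above, the other two conditions are made up.
sample = {
    "metadata": {"name": "f565cny9iio0", "namespace": "spaces"},
    "status": {
        "conditions": [
            {"type": "ConfigurationsReady", "status": "False", "reason": "RevisionMissing"},
            {"type": "Ready", "status": "False", "reason": "RevisionMissing"},
            {"type": "RoutesReady", "status": "True"},
        ]
    },
}

for ctype, status, reason, message in failing_conditions(sample):
    print(ctype, status, reason, message)
```

On a real cluster the input would be the parsed output of kubectl -n spaces get ksvc f565cny9iio0 -o json. Since the reason is RevisionMissing, describing the revision f565cny9iio0-00001 in the spaces namespace may also show why no pod was created.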
The server log is as follows:
2025-11-25_08:10:31.63897 {"time":"2025-11-25T08:10:31.638896794Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"POST","latency(ms)":29,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/models/root/Qwen2.5-0.5B-Instruct/run","full_path":"/api/v1/models/:namespace/:name/run","trace_id":"cf05a98c-3688-4eaa-8043-eecb9d15569f"}
2025-11-25_08:10:31.66010 {"time":"2025-11-25T08:10:31.659993253Z","level":"INFO","msg":"http request","ip":"172.23.0.1","method":"GET","latency(ms)":3,"status":200,"current_user":"","auth_type":"ApiKey","url":"/api/v1/user/63b7c9d1-b177-4116-852a-3ecd9467372d?type=uuid","full_path":"/api/v1/user/:username","trace_id":"e0710a2a-2061-4243-8cb4-3096f3766711"}
2025-11-25_08:10:31.77389 {"time":"2025-11-25T08:10:31.773802094Z","level":"INFO","msg":"http request","ip":"172.23.0.1","method":"POST","latency(ms)":2,"status":200,"current_user":"","auth_type":"ApiKey","url":"/api/v1/webhook/runner","full_path":"/api/v1/webhook/runner","trace_id":"a72590bbe3d646098d44b9f1214f6c2f"}
2025-11-25_08:10:31.77400 {"time":"2025-11-25T08:10:31.773811154Z","level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.create","event_time":1764058231,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"service_name":"f565cny9iio0","status":20,"endpoint":"","message":"","reason":"create","task_id":20}}}
2025-11-25_08:10:31.78152 {"time":"2025-11-25T08:10:31.781462747Z","level":"INFO","msg":"http request","ip":"172.23.0.1","method":"POST","latency(ms)":0,"status":200,"current_user":"","auth_type":"ApiKey","url":"/api/v1/webhook/runner","full_path":"/api/v1/webhook/runner","trace_id":"3d85dc67fe7d467495678b6fe35bbb9c"}
2025-11-25_08:10:31.78403 {"time":"2025-11-25T08:10:31.783973722Z","level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.create","event_time":1764058231,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"service_name":"f565cny9iio0","status":20,"endpoint":"","message":"","reason":"create","task_id":20}}}
2025-11-25_08:10:31.97747 {"time":"2025-11-25T08:10:31.977404819Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":1,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/notifications/message-types","full_path":"/api/v1/notifications/message-types","trace_id":"95997c87-ab0d-4a8d-a32e-7ed780f635f6"}
2025-11-25_08:10:31.97777 {"time":"2025-11-25T08:10:31.977733529Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":1,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/notifications/count","full_path":"/api/v1/notifications/count","trace_id":"218212b2-dbbc-4539-94b1-8b54a6b95f38"}
2025-11-25_08:10:31.98178 {"time":"2025-11-25T08:10:31.981713979Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":5,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/tags","full_path":"/api/v1/tags","trace_id":"3a3c7963-8358-4985-b253-d5a3104e5fb6"}
2025-11-25_08:10:31.98319 {"time":"2025-11-25T08:10:31.983152166Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":7,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/broadcasts/active","full_path":"/api/v1/broadcasts/active","trace_id":"ef8e2754-f22a-49c5-a942-89660811fc2c"}
2025-11-25_08:10:31.98378 {"time":"2025-11-25T08:10:31.983736708Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":0,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/version","full_path":"/api/v1/version","trace_id":"f1ebf74e-6876-45e0-8e3c-5b110518bd4d"}
2025-11-25_08:10:31.98575 {"time":"2025-11-25T08:10:31.985703183Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":2,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/notifications/poll/1?timezone=Asia/Shanghai","full_path":"/api/v1/notifications/poll/:limit","trace_id":"e6bb898b-b35f-4b5e-accb-0b772982848e"}
2025-11-25_08:10:31.99119 {"time":"2025-11-25T08:10:31.991147059Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":15,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/models/root/Qwen2.5-0.5B-Instruct/run/7","full_path":"/api/v1/models/:namespace/:name/run/:id","trace_id":"29b61715-628e-4977-b8a3-e13ac5f93c2a"}
2025-11-25_08:10:31.99227 {"time":"2025-11-25T08:10:31.992247506Z","level":"INFO","msg":"Get space resources successfully"}
2025-11-25_08:10:31.99234 {"time":"2025-11-25T08:10:31.992316037Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":16,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/space_resources?cluster_id=","full_path":"/api/v1/space_resources","trace_id":"cd3ce2cd-238c-4a99-8f26-cd32ea9af0f6"}
2025-11-25_08:10:32.21389 {"time":"2025-11-25T08:10:32.213832964Z","level":"INFO","msg":"http request","trace_id":"fe93f9aa15ee45c885aedce5acd3a1da","method":"GET","url":"http://127.0.0.1:8088/api/v1/namespace/root","status":200,"latency(ms)":2}
2025-11-25_08:10:32.21756 {"time":"2025-11-25T08:10:32.217504522Z","level":"INFO","msg":"Get model succeed","model":"Qwen2.5-0.5B-Instruct"}
2025-11-25_08:10:32.21788 {"time":"2025-11-25T08:10:32.21782988Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":13,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/models/root/Qwen2.5-0.5B-Instruct","full_path":"/api/v1/models/:namespace/:name","trace_id":"58c69e18-e15a-4c1e-afd8-76b67f547c2a"}
2025-11-25_08:10:41.89728 {"time":"2025-11-25T08:10:41.897206022Z","level":"INFO","msg":"http request","ip":"172.23.0.1","method":"POST","latency(ms)":2,"status":200,"current_user":"","auth_type":"ApiKey","url":"/api/v1/webhook/runner","full_path":"/api/v1/webhook/runner","trace_id":"c30b45b28fff4d1ea18f7376eb43bb57"}
2025-11-25_08:10:41.89733 {"time":"2025-11-25T08:10:41.897252304Z","level":"INFO","msg":"deploy_event_received","event":{"event_type":"runner.service.change","event_time":1764058241,"cluster_id":"f90e1c6c-799d-4c24-a342-ee900e0b4950","runner_name":"","data_type":"object","data":{"service_name":"f565cny9iio0","status":21,"endpoint":"http://f565cny9iio0.spaces.app.internal","message":"","reason":"","task_id":20}}}
2025-11-25_08:10:42.23467 {"time":"2025-11-25T08:10:42.234589669Z","level":"INFO","msg":"http request","ip":"10.11.9.137","method":"GET","latency(ms)":7,"status":200,"current_user":"root","auth_type":"JWT","url":"/api/v1/models/root/Qwen2.5-0.5B-Instruct/run/7","full_path":"/api/v1/models/:namespace/:name/run/:id","trace_id":"3bad2f39-0887-44e5-a87c-141ea9aa4e8b"}
The runner log is below; it contains two error entries:
{"time":"2025-11-25T08:10:31.633660906Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":11,"status":200,"current_user":"","auth_type":"","url":"/api/v1/cluster/f90e1c6c-799d-4c24-a342-ee900e0b4950","full_path":"/api/v1/cluster/:id","trace_id":"13219eae458e483eef23e1ec7313d67d"}
W1125 08:10:31.769688 1 warnings.go:70] Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
{"time":"2025-11-25T08:10:31.77125013Z","level":"INFO","msg":"service created successfully","svc_name":"f565cny9iio0","deploy_id":7}
{"time":"2025-11-25T08:10:31.771323124Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"POST","latency(ms)":26,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f565cny9iio0/run","full_path":"/api/v1/service/:service/run","trace_id":"5c37b9145670e9b35662b6bb141f595f"}
{"time":"2025-11-25T08:10:31.774282364Z","level":"INFO","msg":"http request","trace_id":"a72590bbe3d646098d44b9f1214f6c2f","method":"POST","url":"http://10.1.110.47/api/v1/webhook/runner","status":200,"latency(ms)":3}
{"time":"2025-11-25T08:10:31.779940899Z","level":"ERROR","msg":"failed to get deployment by svc name","service":"f565cny9iio0","error":"fail to get deployment list by selector serving.knative.dev/service=f565cny9iio0, error: %!w(<nil>)"}
{"time":"2025-11-25T08:10:31.781755806Z","level":"INFO","msg":"http request","trace_id":"3d85dc67fe7d467495678b6fe35bbb9c","method":"POST","url":"http://10.1.110.47/api/v1/webhook/runner","status":200,"latency(ms)":1}
{"time":"2025-11-25T08:10:31.987541256Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":0,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f565cny9iio0/replica","full_path":"/api/v1/service/:service/replica","trace_id":"d0d87a92911fa55eaa096ffa706f1815"}
{"time":"2025-11-25T08:10:31.990281252Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":0,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f565cny9iio0/get","full_path":"/api/v1/service/:service/get","trace_id":"d2d21376197006dc12b2a571e215bad6"}
{"time":"2025-11-25T08:10:31.991808602Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":12,"status":200,"current_user":"","auth_type":"","url":"/api/v1/cluster/f90e1c6c-799d-4c24-a342-ee900e0b4950","full_path":"/api/v1/cluster/:id","trace_id":"137686d6b6fa9686b2e97e9ddf556747"}
{"time":"2025-11-25T08:10:37.21328223Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":1,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f565cny9iio0/get","full_path":"/api/v1/service/:service/get","trace_id":"cd9a5f5e5dfb73e20282a41c29f793ed"}
{"time":"2025-11-25T08:10:41.892840217Z","level":"ERROR","msg":"failed to get deployment ","service":"f565cny9iio0","error":"fail to get deployment list by selector serving.knative.dev/service=f565cny9iio0, error: %!w(<nil>)"}
{"time":"2025-11-25T08:10:41.897728336Z","level":"INFO","msg":"http request","trace_id":"c30b45b28fff4d1ea18f7376eb43bb57","method":"POST","url":"http://10.1.110.47/api/v1/webhook/runner","status":200,"latency(ms)":4}
{"time":"2025-11-25T08:10:42.218587265Z","level":"INFO","msg":"http request","ip":"10.42.0.1","method":"GET","latency(ms)":0,"status":200,"current_user":"","auth_type":"","url":"/api/v1/service/f565cny9iio0/get","full_path":"/api/v1/service/:service/get","trace_id":"37fcc0e3ee09156514d6cfc5dbec109d"}
Verified that the runner can reach the external network:
$ kubectl exec -it runner-runner-684c9c546c-4mlgp -n csghub bash -- curl https://www.baidu.com
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>使用百度前必读</a> <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a> 京ICP证030173号 <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
[appops@linwps-test01 csghub]$ sudo crictl logs $(sudo crictl ps | grep runner | awk 'NR==1 {print $1}')
I also found that the gitlab_shell service is in an abnormal state and keeps logging errors:
root@d21dd760b643:/var/log/csghub/gitlab_shell# csghub-ctl status
run: accounting: (pid 2090) 84599s; run: log: (pid 1501) 84601s
run: casdoor: (pid 1998) 84599s; run: log: (pid 1669) 84601s
run: dataviewer: (pid 2827) 84598s; run: log: (pid 1495) 84601s
run: gitaly: (pid 1672) 84601s; run: log: (pid 1668) 84601s
down: gitlab_shell: 1s, normally up, want up; run: log: (pid 1498) 84601
The error log is as follows:
root@d21dd760b643:/var/log/csghub/gitlab_shell# csghub-ctl tail gitlab_shell
2025-11-25_08:19:00.04622 Using existing Host Keys
2025-11-25_08:19:00.07087 {"error":"open /var/log/csghub/gitlab_shell/gitlab-shell.log: permission denied","level":"warning","log_file":"stdout","msg":"Unable to configure logging, falling back to STDOUT","time":"2025-11-25T08:19:00Z"}
2025-11-25_08:19:00.07137 {"error":"open /var/opt/csghub/gitlab_shell/ssh/ssh_host_rsa_key: permission denied","filename":"/var/opt/csghub/gitlab_shell/ssh/ssh_host_rsa_key","level":"error","msg":"Failed to read host key","time":"2025-11-25T08:19:00Z"}
2025-11-25_08:19:00.07140 {"error":"open /var/opt/csghub/gitlab_shell/ssh/ssh_host_ecdsa_key: permission denied","filename":"/var/opt/csghub/gitlab_shell/ssh/ssh_host_ecdsa_key","level":"error","msg":"Failed to read host key","time":"2025-11-25T08:19:00Z"}
2025-11-25_08:19:00.07147 {"error":"open /var/opt/csghub/gitlab_shell/ssh/ssh_host_ed25519_key: permission denied","filename":"/var/opt/csghub/gitlab_shell/ssh/ssh_host_ed25519_key","level":"error","msg":"Failed to read host key","time":"2025-11-25T08:19:00Z"}
2025-11-25_08:19:00.07152 {"error":"no host keys could be loaded, aborting","level":"fatal","msg":"Failed to start GitLab built-in sshd","time":"2025-11-25T08:19:00Z"}
The contents of the /var/log/csghub/gitlab_shell directory are as follows:
root@d21dd760b643:/var/log/csghub/gitlab_shell# ll -a
total 9824
drwxr-xr-x 2 root root 4096 Nov 25 08:23 ./
drwxr-xr-x 25 root root 4096 Nov 19 07:31 ../
-rwxr--r-- 1 root root 999190 Nov 25 06:17 '@40000000692549ff1d198fe4.s'*
-rwxr--r-- 1 root root 999190 Nov 25 06:31 '@4000000069254d4807eb0bd4.s'*
-rwxr--r-- 1 root root 999190 Nov 25 06:45 '@40000000692550902321bc7c.s'*
-rwxr--r-- 1 root root 999190 Nov 25 06:59 '@40000000692553d8270f90ac.s'*
-rwxr--r-- 1 root root 999190 Nov 25 07:13 '@40000000692557211e4a1104.s'*
-rwxr--r-- 1 root root 999190 Nov 25 07:27 '@4000000069255a67355c166c.s'*
-rwxr--r-- 1 root root 999190 Nov 25 07:41 '@4000000069255db111ac4cb4.s'*
-rwxr--r-- 1 root root 999190 Nov 25 07:55 '@40000000692560fb369113fc.s'*
-rwxr--r-- 1 root root 999190 Nov 25 08:09 '@400000006925644630e130f4.s'*
-rwxr--r-- 1 root root 999190 Nov 25 08:23 '@400000006925678f0a6b8c94.s'*
-rw-r--r-- 1 root root 17332 Nov 25 08:23 current
-rw------- 1 root root 0 Nov 19 07:31 lock
Based on the retrieved content, the issue appears to be related to multiple problems:
- GitLab Shell Service Failure: The gitlab_shell service is down due to permission denied errors when trying to access host key files and log files. The logs show:
  - "open /var/opt/csghub/gitlab_shell/ssh/ssh_host_rsa_key: permission denied"
  - "open /var/log/csghub/gitlab_shell/gitlab-shell.log: permission denied"
  - "no host keys could be loaded, aborting"
- Runner Service Issues: The runner logs show errors "failed to get deployment by svc name" and "failed to get deployment" with service "f565cny9iio0", indicating the Knative service deployment is not being created properly.
- Kubernetes Resources Missing: The kubectl -n spaces get pods command shows no resources found, and the ksvc shows "RevisionMissing" as the reason for not being ready.
The gitlab_shell service permission issues are likely preventing proper SSH functionality, which may be affecting the overall deployment process. The runner's inability to get deployment information suggests the Knative components may not be functioning correctly.
You can find more information at https://opencsg.com/docs and technical support will get involved.
- This runner log line shows that the Knative service was created successfully: {"time":"2025-11-25T08:10:31.77125013Z","level":"INFO","msg":"service created successfully","svc_name":"f565cny9iio0","deploy_id":7}
- Next, run kubectl -n spaces get ksvc -o yaml and check whether the ksvc status reports any errors.
- Normally, once the ksvc has been created, kubectl -n spaces get po should show the pod being created.
- Check that all Knative component pods are running normally. Run kubectl get po --all-namespaces to list every pod in the cluster and see whether any have failed.
2. The status section of kubectl -n spaces get ksvc -o yaml is shown below. The host itself can currently pull docker.io/opencsghq/vllm-cpu:2.4 without problems. Does this error mean that the image is fetched from inside Knative or by the k3s cluster itself, and that external network access from one of them is broken?
status:
  conditions:
  - lastTransitionTime: "2025-11-25T08:10:41Z"
    message: 'Revision "f565cny9iio0-00001" failed with message: Unable to fetch image "docker.io/opencsghq/vllm-cpu:2.4": failed to resolve image to digest: Get "https://index.docker.io/v2/": context deadline exceeded.'
    reason: RevisionFailed
    status: "False"
    type: ConfigurationsReady
  - lastTransitionTime: "2025-11-25T08:10:41Z"
    message: Configuration "f565cny9iio0" does not have any ready Revision.
    reason: RevisionMissing
    status: "False"
    type: Ready
  - lastTransitionTime: "2025-11-25T08:10:41Z"
    message: Configuration "f565cny9iio0" does not have any ready Revision.
    reason: RevisionMissing
    status: "False"
    type: RoutesReady
  latestCreatedRevisionName: f565cny9iio0-00001
  observedGeneration: 1
  url: http://f565cny9iio0.spaces.app.internal
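The conditions in a status block like the one above can be summarized programmatically. This is a minimal sketch, assuming the status has already been loaded into a dict (for example from kubectl -n spaces get ksvc -o json via json.load). The helper name `failing_conditions` is hypothetical, and the sample data is copied from the status above.

```python
def failing_conditions(status):
    """Return (type, reason, message) for every condition whose status is not "True"."""
    failing = []
    for cond in status.get("conditions", []):
        if cond.get("status") != "True":
            failing.append(
                (cond.get("type"), cond.get("reason", ""), cond.get("message", ""))
            )
    return failing


status = {
    "conditions": [
        {
            "type": "ConfigurationsReady",
            "status": "False",
            "reason": "RevisionFailed",
            "message": 'Revision "f565cny9iio0-00001" failed with message: '
                       'Unable to fetch image "docker.io/opencsghq/vllm-cpu:2.4": '
                       "failed to resolve image to digest: "
                       'Get "https://index.docker.io/v2/": context deadline exceeded.',
        },
        {
            "type": "Ready",
            "status": "False",
            "reason": "RevisionMissing",
            "message": 'Configuration "f565cny9iio0" does not have any ready Revision.',
        },
        {
            "type": "RoutesReady",
            "status": "False",
            "reason": "RevisionMissing",
            "message": 'Configuration "f565cny9iio0" does not have any ready Revision.',
        },
    ],
    "latestCreatedRevisionName": "f565cny9iio0-00001",
}

for ctype, reason, message in failing_conditions(status):
    print(f"{ctype} [{reason}]: {message}")
```

In this case the ConfigurationsReady condition carries the underlying cause (the image could not be resolved to a digest). The two RevisionMissing conditions are consequences of it.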
4. There are no failed pods:
$ kubectl get po -A
NAMESPACE NAME READY STATUS RESTARTS AGE
argo argo-server-5c69cb69db-qkmgq 1/1 Running 2 (6d5h ago) 7d23h
argo workflow-controller-ccbd949dc-dv7nh 1/1 Running 2 (6d5h ago) 7d23h
csghub runner-ingress-nginx-controller-79c4fc4f6f-rqh56 1/1 Running 0 4d5h
csghub runner-logcollector-5b4b6b68b8-zgxr6 1/1 Running 0 90m
csghub runner-reloader-55d677d5c5-gzjg6 1/1 Running 0 4d5h
csghub runner-runner-684c9c546c-4mlgp 1/1 Running 0 90m
knative-serving activator-845f9dd49c-mvmqr 1/1 Running 2 (6d5h ago) 7d5h
knative-serving autoscaler-f4bf5f95f-mgmdt 1/1 Running 2 (6d5h ago) 7d5h
knative-serving autoscaler-hpa-9c59465cd-rg42n 1/1 Running 2 (6d5h ago) 7d5h
knative-serving controller-755589886f-qjsgw 1/1 Running 2 (6d5h ago) 7d5h
knative-serving net-kourier-controller-85fdd7d8d9-j9jdt 1/1 Running 2 (6d5h ago) 7d5h
knative-serving webhook-5958896b8-55jmg 1/1 Running 2 (6d5h ago) 7d5h
kourier-system 3scale-kourier-gateway-9f4bdf747-j87lv 1/1 Running 2 (6d5h ago) 7d5h
kube-system coredns-64fd4b4794-cl5b8 1/1 Running 2 (6d5h ago) 7d23h
kube-system local-path-provisioner-774c6665dc-rjtn2 1/1 Running 2 (6d5h ago) 7d23h
kube-system metrics-server-7bfffcd44-5wgm6 1/1 Running 2 (6d5h ago) 7d23h
lws-system lws-controller-manager-7867b688c-czhc8 1/1 Running 2 (6d5h ago) 7d23h
lws-system lws-controller-manager-7867b688c-f5rw8 1/1 Running 2 (6d5h ago) 7d23h
image "docker.io/opencsghq/vllm-cpu:2.4": failed to resolve image to digest:
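Judging from this error line, the image is being fetched from hub.docker.com. It should instead be pulled from our registry, at this address: opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/vllm-cpu:2.4
The STARHUB_SERVER_MODEL_DOCKER_REG_BASE environment variable was probably set incorrectly when your runner was deployed.
If the runner uses the config file config.toml, check this setting in that file: [model] docker_reg_base = "opencsg-registry.cn-beijing.cr.aliyuncs.com"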
@caizhenghao
I checked, and this parameter defaults to the ACR address. Did you change the default configuration?
@HaiHui886 @MasonXon Do you mean the setting below? The value in the server's csghub/data/server/config.toml looks correct:
$ cat csghub/data/server/config.toml |grep docker_reg_base
docker_reg_base = "opencsg-registry.cn-beijing.cr.aliyuncs.com/"
I am currently investigating why the image is not used even though the host has already pulled it and k3s can see it:
$ sudo k3s crictl images
IMAGE TAG IMAGE ID SIZE
docker.io/opencsghq/vllm-cpu 2.4 6d1660c2043e5 1.5GB
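The most plausible explanation an AI assistant has given so far is the following.
Root cause: Knative always contacts the remote registry to resolve the image digest. Even if the image already exists in the local containerd store, Knative still does the following when it creates a Revision:
- contacts the registry (docker.io)
- resolves the image tag (2.4) to a digest
- writes the digest into the Revision
- only then uses the local image at run time
So as long as the machine running k3s cannot reach docker.io, Knative is bound to report an error. This differs from an ordinary Pod: Knative applies a "strict image verification mechanism".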
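So is the cause that the image address should not be docker.io/opencsghq/vllm-cpu:2.4 but opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/vllm-cpu:2.4, and would changing it solve the problem?
@caizhenghao It is not the server configuration. It is the environment variable or config file content that the runner uses at startup.
I see what is going on. This image setting should not be looked for on the csghub docker side. Your environment is docker csghub + the runner helm chart, so the registry parameter that controls where images are pulled from lives in the runner chart. You should set the following parameter: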
--set modelRegistry="opencsg-registry.cn-beijing.cr.aliyuncs.com"
So you should update the chart.
Note: in the next release the parameter above will be renamed as follows:
--set model.registry="opencsg-registry.cn-beijing.cr.aliyuncs.com"
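This setting did change the full image name to opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/vllm-cpu:2.4, but it did not fully solve the problem.
I then ran kubectl -n knative-serving edit configmap config-deployment and changed the setting below, which fixed the image fetch failure. This setting makes Knative skip the step of contacting Docker Hub to resolve the tag to a digest.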
registries-skipping-tag-resolving: docker.io,registry-1.docker.io,index.docker.io,opencsg-registry.cn-beijing.cr.aliyuncs.com,opencsghq,runner.trainpla.local:30080
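The registries-skipping-tag-resolving value is a comma-separated list. The sketch below shows how such a list might be checked against an image reference. It rests on two assumptions not confirmed in this thread: that Knative compares each entry against the registry host of the image, and that docker.io and index.docker.io name the same registry. The function names are hypothetical, and the host parsing follows the usual Docker convention rather than Knative's actual code.

```python
DOCKER_HUB_ALIASES = {"docker.io", "index.docker.io"}


def registry_host(image):
    """Return the registry host part of an image reference.

    Docker convention: the first path component is a registry only if it
    contains '.' or ':' or equals 'localhost'; otherwise the image is
    taken to live on Docker Hub.
    """
    first, sep, _ = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "index.docker.io"


def skips_tag_resolution(image, skip_list):
    """Assumption: an image is skipped when its registry host is in the list."""
    hosts = {h.strip() for h in skip_list.split(",") if h.strip()}
    host = registry_host(image)
    if host in DOCKER_HUB_ALIASES:
        # Treat both Docker Hub names as equivalent (assumption).
        return bool(hosts & DOCKER_HUB_ALIASES)
    return host in hosts


SKIP_LIST = (
    "docker.io,registry-1.docker.io,index.docker.io,"
    "opencsg-registry.cn-beijing.cr.aliyuncs.com,opencsghq,"
    "runner.trainpla.local:30080"
)

for image in (
    "docker.io/opencsghq/vllm-cpu:2.4",
    "opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/vllm-cpu:2.4",
):
    print(image, skips_tag_resolution(image, SKIP_LIST))
```

If that assumption holds, an entry such as opencsghq (an organization name, not a host) would never match anything. The Knative documentation for config-deployment is the place to confirm the exact matching rule.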
After changing the Knative configuration above, the deployment does produce a pod, but the deployment still ends up failing.
The status shown by kubectl -n spaces get ksvc -o yaml for the failed deployment is:
status:
  conditions:
  - lastTransitionTime: "2025-11-26T08:15:55Z"
    status: Unknown
    type: ConfigurationsReady
  - lastTransitionTime: "2025-11-26T08:15:55Z"
    message: Configuration "f59qadcbv3sw" is waiting for a Revision to become ready.
    reason: RevisionMissing
    status: Unknown
    type: Ready
  - lastTransitionTime: "2025-11-26T08:15:55Z"
    message: Configuration "f59qadcbv3sw" is waiting for a Revision to become ready.
    reason: RevisionMissing
    status: Unknown
    type: RoutesReady
  latestCreatedRevisionName: f59qadcbv3sw-00001
  observedGeneration: 1
  url: http://f59qadcbv3sw.spaces.app.internal
The pods look like this:
$ kubectl -n spaces get pods
NAME READY STATUS RESTARTS AGE
f59qadcbv3sw-00001-deployment-5ff7fd68d7-cplvf 1/2 Running 3 (115s ago) 15m
I then used the k3s tooling to inspect the containers on the node:
$ sudo crictl ps |grep f59qadcbv3sw-00001
92638e5984f16 6d1660c2043e5 3 minutes ago Running user-container 2 a685e4fe157da f59qadcbv3sw-00001-deployment-5ff7fd68d7-cplvf spaces
fbe424c9f392e a0fe87f09f1f1 12 minutes ago Running queue-proxy 0 a685e4fe157da f59qadcbv3sw-00001-deployment-5ff7fd68d7-cplvf spaces
The container log printed with the k3s tooling is as follows:
$ sudo crictl logs 92638e5984f16
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 198, in _new_conn
sock = connection.create_connection(
File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 85, in create_connection
raise err
File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 491, in _make_request
raise new_e
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 467, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 1099, in _validate_conn
conn.connect()
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 616, in connect
self.sock = sock = self._new_conn()
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 207, in _new_conn
raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7f0c601df7c0>, 'Connection to hub.opencsg.com timed out. (connect timeout=None)')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 847, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py", line 515, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='hub.opencsg.com', port=443): Max retries exceeded with url: /hf/api/models/root/Qwen2.5-0.5B-Instruct/revision/main (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f0c601df7c0>, 'Connection to hub.opencsg.com timed out. (connect timeout=None)'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/etc/csghub/entry.py", line 10, in <module>
snapshot_download(REPO_ID, cache_dir=DOWNLOAD_DIR, endpoint=ENDPOINT, token=TOKEN)
File "/usr/local/lib/python3.10/dist-packages/pycsghub/snapshot_download.py", line 60, in snapshot_download
repo_info = utils.get_repo_info(repo_id,
File "/usr/local/lib/python3.10/dist-packages/pycsghub/utils.py", line 154, in get_repo_info
return method(
File "/usr/local/lib/python3.10/dist-packages/pycsghub/utils.py", line 338, in model_info
r = requests.get(path, headers=headers, timeout=timeout, params=params)
File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 73, in get
return request("get", url, params=params, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 507, in send
raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='hub.opencsg.com', port=443): Max retries exceeded with url: /hf/api/models/root/Qwen2.5-0.5B-Instruct/revision/main (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f0c601df7c0>, 'Connection to hub.opencsg.com timed out. (connect timeout=None)'))
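My understanding is that the container is trying to reach hub.opencsg.com to pull the model files. Why does it do that instead of using the local hub? Is this related to the problem I mentioned earlier ("the gitlab_shell service is in an abnormal state and keeps logging errors")?
How should I troubleshoot this further?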
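Check the server configuration:
- If you use environment variables, the setting is STARHUB_SERVER_MODEL_DOWNLOAD_ENDPOINT
- If you use config.toml, it is [model] download_endpoint = "https://hub.opencsg.com". Change this value to the portal address of your actual environment.
- You are using the vllm-cpu image; which model are you deploying?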
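The traceback above reports connect timeout=None, so no connect timeout was set and the container waited until the operating system gave up (Errno 110). The sketch below gives a quicker check of whether a candidate download endpoint is reachable at all. It uses only the Python standard library and an explicit timeout, and the function name `endpoint_reachable` is made up for illustration. To be meaningful it has to run from the same network as the model container, for example inside a pod in the spaces namespace.

```python
import socket
from urllib.parse import urlsplit


def endpoint_reachable(url, timeout=5.0):
    """Return True if a TCP connection to the URL's host and port succeeds.

    This only tests network reachability; it does not check TLS or
    whether the HTTP API behind the endpoint answers correctly.
    """
    parts = urlsplit(url)
    if not parts.hostname:
        return False
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures.
        return False
```

For example, calling endpoint_reachable("https://hub.opencsg.com") and endpoint_reachable("http://10.1.110.47") from inside the cluster would show which of the two addresses seen in this thread the model container can actually reach.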
This is an incorrect default configuration and has already been fixed. As a temporary workaround, set the following when starting omnibus-csghub:
server:
download_endpoint: "<external_url>"
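Based on the retrieved content, no specific information was found about the "default configuration error" you mentioned or about changing the server.download_endpoint setting.
You can visit https://opencsg.com/docs for more documentation, and technical support will get involved.
@caizhenghao The docker edition only recently resumed updates, so this was probably missed in testing.
@MasonXon Following your suggestion, I added the download_endpoint setting to docker-compose.yml: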
$ cat docker-compose.yml
version: '3.3' # adding the version field is recommended, to make the version in use explicit
services:
csghub:
container_name: csghub
image: opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/omnibus-csghub:v1.12.1-ce
extra_hosts:
- "runner.trainpla.local:10.1.110.47"
privileged: true
environment:
CSGHUB_OMNIBUS_CONFIG: |
csghub:
external_url: "http://10.1.110.47"
server:
download_endpoint: "http://10.1.110.47"
However, after docker-compose up -d, the setting in csghub/data/server/config.toml is still wrong:
[model]
deploy_timeout_in_min = 60
download_endpoint = "https://hub.opencsg.com"
Deploying a model in this state produces the error below, so the address problem is still there:
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='hub.opencsg.com', port=443): Max retries exceeded with url: /hf/api/models/root/Qwen2.5-0.5B-Instruct/revision/main (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7faa1a1ab7c0>, 'Connection to hub.opencsg.com timed out. (connect timeout=None)'))
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
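If I edit config.toml directly, the value is overwritten back to https://hub.opencsg.com after docker-compose restart or csghub-ctl reconfigure. At this point I cannot work out what is needed for the setting to take effect.
@HaiHui886 Regarding "You are using the vllm-cpu image; which model are you deploying?": I am mainly trying to get the deployment flow working end to end as an initial evaluation, so I am running a 0.5B Qwen model. GPU resources are tight at my company, and a batch of domestic cards will not be purchased until next year.
@caizhenghao Try adding a global env var STARHUB_SERVER_MODEL_DOWNLOAD_ENDPOINT to the container.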