chaosblade-operator

<POD NETWORK DELAY> : cmd exec failed, err: RTNETLINK answers: No such file or directory exit status 2

Open ddd1123 opened this issue 2 years ago • 16 comments

Issue Description

Type: bug report

Describe what happened (or what feature you want)

  1. In chaosblade-box, I obtained the K8s cluster information through the agent and ran a POD NETWORK DELAY experiment;
  2. It failed with: Reason: /opt/chaosblade/bin/nsexec -t 77143 -p -n -- /bin/sh -c tc qdisc add dev eth0 root netem delay 100ms 10ms: cmd exec failed, err: RTNETLINK answers: No such file or directory exit status 2
  3. I tried running /opt/chaosblade/bin/nsexec on the corresponding Node host, which also fails with -bash: /opt/chaosblade/bin/nsexec: No such file or directory

Describe what you expected to happen

Please suggest a fix or a direction for troubleshooting, thanks!

How to reproduce it (as minimally and precisely as possible)

Tell us your environment

K8s:v1.18.18 chaosblade-box:v1.0.1 chaos-agent:v1.0.0 chaos-operator:v1.6.0 chaos-tool:v1.6.0

Anything else we need to know?

Machine execution info: {

"response": {

"code": 54000,
"error": "unexpected status, expected status: `create`, but the real status: `Error`, please wait!",
"result": {
  "error": "`/opt/chaosblade/bin/nsexec -t 77143 -p -n -- /bin/sh -c tc qdisc add dev eth0 root netem delay 100ms 10ms`: cmd exec failed, err: RTNETLINK answers: No such file or directory\n exit status 2",
  "statuses": [
    {
      "error": "`/opt/chaosblade/bin/nsexec -t 77143 -p -n -- /bin/sh -c tc qdisc add dev eth0 root netem delay 100ms 10ms`: cmd exec failed, err: RTNETLINK answers: No such file or directory\n exit status 2",
      "kind": "pod",
      "state": "Error",
      "success": false
    }
  ],
  "success": false,
  "uid": "2c29793a6c8da651"
},
"success": false

} }

ddd1123 avatar Aug 10 '22 02:08 ddd1123

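However, I also ran a pod-process-kill experiment:

  1. I entered the wrong signal, so it failed with: Reason: /opt/chaosblade/bin/nsexec -t 5165 -p -m -- /bin/sh -c kill -128 128: cmd exec failed, err: /bin/sh: line 0: kill: 128: invalid signal specification exit status 1. I noticed that this also goes through /opt/chaosblade/bin/nsexec.
  2. After I corrected the signal, the pod-process-kill experiment succeeded.

So I suspect the problem above is not the error I get from running /opt/chaosblade/bin/nsexec on the node host. Where is the real problem, then?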

ddd1123 avatar Aug 10 '22 03:08 ddd1123

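Addendum, regarding the pod-network-delay experiment:

  1. Because it kept failing, I entered the target pod and ran tc qdisc add dev eth0 root netem delay 100ms 10ms by hand, which also fails: Error: Specified qdisc not found.
  2. From what I have read, the kernel-modules-extra package may be missing; for various reasons I have not managed to install it yet.
  3. Is that related? If the package is installed successfully, will the problem above go away?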

ddd1123 avatar Aug 10 '22 03:08 ddd1123

Operator log, excerpt from while the experiment was running:

time="2022-08-11T01:41:38Z" level=info msg="experiment identifiers: [{{ docker 7420e6a5bff3 centos-tc centos-tc-demo-88d8ff5f8-vg278 192.168.0.4 centos-tc-demo} /opt/chaosblade/blade create cri network delay --timeout=125 --time=100 --interface=eth0 --offset=10 --container-id 7420e6a5bff3 --container-runtime docker 0 chaosblade-tool-ljpcv chaosblade chaosblade-tool}]" experiment=c6c75e78f1a37877
time="2022-08-11T01:41:38Z" level=info msg="execute identifier: {ContainerObjectMeta:{Id: ContainerRuntime:docker ContainerId:7420e6a5bff3 ContainerName:centos-tc PodName:centos-tc-demo-88d8ff5f8-vg278 NodeName:192.168.0.4 Namespace:centos-tc-demo} Command:/opt/chaosblade/blade create cri network delay --timeout=125 --time=100 --interface=eth0 --offset=10 --container-id 7420e6a5bff3 --container-runtime docker Error: Code:0 ChaosBladePodName:chaosblade-tool-ljpcv ChaosBladeNamespace:chaosblade ChaosBladeContainerName:chaosblade-tool}" experiment=c6c75e78f1a37877
time="2022-08-11T01:41:38Z" level=info msg="Exec command in pod" command="[/opt/chaosblade/blade create cri network delay --timeout=125 --time=100 --interface=eth0 --offset=10 --container-id 7420e6a5bff3 --container-runtime docker]" container=chaosblade-tool podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-11T01:41:38Z" level=info msg="get err message" command="[/opt/chaosblade/blade create cri network delay --timeout=125 --time=100 --interface=eth0 --offset=10 --container-id 7420e6a5bff3 --container-runtime docker]" container=chaosblade-tool err="{"code":63063,"success":false,"error":"/opt/chaosblade/bin/nsexec -t 5165 -p -n -- /bin/sh -c tc qdisc add dev eth0 root netem delay 100ms 10ms: cmd exec failed, err: RTNETLINK answers: No such file or directory\n exit status 2"}" out= podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-11T01:41:38Z" level=error msg="pods/exec: k8s exec failed, err: {"code":63063,"success":false,"error":"/opt/chaosblade/bin/nsexec -t 5165 -p -n -- /bin/sh -c tc qdisc add dev eth0 root netem delay 100ms 10ms: cmd exec failed, err: RTNETLINK answers: No such file or directory\n exit status 2"}\n" location=github.com/chaosblade-io/chaosblade-spec-go/util.Errorf uid=

ddd1123 avatar Aug 11 '22 01:08 ddd1123

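2. kernel-modules-extra

yum install -y kernel-modules-extra installs this package. The problem appears to come from tc, the Linux kernel traffic-control tool, as used inside the pod: the kernel lacks the netem queueing discipline by default, so tc fails with Error: Specified qdisc not found.

Installing the package will not necessarily fix RTNETLINK answers: No such file or directory exit status 2, but try installing kernel-modules-extra first and see whether it resolves the issue.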

Icesource avatar Aug 11 '22 09:08 Icesource
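
To check the diagnosis above on a node, a minimal sketch along these lines may help. It assumes a Linux node with /proc/modules readable and the modinfo utility installed; the function name check_netem is made up for illustration. Pods share the node's kernel, so it should be run on the node, not inside the pod. A kernel with netem compiled in (not built as a module) may be misreported by this check.

```shell
#!/bin/sh
# Report whether the netem qdisc kernel module (sch_netem) is usable on this node.
# check_netem is a hypothetical helper name, not part of chaosblade.
check_netem() {
    if grep -q '^sch_netem ' /proc/modules 2>/dev/null; then
        # The module is currently loaded into the running kernel.
        echo "loaded"
    elif modinfo sch_netem >/dev/null 2>&1; then
        # The module file exists for this kernel but is not loaded yet;
        # `modprobe sch_netem` (as root) would normally load it.
        echo "available"
    else
        # No module found for the running kernel. Per the thread, on CentOS 8
        # it is provided by the kernel-modules-extra package.
        echo "missing"
    fi
}
```

Calling check_netem prints one of loaded, available, or missing. If it prints missing right after installing the package, the reboot mentioned later in the thread is worth trying, since the package may have been installed for a newer kernel than the one currently running.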

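> Addendum, regarding the pod-network-delay experiment:
>
>   1. Because it kept failing, I entered the target pod and ran tc qdisc add dev eth0 root netem delay 100ms 10ms by hand, which also fails: Error: Specified qdisc not found.
>   2. From what I have read, the kernel-modules-extra package may be missing; for various reasons I have not managed to install it yet.
>   3. Is that related? If the package is installed successfully, will the problem above go away?

What OS version are you running, CentOS 8.x? On 8.x this package has to be installed, and the machine must be rebooted afterwards.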

tiny-x avatar Aug 11 '22 09:08 tiny-x

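>   1. kernel-modules-extra
>
> yum install -y kernel-modules-extra installs this package. The problem appears to come from tc, the Linux kernel traffic-control tool, as used inside the pod: the kernel lacks the netem queueing discipline by default, so tc fails with Error: Specified qdisc not found.
>
> Installing the package will not necessarily fix RTNETLINK answers: No such file or directory exit status 2, but try installing kernel-modules-extra first and see whether it resolves the issue.

OK, I will try installing the package again. I did run into problems when I tried before: every yum repository I tried reported that no such package was available. Also, in another issue I saw someone say that from 1.6.x onward the tc inside the pod is no longer used. Is that true?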

ddd1123 avatar Aug 11 '22 09:08 ddd1123

Yes. Please confirm your kernel version and distribution first.

tiny-x avatar Aug 11 '22 09:08 tiny-x

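> Addendum, regarding the pod-network-delay experiment:
>
>   1. Because it kept failing, I entered the target pod and ran tc qdisc add dev eth0 root netem delay 100ms 10ms by hand, which also fails: Error: Specified qdisc not found.
>   2. From what I have read, the kernel-modules-extra package may be missing; for various reasons I have not managed to install it yet.
>   3. Is that related? If the package is installed successfully, will the problem above go away?
>
> What OS version are you running, CentOS 8.x? On 8.x this package has to be installed, and the machine must be rebooted afterwards.

My system is CentOS Linux release 7.6.1810 (Core). I pulled a docker image myself and then installed the tc command with yum -y install iproute.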

ddd1123 avatar Aug 11 '22 09:08 ddd1123

4.18.0-193.el8.x86_64

ddd1123 avatar Aug 11 '22 09:08 ddd1123
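
The two answers above (a CentOS 7.6 release string and an el8 kernel) are consistent with the release string coming from the docker image while the kernel version comes from the host: uname -r always reports the host kernel, which every container on the node shares. A small sketch for pulling the EL major version out of a kernel release string follows; the helper name kernel_el_major is hypothetical, and it assumes the Red Hat style .elN suffix.

```shell
#!/bin/sh
# Print the EL major version embedded in a kernel release string,
# or nothing if the string has no ".elN" component.
# kernel_el_major is an illustrative helper, not an existing tool.
kernel_el_major() {
    printf '%s\n' "$1" | sed -n 's/.*\.el\([0-9][0-9]*\).*/\1/p'
}
```

For example, kernel_el_major "$(uname -r)" can be compared with the release file inside the container (such as /etc/redhat-release) to see whether the image and the host kernel belong to the same major release. Kernel modules such as netem have to match the host kernel, not the image.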

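Current progress:

  1. My system was changed today. It is now CentOS Linux release 8.2.2004 (Core).
  2. I was then able to install the kernel-modules-extra package.
  3. The <POD NETWORK DELAY> experiment succeeded! However, the recovery phase fails with the error below.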

Info: { "response": { "code": 54000, "error": "unexpected status, expected status: destroy, but the real status: Destroying, please wait!", "result": { "error": "pods/exec: k8s exec failed, err: command terminated with exit code 126", "statuses": [ { "error": "pods/exec: k8s exec failed, err: command terminated with exit code 126", "id": "fff09b30e7e8f4a2", "kind": "pod", "state": "Error", "success": false } ], "success": false, "uid": "f9e03fa41bc7c31f" }, "success": false } }

Error: Reason: pods/exec: k8s exec failed, err: command terminated with exit code 126

Troubleshooting: the scenario status does not match, please try again later

Log excerpt:

time="2022-08-17T07:16:20Z" level=info msg="Exec command in pod" command="[/opt/chaosblade/blade status fff09b30e7e8f4a2]" container=chaosblade-tool podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-17T07:16:20Z" level=info msg="get output message" command="[/opt/chaosblade/blade status fff09b30e7e8f4a2]" container=chaosblade-tool err= out="{"code":200,"success":true,"result":{"Uid":"fff09b30e7e8f4a2","Command":"cri","SubCommand":"network delay","Flag":" --offset=5 --container-id=3e1db8dce103 --timeout=125 --container-runtime=docker --time=60 --interface=eth0","Status":"Success","Error":"","CreateTime":"2022-08-17T07:13:45.153329761Z","UpdateTime":"2022-08-17T07:13:45.181644002Z"}}" podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-17T07:16:20Z" level=error msg="pods/exec: k8s exec failed, err: {"code":200,"success":true,"result":{"Uid":"fff09b30e7e8f4a2","Command":"cri","SubCommand":"network delay","Flag":" --offset=5 --container-id=3e1db8dce103 --timeout=125 --container-runtime=docker --time=60 --interface=eth0","Status":"Success","Error":"","CreateTime":"2022-08-17T07:13:45.153329761Z","UpdateTime":"2022-08-17T07:13:45.181644002Z"}}\n" location=github.com/chaosblade-io/chaosblade-spec-go/util.Errorf uid=

ddd1123 avatar Aug 17 '22 07:08 ddd1123

Operator log for the failed recovery phase:

time="2022-08-17T07:14:08Z" level=info msg="execute identifier: {ContainerObjectMeta:{Id:fff09b30e7e8f4a2 ContainerRuntime:docker ContainerId:3e1db8dce103 ContainerName:centos-tc-done PodName:centos-tc-done-6b584445b9-g5hnw NodeName:192.168.0.4 Namespace:centos-tc-done} Command: --container-label-selector io.kubernetes.pod.name=centos-tc-done-6b584445b9-g5hnw,io.kubernetes.pod.namespace=centos-tc-done,io.kubernetes.docker.type=podsandbox --container-runtime docker Error: Code:0 ChaosBladePodName:chaosblade-tool-ljpcv ChaosBladeNamespace:chaosblade ChaosBladeContainerName:chaosblade-tool}" experiment=f9e03fa41bc7c31f
time="2022-08-17T07:14:08Z" level=info msg="Exec command in pod" command="[ --container-label-selector io.kubernetes.pod.name=centos-tc-done-6b584445b9-g5hnw,io.kubernetes.pod.namespace=centos-tc-done,io.kubernetes.docker.type=podsandbox --container-runtime docker]" container=chaosblade-tool podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-17T07:14:08Z" level=error msg="Invoke exec command error" command="[ --container-label-selector io.kubernetes.pod.name=centos-tc-done-6b584445b9-g5hnw,io.kubernetes.pod.namespace=centos-tc-done,io.kubernetes.docker.type=podsandbox --container-runtime docker]" container=chaosblade-tool err= error="command terminated with exit code 126" out="OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: "": executable file not found in $PATH: unknown" podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-17T07:14:08Z" level=error msg="pods/exec: k8s exec failed, err: command terminated with exit code 126" location=github.com/chaosblade-io/chaosblade-spec-go/util.Errorf uid=fff09b30e7e8f4a2

ddd1123 avatar Aug 17 '22 07:08 ddd1123

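This may be an anomaly caused by the polling interval configured on the platform side being too short. In practice the experiment is destroyed normally shortly afterwards; you can tell whether it was destroyed by observing the actual behavior.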

Icesource avatar Aug 18 '22 08:08 Icesource

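> This may be an anomaly caused by the polling interval configured on the platform side being too short. In practice the experiment is destroyed normally shortly afterwards; you can tell whether it was destroyed by observing the actual behavior.

Thanks for the reply. I tried several times and observed the behavior; the experiment was never destroyed. tc qdisc show still lists the rule that was added by tc qdisc add ...

Judging by the error "error": "pods/exec: k8s exec failed, err: command terminated with exit code 126", I suspect that recovery never actually got into the target pod. Fault injection is able to enter the pod, so it is odd that recovery cannot.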

ddd1123 avatar Aug 23 '22 09:08 ddd1123
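
The manual check described above (looking at tc qdisc show to see whether the injected rule is still there) can be sketched as a small filter. This is only an illustration: netem_attached is a made-up helper name, and it assumes the output format of tc qdisc show dev <interface>, where each attached qdisc is printed on a line beginning with "qdisc <kind>".

```shell
#!/bin/sh
# Read `tc qdisc show dev <interface>` output on stdin and succeed (exit 0)
# if a netem qdisc is still attached, fail (exit 1) otherwise.
# netem_attached is a hypothetical helper, not part of chaosblade or iproute2.
netem_attached() {
    grep -q '^qdisc netem '
}

# Possible use inside the target pod's network namespace (needs tc and
# sufficient privileges), left commented out so nothing is changed by default:
#   if tc qdisc show dev eth0 | netem_attached; then
#       echo "netem rule still present on eth0"
#       # tc qdisc del dev eth0 root   # manual cleanup of the leftover rule
#   fi
```

The commented tc qdisc del dev eth0 root line is the same command the tool itself issues on destroy, according to the error message reported later in the thread; running it by hand is a workaround, not a fix for the failing recovery.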

Could you check the chaosblade execution log inside the pod? The logs are usually under /opt/chaosblade.

Icesource avatar Aug 23 '22 09:08 Icesource

> Could you check the chaosblade execution log inside the pod? The logs are usually under /opt/chaosblade.

Thanks for the reply.

The log is below; the 10:43 entries are the successful execution and the 10:45 entries are the recovery:

time="2022-08-23 10:43:28.128243385 UTC" level=info msg="create uid: 72e94f5b8b62644b, target: network, scope: pod, action: delay"
time="2022-08-23 10:43:28.142013125 UTC" level=error msg="chaosblade result: []" location=github.com/chaosblade-io/chaosblade/exec/kubernetes.QueryStatus uid=72e94f5b8b62644b
time="2022-08-23 10:45:39.431496547 UTC" level=info msg="destroy by 72e94f5b8b62644b uid, force-remove: false, target: "
time="2022-08-23 10:45:39.65012422 UTC" level=error msg="unexpected status, expected status: destroyed, but the real status: Running, please wait!" location=github.com/chaosblade-io/chaosblade/exec/kubernetes.QueryStatus uid=72e94f5b8b62644b
time="2022-08-23 10:45:43.434151464 UTC" level=error msg="chaosblade result: [{pod network delay false Success see resStatus for the error details [{fd98461695b31b38 Error 0 pods/exec: k8s exec failed, err: command terminated with exit code 126 false pod centos-tc/192.168.0.3/centos-tc-5bc68ff56f-f46fl/centos-tc-done/46d20d1c607c/docker}]}]" location=github.com/chaosblade-io/chaosblade/exec/kubernetes.QueryStatus uid=72e94f5b8b62644b

ddd1123 avatar Aug 23 '22 10:08 ddd1123

During a pod network delay experiment, destroying the experiment failed: `/opt/chaosblade/bin/nsexec -t 11077 -p -n -- /bin/sh -c tc qdisc del dev eth0 root`: cmd exec failed, err: RTNETLINK answers: No such file or directory exit status 2

Running /opt/chaosblade/bin/nsexec -t 11077 -p -n -- /bin/sh -c tc qdisc del dev eth0 root in the chaosblade-tool container on the corresponding node gives the same error. Wrapping the command to be executed in quotes makes it work, like this: /opt/chaosblade/bin/nsexec -t 11077 -p -n -- "/bin/sh -c tc qdisc del dev eth0 root"

Could it be that the experiment tool's exec module builds the command in the wrong format?

zshmmm avatar Mar 01 '23 02:03 zshmmm
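
On the quoting question above: POSIX sh treats only the first operand after -c as the command string, and any further operands become $0, $1, and so on. The sketch below shows that effect with a harmless echo in place of tc. Whether chaosblade's exec module really splits its arguments this way cannot be told from the logs, which print the command as one joined string, so treat this as an illustration of shell behavior and not as a diagnosis.

```shell
#!/bin/sh
# Unquoted: sh receives "echo" as the whole command string;
# "hello" and "world" become $0 and $1 and are never printed.
unquoted=$(/bin/sh -c echo hello world)

# Quoted: the entire string "echo hello world" is the command.
quoted=$(/bin/sh -c "echo hello world")

printf 'unquoted=[%s]\n' "$unquoted"
printf 'quoted=[%s]\n' "$quoted"
```

By the same rule, an argument list split as sh, -c, tc, qdisc, del, ... would make the shell run tc with no arguments, whereas passing "tc qdisc del dev eth0 root" as a single string runs the full command.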