chaosblade-operator
<POD NETWORK DELAY> : cmd exec failed, err: RTNETLINK answers: No such file or directory exit status 2
Issue Description
Type: bug report
Describe what happened (or what feature you want)
- In chaosblade-box, I fetched the K8s cluster information through the agent and ran a POD NETWORK DELAY experiment.
- The error reported was (Reason):
/opt/chaosblade/bin/nsexec -t 77143 -p -n -- /bin/sh -c tc qdisc add dev eth0 root netem delay 100ms 10ms
: cmd exec failed, err: RTNETLINK answers: No such file or directory exit status 2
- I tried running /opt/chaosblade/bin/nsexec on the corresponding Node host, and it also failed with: -bash: /opt/chaosblade/bin/nsexec: No such file or directory
Describe what you expected to happen
I would appreciate a fix or some pointers on how to troubleshoot this. Thanks!
How to reproduce it (as minimally and precisely as possible)
Tell us your environment
K8s:v1.18.18 chaosblade-box:v1.0.1 chaos-agent:v1.0.0 chaos-operator:v1.6.0 chaos-tool:v1.6.0
Anything else we need to know?
Execution info from the machine: {
"response": {
"code": 54000,
"error": "unexpected status, expected status: `create`, but the real status: `Error`, please wait!",
"result": {
"error": "`/opt/chaosblade/bin/nsexec -t 77143 -p -n -- /bin/sh -c tc qdisc add dev eth0 root netem delay 100ms 10ms`: cmd exec failed, err: RTNETLINK answers: No such file or directory\n exit status 2",
"statuses": [
{
"error": "`/opt/chaosblade/bin/nsexec -t 77143 -p -n -- /bin/sh -c tc qdisc add dev eth0 root netem delay 100ms 10ms`: cmd exec failed, err: RTNETLINK answers: No such file or directory\n exit status 2",
"kind": "pod",
"state": "Error",
"success": false
}
],
"success": false,
"uid": "2c29793a6c8da651"
},
"success": false
} }
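However, I also ran a pod-process-kill experiment:
- It failed because I entered the wrong signal. The error message showed Reason:
/opt/chaosblade/bin/nsexec -t 5165 -p -m -- /bin/sh -c kill -128 128
: cmd exec failed, err: /bin/sh: line 0: kill: 128: invalid signal specification exit status 1
- I noticed that this also goes through /opt/chaosblade/bin/nsexec.
- After I changed to the correct signal, the pod-process-kill experiment succeeded.
So I suspect the problem is not the error I got from running /opt/chaosblade/bin/nsexec on the node host. Where is the problem, then?
Additional note: while running the pod-network-delay experiment:
- Because it kept failing, I entered the target pod and ran tc qdisc add dev eth0 root netem delay 100ms 10ms myself. It still failed with: Error: Specified qdisc not found.
- From some research, the kernel-modules-extra package may be missing. For various reasons I have not managed to install it yet.
- Is this related? If the package is installed successfully, will the problem above go away?
Operator log:
Excerpt from while the experiment was running: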
time="2022-08-11T01:41:38Z" level=info msg="experiment identifiers: [{{ docker 7420e6a5bff3 centos-tc centos-tc-demo-88d8ff5f8-vg278 192.168.0.4 centos-tc-demo} /opt/chaosblade/blade create cri network delay --timeout=125 --time=100 --interface=eth0 --offset=10 --container-id 7420e6a5bff3 --container-runtime docker 0 chaosblade-tool-ljpcv chaosblade chaosblade-tool}]" experiment=c6c75e78f1a37877
time="2022-08-11T01:41:38Z" level=info msg="execute identifier: {ContainerObjectMeta:{Id: ContainerRuntime:docker ContainerId:7420e6a5bff3 ContainerName:centos-tc PodName:centos-tc-demo-88d8ff5f8-vg278 NodeName:192.168.0.4 Namespace:centos-tc-demo} Command:/opt/chaosblade/blade create cri network delay --timeout=125 --time=100 --interface=eth0 --offset=10 --container-id 7420e6a5bff3 --container-runtime docker Error: Code:0 ChaosBladePodName:chaosblade-tool-ljpcv ChaosBladeNamespace:chaosblade ChaosBladeContainerName:chaosblade-tool}" experiment=c6c75e78f1a37877
time="2022-08-11T01:41:38Z" level=info msg="Exec command in pod" command="[/opt/chaosblade/blade create cri network delay --timeout=125 --time=100 --interface=eth0 --offset=10 --container-id 7420e6a5bff3 --container-runtime docker]" container=chaosblade-tool podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-11T01:41:38Z" level=info msg="get err message" command="[/opt/chaosblade/blade create cri network delay --timeout=125 --time=100 --interface=eth0 --offset=10 --container-id 7420e6a5bff3 --container-runtime docker]" container=chaosblade-tool err="{"code":63063,"success":false,"error":"`/opt/chaosblade/bin/nsexec -t 5165 -p -n -- /bin/sh -c tc qdisc add dev eth0 root netem delay 100ms 10ms`: cmd exec failed, err: RTNETLINK answers: No such file or directory\n exit status 2"}" out= podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-11T01:41:38Z" level=error msg="`pods/exec`: k8s exec failed, err: {"code":63063,"success":false,"error":"`/opt/chaosblade/bin/nsexec -t 5165 -p -n -- /bin/sh -c tc qdisc add dev eth0 root netem delay 100ms 10ms`: cmd exec failed, err: RTNETLINK answers: No such file or directory\n exit status 2"}\n" location=github.com/chaosblade-io/chaosblade-spec-go/util.Errorf uid=
2. kernel-modules-extra
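yum install -y kernel-modules-extra installs this module. The problem appears to be related to tc, the Linux kernel traffic-control tool, inside the pod: the kernel lacks the netem queueing discipline by default, so it reports Error: Specified qdisc not found.
However, installing the module will not necessarily resolve RTNETLINK answers: No such file or directory exit status 2. Try installing kernel-modules-extra first and see whether that fixes the problem.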
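One way to check the netem point above on a node is to look for the module in the kernel's module list. This is only a sketch: the function name netem_loaded is hypothetical, and it assumes netem is built as a loadable module named sch_netem (a kernel with netem compiled in would not appear in the list).

```shell
#!/bin/sh
# Report whether the netem qdisc module appears in a module list.
# $1: path to a file in /proc/modules format (defaults to /proc/modules).
# Assumption: netem is a loadable module named sch_netem.
netem_loaded() {
    modules_file="${1:-/proc/modules}"
    # Each line of /proc/modules starts with the module name followed by a space.
    grep -q '^sch_netem ' "$modules_file"
}
```

On the node, `netem_loaded && echo "netem is loaded"` checks the running kernel. If the module is installed but not loaded, running `modprobe sch_netem` as root should load it.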
Which OS version are you on, CentOS 8.x? On 8.x this package needs to be installed, and the machine must be rebooted afterwards.
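OK, I will try installing this module again. I did run into a problem when I tried before, though: every yum repository I tried reported that the package was not available. Also, I read in another issue that from 1.6.x onward the tc inside the pod is no longer used. Is that true?
Yes. Please confirm your kernel version and distribution version first.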
My system is CentOS Linux release 7.6.1810 (Core). I pulled a docker image myself and then installed the tc command with yum -y install iproute.
4.18.0-193.el8.x86_64
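Because a container shares the kernel of its host, the release string above describes the node rather than the image. A small sketch for reading the Enterprise Linux major version out of such a string follows. The helper name el_major is hypothetical, and it assumes the RHEL/CentOS kernel naming convention with an .elN tag.

```shell
#!/bin/sh
# Print the Enterprise Linux major version embedded in a kernel release string.
# Prints nothing if the string has no ".elN" tag.
# Assumption: the kernel follows the RHEL/CentOS naming convention.
el_major() {
    printf '%s\n' "$1" | sed -n 's/.*\.el\([0-9][0-9]*\).*/\1/p'
}
```

For example, `el_major "$(uname -r)"` run on the node reports the major version of the host kernel, and `el_major 4.18.0-193.el8.x86_64` prints 8.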
Current progress:
1. My system was changed today. The version is now CentOS Linux release 8.2.2004 (Core).
2. I then installed the kernel-modules-extra package successfully.
3. The <POD NETWORK DELAY> experiment succeeded! However, the recovery phase fails with the following error:
Info: {
"response": {
"code": 54000,
"error": "unexpected status, expected status: `destroy`, but the real status: `Destroying`, please wait!",
"result": {
"error": "`pods/exec`: k8s exec failed, err: command terminated with exit code 126",
"statuses": [
{
"error": "`pods/exec`: k8s exec failed, err: command terminated with exit code 126",
"id": "fff09b30e7e8f4a2",
"kind": "pod",
"state": "Error",
"success": false
}
],
"success": false,
"uid": "f9e03fa41bc7c31f"
},
"success": false
} }
Error: Reason: `pods/exec`: k8s exec failed, err: command terminated with exit code 126
Troubleshooting: the scenario status does not match, please try again later
Log excerpt:
time="2022-08-17T07:16:20Z" level=info msg="Exec command in pod" command="[/opt/chaosblade/blade status fff09b30e7e8f4a2]" container=chaosblade-tool podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-17T07:16:20Z" level=info msg="get output message" command="[/opt/chaosblade/blade status fff09b30e7e8f4a2]" container=chaosblade-tool err= out="{"code":200,"success":true,"result":{"Uid":"fff09b30e7e8f4a2","Command":"cri","SubCommand":"network delay","Flag":" --offset=5 --container-id=3e1db8dce103 --timeout=125 --container-runtime=docker --time=60 --interface=eth0","Status":"Success","Error":"","CreateTime":"2022-08-17T07:13:45.153329761Z","UpdateTime":"2022-08-17T07:13:45.181644002Z"}}" podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-17T07:16:20Z" level=error msg="`pods/exec`: k8s exec failed, err: {"code":200,"success":true,"result":{"Uid":"fff09b30e7e8f4a2","Command":"cri","SubCommand":"network delay","Flag":" --offset=5 --container-id=3e1db8dce103 --timeout=125 --container-runtime=docker --time=60 --interface=eth0","Status":"Success","Error":"","CreateTime":"2022-08-17T07:13:45.153329761Z","UpdateTime":"2022-08-17T07:13:45.181644002Z"}}\n" location=github.com/chaosblade-io/chaosblade-spec-go/util.Errorf uid=
Operator log for the recovery-phase error:
time="2022-08-17T07:14:08Z" level=info msg="execute identifier: {ContainerObjectMeta:{Id:fff09b30e7e8f4a2 ContainerRuntime:docker ContainerId:3e1db8dce103 ContainerName:centos-tc-done PodName:centos-tc-done-6b584445b9-g5hnw NodeName:192.168.0.4 Namespace:centos-tc-done} Command: --container-label-selector io.kubernetes.pod.name=centos-tc-done-6b584445b9-g5hnw,io.kubernetes.pod.namespace=centos-tc-done,io.kubernetes.docker.type=podsandbox --container-runtime docker Error: Code:0 ChaosBladePodName:chaosblade-tool-ljpcv ChaosBladeNamespace:chaosblade ChaosBladeContainerName:chaosblade-tool}" experiment=f9e03fa41bc7c31f
time="2022-08-17T07:14:08Z" level=info msg="Exec command in pod" command="[ --container-label-selector io.kubernetes.pod.name=centos-tc-done-6b584445b9-g5hnw,io.kubernetes.pod.namespace=centos-tc-done,io.kubernetes.docker.type=podsandbox --container-runtime docker]" container=chaosblade-tool podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-17T07:14:08Z" level=error msg="Invoke exec command error" command="[ --container-label-selector io.kubernetes.pod.name=centos-tc-done-6b584445b9-g5hnw,io.kubernetes.pod.namespace=centos-tc-done,io.kubernetes.docker.type=podsandbox --container-runtime docker]" container=chaosblade-tool err= error="command terminated with exit code 126" out="OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: "": executable file not found in $PATH: unknown" podName=chaosblade-tool-ljpcv podNamespace=chaosblade
time="2022-08-17T07:14:08Z" level=error msg="`pods/exec`: k8s exec failed, err: command terminated with exit code 126" location=github.com/chaosblade-io/chaosblade-spec-go/util.Errorf uid=fff09b30e7e8f4a2
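This may be an exception caused by the polling interval configured on the platform side being too short. In practice the experiment is destroyed normally shortly afterwards. You can tell whether it was destroyed by observing the actual behavior.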
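Thanks for the reply. I tried several times and observed the behavior, and the experiment was never destroyed. tc qdisc show still lists the rule that was added by tc qdisc add ...
Looking at the error "error": "pods/exec: k8s exec failed, err: command terminated with exit code 126", I suspect the recovery step never managed to enter the target pod. Fault injection can enter the pod, so it is odd that recovery cannot.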
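The tc qdisc show check described above can be scripted so that it is easy to repeat after each recovery attempt. This is a minimal sketch: the function name has_netem_qdisc is hypothetical, and it assumes each entry in the listing starts with the word qdisc followed by the discipline name.

```shell
#!/bin/sh
# Succeed if the qdisc listing read from stdin contains a netem entry.
# Assumption: entries look like "qdisc netem <handle>: root ...".
has_netem_qdisc() {
    grep -q '^qdisc netem '
}
```

Inside the target pod's network namespace, `tc qdisc show dev eth0 | has_netem_qdisc && echo "delay rule still present"` reports whether the injected rule is still there.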
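Could you check the chaosblade execution log inside the pod? The logs are usually under /opt/chaosblade.
Thanks for the reply.
The log is below. The 10:43 entries are the successful execution and the 10:45 entries are the recovery.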
time="2022-08-23 10:43:28.128243385 UTC" level=info msg="create uid: 72e94f5b8b62644b, target: network, scope: pod, action: delay"
time="2022-08-23 10:43:28.142013125 UTC" level=error msg="chaosblade result: []" location=github.com/chaosblade-io/chaosblade/exec/kubernetes.QueryStatus uid=72e94f5b8b62644b
time="2022-08-23 10:45:39.431496547 UTC" level=info msg="destroy by 72e94f5b8b62644b uid, force-remove: false, target: "
time="2022-08-23 10:45:39.65012422 UTC" level=error msg="unexpected status, expected status: `destroyed`, but the real status: `Running`, please wait!" location=github.com/chaosblade-io/chaosblade/exec/kubernetes.QueryStatus uid=72e94f5b8b62644b
time="2022-08-23 10:45:43.434151464 UTC" level=error msg="chaosblade result: [{pod network delay false Success see resStatus for the error details [{fd98461695b31b38 Error 0 `pods/exec`: k8s exec failed, err: command terminated with exit code 126 false pod centos-tc/192.168.0.3/centos-tc-5bc68ff56f-f46fl/centos-tc-done/46d20d1c607c/docker}]}]" location=github.com/chaosblade-io/chaosblade/exec/kubernetes.QueryStatus uid=72e94f5b8b62644b
During a pod network delay experiment, destroying the experiment failed: `/opt/chaosblade/bin/nsexec -t 11077 -p -n -- /bin/sh -c tc qdisc del dev eth0 root`: cmd exec failed, err: RTNETLINK answers: No such file or directory exit status 2
Running /opt/chaosblade/bin/nsexec -t 11077 -p -n -- /bin/sh -c tc qdisc del dev eth0 root in the chaosblade-tool container on the corresponding node fails the same way. The command to execute has to be wrapped in quotes, and then it runs fine. Like this: /opt/chaosblade/bin/nsexec -t 11077 -p -n -- "/bin/sh -c tc qdisc del dev eth0 root"
Could this be because the experiment tool's exec module formats the command incorrectly?
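One general shell behavior may be relevant to the quoting question above: sh -c treats only the first argument after -c as the command string, and any further words become positional parameters. The sketch below shows this with echo standing in for tc so that it runs anywhere. Whether nsexec forwards its arguments to the shell this way is an assumption that the thread does not confirm.

```shell
#!/bin/sh
# Pass the same words to `sh -c` unquoted and quoted.
# Unquoted: the command string is just "echo"; the remaining words become $0, $1, ...
unquoted=$(sh -c echo tc qdisc del dev eth0 root)
# Quoted: the whole line is the command string.
quoted=$(sh -c "echo tc qdisc del dev eth0 root")
printf 'unquoted: [%s]\n' "$unquoted"
printf 'quoted:   [%s]\n' "$quoted"
```

The unquoted form prints an empty pair of brackets, because echo runs with no arguments, while the quoted form prints the full command line.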