kindling

Published 20 hours ago •

KindlingProject


kindling_tcp_connect_total does not accurately reflect whether TCP connections between containers have failed

Open xuchuan-666 opened this issue 1 year ago • 6 comments

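Describe the bug

The PromQL query increase(kindling_tcp_connect_total{success="false"}[2m]) always returns non-zero values for service-to-service traffic.

How to reproduce?

Deploy a Kubernetes cluster that uses Calico in IPIP overlay network mode, then deploy any Java applications that call each other. The issue reproduces.

What did you expect to see?

increase(kindling_tcp_connect_total{success="false"}[2m]) should accurately reflect whether TCP connections between two pods have failed, so the data is more accurate.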

What did you see instead?

1689561557677

All of the values inside the boxes are false positives.

Screenshots

What config did you use?

kindlingproject/kindling-agent:latesttest kindlingproject/kindling-grafana:latesttest

Logs

Environment (please complete the following information)

Kindling agent version
Kindling-falcon-lib version
Node OS version
Node Kernel version
Kubernetes version
Prometheus version
Grafana version

Additional context

Jul 17 '23 02:07 xuchuan-666

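How did you determine that these data points are "false positives"? Do these calls not exist at all, or do the calls exist but no connection failure actually occurred?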

Jul 18 '23 02:07 dxsup

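How did you determine that these data points are "false positives"? Do these calls not exist at all, or do the calls exist but no connection failure actually occurred?

The calls exist, but no connection failure occurred. Our service calls and logs show nothing abnormal, yet the data collected by Kindling intermittently reports TCP connection failures.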

Jul 20 '23 03:07 xuchuan-666

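Our scenario is fairly simple. For both calls between services inside the cluster and calls from cluster services to middleware outside the cluster, TCP connection failures are reported at irregular intervals. We checked the application logs and found no error output at all, and more than one application is affected, so we suspect the collected data is wrong.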

Jul 20 '23 03:07 xuchuan-666

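Please enable debug logging and post the logs so I can look at the data received by tcpconnectanalyzer.

To do this, set observability.console_level to debug in the configuration file, then add tcpconnectanalyzer to observability.debug_selector. Use kubectl logs to redirect the logs to a file, and attach that file here.

I suggest capturing about 5 minutes of logs, and the "false positive connection failure" metric must appear at least once during that period.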

Jul 20 '23 05:07 dxsup

2.txt 0358a44100bd16129b5a8c2d7fb371d 58fab7587abaede58896aab485a035b

Jul 20 '23 06:07 xuchuan-666

In the collected data, kindling_tcp_connect_total{errno="-2",success="false"} has an errno value of -2. This error occurs on Unix domain sockets. Sockets of type AF_UNIX should be filtered out, because they are not TCP.

Aug 01 '23 08:08 xuchuan-666
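
The filtering proposed in the last comment could look roughly like the Go sketch below. It is only an illustration under stated assumptions: `connectEvent`, `isIPFamily`, and `countConnects` are hypothetical names, not Kindling's actual types or functions, and the event is assumed to carry the socket address family and a negative errno on failure. The address-family constants are the Linux values. On Linux, errno 2 is ENOENT, which connect() can return for a Unix domain socket whose path does not exist, which would match the errno="-2" label seen above.

```go
package main

import "fmt"

// Linux address-family values, defined locally so the sketch builds on any OS.
const (
	afUnix  = 1
	afInet  = 2
	afInet6 = 10
)

// connectEvent is a hypothetical, simplified stand-in for a connect event.
// It is NOT Kindling's real data structure.
type connectEvent struct {
	Family int // socket address family (AF_*)
	Errno  int // 0 on success, negative errno on failure (assumed convention)
}

// isIPFamily reports whether the socket could carry TCP at all.
// AF_UNIX and every other non-IP family are rejected.
func isIPFamily(family int) bool {
	return family == afInet || family == afInet6
}

// countConnects tallies successful and failed connects,
// skipping events from non-IP sockets such as AF_UNIX.
func countConnects(events []connectEvent) (ok, failed int) {
	for _, e := range events {
		if !isIPFamily(e.Family) {
			continue // not TCP, so it should not feed a TCP connect metric
		}
		if e.Errno == 0 {
			ok++
		} else {
			failed++
		}
	}
	return ok, failed
}

func main() {
	events := []connectEvent{
		{Family: afInet, Errno: 0},     // successful IPv4 connect
		{Family: afUnix, Errno: -2},    // ENOENT on a Unix domain socket
		{Family: afInet6, Errno: -111}, // ECONNREFUSED on IPv6
	}
	ok, failed := countConnects(events)
	fmt.Printf("ok=%d failed=%d\n", ok, failed)
}
```

With this filter the AF_UNIX event is dropped before counting, so only the genuine ECONNREFUSED would show up under success="false". Note that an IP address family alone does not prove the socket is TCP rather than UDP, so a real fix would likely also need to check the socket type or protocol.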