log-pilot: data is occasionally lost when collecting logs on k8s

Open wiliiwin opened this issue 3 years ago • 15 comments

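Hello, we are using log-pilot on k8s, but log collection for our business containers intermittently stops. After we redeploy or restart the affected container, collection resumes, and the logs that were missed are written to ES with the current time as their timestamp. This problem keeps recurring. We do not know what causes it; could you help explain? Thanks.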

Feb 07 '21 07:02 wiliiwin

This may be the cause; see my reply in comment #19 here: https://www.cnblogs.com/William-Guozi/p/elk-k8s.html

Feb 08 '21 09:02 WilliamGuozi

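Hello, I read the reply in comment #19. That case is about a log file being emptied, but we collect directly from standard output, so no log file is being cleared by hand.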

Feb 08 '21 09:02 wiliiwin

Then what is the file type of your standard output? Or, what do log-pilot's own logs show?

Feb 08 '21 09:02 WilliamGuozi

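Our standard output files are in JSON format. I checked the log-pilot logs and did not find the restart behavior you described in comment #19. Our logs mostly look like this: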

    time="2021-02-08T10:01:16+08:00" level=info msg="log config f14d29cb4684afb9a4496d3117b0026d02f5331cdede718e9ec6f81a8cdab7b8.yml has been removed and ignore"
    time="2021-02-08T10:30:03+08:00" level=debug msg="Process container destory event: e30ca97f68ff5c7dda3511227c07eb6b2d475cbf1a4da067120fed165d44243d"
    time="2021-02-08T10:30:03+08:00" level=info msg="begin to watch log config: e30ca97f68ff5c7dda3511227c07eb6b2d475cbf1a4da067120fed165d44243d.yml"
    time="2021-02-08T10:30:03+08:00" level=debug msg="Process container start event: 9bae15af15e4feae5afdbd7ae7482d1e017dc90684b00c986144bc8f60c31f04"
    time="2021-02-08T10:30:03+08:00" level=info msg="logs: 9bae15af15e4feae5afdbd7ae7482d1e017dc90684b00c986144bc8f60c31f04 = &{std /host/data/docker/containers/9bae15af15e4feae5afdbd7ae7482d1e017dc90684b00c986144bc8f60c31f04 nonex map[time_format:%Y-%m-%dT%H:%M:%S.%NZ] 9bae15af15e4feae5afdbd7ae7482d1e017dc90684b00c986144bc8f60c31f04-json.log* map[stage:prod index:prod-crs-k8s-micro-scrm-json-log topic:prod-crs-k8s-micro-scrm-json-log] prod-crs-k8s-micro-scrm-json-log false true}"
    time="2021-02-08T10:30:03+08:00" level=info msg="Reload filebeat"
    time="2021-02-08T10:30:03+08:00" level=info msg="Start reloading"
    time="2021-02-08T10:30:03+08:00" level=debug msg="not need to reload filebeat"
    time="2021-02-08T10:30:16+08:00" level=info msg="try to remove log config e30ca97f68ff5c7dda3511227c07eb6b2d475cbf1a4da067120fed165d44243d.yml"
    time="2021-02-08T16:43:12+08:00" level=debug msg="Process container destory event: 59a601bb8c4202655f434d088737fe10a78e08413594a72091afa4e917d2ea13"
    time="2021-02-08T16:43:12+08:00" level=info msg="begin to watch log config: 59a601bb8c4202655f434d088737fe10a78e08413594a72091afa4e917d2ea13.yml"
    time="2021-02-08T16:43:18+08:00" level=info msg="try to remove log config 59a601bb8c4202655f434d088737fe10a78e08413594a72091afa4e917d2ea13.yml"
    time="2021-02-08T16:43:19+08:00" level=debug msg="Process container start event: d842e7efc4c2eca11a350795a4189651293d74a1659d9dea59bc16d191e582fc"
    time="2021-02-08T16:43:19+08:00" level=debug msg="d842e7efc4c2eca11a350795a4189651293d74a1659d9dea59bc16d191e582fc has not log config, skip"
    time="2021-02-08T16:43:22+08:00" level=debug msg="Process container start event: fee24026f2963e354b83a60f9ea2badac2495c169670f67317fcef3d4feb936c"
    time="2021-02-08T16:43:22+08:00" level=info msg="logs: fee24026f2963e354b83a60f9ea2badac2495c169670f67317fcef3d4feb936c = &{std /host/data/docker/containers/fee24026f2963e354b83a60f9ea2badac2495c169670f67317fcef3d4feb936c nonex map[time_format:%Y-%m-%dT%H:%M:%S.%NZ] fee24026f2963e354b83a60f9ea2badac2495c169670f67317fcef3d4feb936c-json.log* map[topic:uat-crs-k8s-gw-scrm-json-log stage:uat index:uat-crs-k8s-gw-scrm-json-log] uat-crs-k8s-gw-scrm-json-log false true}"
    time="2021-02-08T16:43:22+08:00" level=info msg="Reload filebeat"
    time="2021-02-08T16:43:22+08:00" level=info msg="Start reloading"
    time="2021-02-08T16:43:22+08:00" level=debug msg="not need to reload filebeat"
    time="2021-02-08T16:43:53+08:00" level=debug msg="Process container destory event: 9b87ec99da04ad52a449daf2ec4bd2c4247ff064ee878c4754d782c965a24907"
    time="2021-02-08T16:43:53+08:00" level=info msg="begin to watch log config: 9b87ec99da04ad52a449daf2ec4bd2c4247ff064ee878c4754d782c965a24907.yml"
    time="2021-02-08T16:44:18+08:00" level=info msg="log config 9b87ec99da04ad52a449daf2ec4bd2c4247ff064ee878c4754d782c965a24907.yml has been removed and ignore"
    time="2021-02-08T17:46:46+08:00" level=debug msg="Process container destory event: fee24026f2963e354b83a60f9ea2badac2495c169670f67317fcef3d4feb936c"
    time="2021-02-08T17:46:46+08:00" level=info msg="begin to watch log config: fee24026f2963e354b83a60f9ea2badac2495c169670f67317fcef3d4feb936c.yml"
    time="2021-02-08T17:46:52+08:00" level=debug msg="Process container start event: 34671cfb9a4baaf80ccec584f956cf4cdffd7a9006dbfcb54ccbb8fb8501b06d"
    time="2021-02-08T17:46:52+08:00" level=debug msg="34671cfb9a4baaf80ccec584f956cf4cdffd7a9006dbfcb54ccbb8fb8501b06d has not log config, skip"
    time="2021-02-08T17:46:54+08:00" level=debug msg="Process container start event: 6328005d98c0a5bdf30ca6ecc0ff922e5e5ddab2d8210dea3ec901fd0e5175b1"
    time="2021-02-08T17:46:54+08:00" level=info msg="logs: 6328005d98c0a5bdf30ca6ecc0ff922e5e5ddab2d8210dea3ec901fd0e5175b1 = &{std /host/data/docker/containers/6328005d98c0a5bdf30ca6ecc0ff922e5e5ddab2d8210dea3ec901fd0e5175b1 nonex map[time_format:%Y-%m-%dT%H:%M:%S.%NZ] 6328005d98c0a5bdf30ca6ecc0ff922e5e5ddab2d8210dea3ec901fd0e5175b1-json.log* map[stage:uat index:uat-crs-k8s-gw-scrm-json-log topic:uat-crs-k8s-gw-scrm-json-log] uat-crs-k8s-gw-scrm-json-log false true}"
    time="2021-02-08T17:46:54+08:00" level=info msg="Reload filebeat"
    time="2021-02-08T17:46:54+08:00" level=info msg="Start reloading"
    time="2021-02-08T17:46:54+08:00" level=debug msg="not need to reload filebeat"
    time="2021-02-08T17:46:55+08:00" level=debug msg="Process container destory event: d842e7efc4c2eca11a350795a4189651293d74a1659d9dea59bc16d191e582fc"
    time="2021-02-08T17:46:55+08:00" level=info msg="begin to watch log config: d842e7efc4c2eca11a350795a4189651293d74a1659d9dea59bc16d191e582fc.yml"
    time="2021-02-08T17:47:19+08:00" level=info msg="try to remove log config fee24026f2963e354b83a60f9ea2badac2495c169670f67317fcef3d4feb936c.yml"
    time="2021-02-08T17:47:19+08:00" level=info msg="log config d842e7efc4c2eca11a350795a4189651293d74a1659d9dea59bc16d191e582fc.yml has been removed and ignore"
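To line these events up against the time window in which logs went missing, it can help to group them per container. The following is only a rough sketch: it assumes the `time="..." level=... msg="..."` line shape shown in the excerpt, and `summarize` is a hypothetical helper, not part of log-pilot.

```python
import re
from collections import defaultdict

# Matches the line shape in the excerpt: time="..." level=... msg="..."
LINE_RE = re.compile(r'time="(?P<time>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>.*)"\s*$')
# "destory" is spelled this way in the log output itself.
EVENT_RE = re.compile(r'Process container (?P<event>start|destory) event: (?P<cid>[0-9a-f]+)')
SKIP_RE = re.compile(r'^(?P<cid>[0-9a-f]+) has not log config, skip')


def summarize(lines):
    """Group start / destory / skipped events by container ID, in input order."""
    events = defaultdict(list)
    for line in lines:
        m = LINE_RE.search(line.strip())
        if not m:
            continue  # not a line in the expected shape
        msg = m.group("msg")
        ev = EVENT_RE.search(msg)
        if ev:
            events[ev.group("cid")].append((m.group("time"), ev.group("event")))
            continue
        skip = SKIP_RE.match(msg)
        if skip:
            events[skip.group("cid")].append((m.group("time"), "skipped"))
    return dict(events)


# Three lines copied from the excerpt, all for the same container.
sample = [
    'time="2021-02-08T16:43:19+08:00" level=debug msg="Process container start event: d842e7efc4c2eca11a350795a4189651293d74a1659d9dea59bc16d191e582fc"',
    'time="2021-02-08T16:43:19+08:00" level=debug msg="d842e7efc4c2eca11a350795a4189651293d74a1659d9dea59bc16d191e582fc has not log config, skip"',
    'time="2021-02-08T17:46:55+08:00" level=debug msg="Process container destory event: d842e7efc4c2eca11a350795a4189651293d74a1659d9dea59bc16d191e582fc"',
]

for cid, evs in summarize(sample).items():
    print(cid[:12], evs)
```

A container whose events include "skipped" had no log configuration detected at start, so comparing that list with the affected application's container IDs is one way to narrow things down.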

Feb 08 '21 09:02 wiliiwin

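The logs look normal. Are logs still being lost? From your description, could the log volume be too high, so that with insufficient resources collection falls behind? I am using the latest version, the filebeat variant, and have not seen a similar problem.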

Feb 08 '21 10:02 WilliamGuozi

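It is intermittent: the problem reappears after some time, and it affects one module while collection for the other modules stays normal. Resources should not be the problem. The historical monitoring graphs show little load, so a resource shortage is unlikely to be the main cause. Also, log-pilot runs as a DaemonSet, so every node starts its own log-pilot pod. I am also on the latest version, the 0.9.7 Docker image pulled from Docker Hub.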

Feb 08 '21 10:02 wiliiwin

About "one module": compare it with the other modules and see what differs, especially in how it outputs logs.

Feb 08 '21 10:02 WilliamGuozi

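The "one module" I mentioned is not a fixed one. It is not that a single module keeps failing while the others never do; I was describing what we see within any given window of lost data.

There is no difference between modules. Our deployments are templated with Helm, so the log-pilot settings in every YAML file are identical. Below is the log-pilot configuration on our business containers; I am not sure whether something in it is wrong.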

    env:
      - name: pilot_logs_std
        value: stdout
      - name: pilot_logs_std_target
        value: uat-crs-k8s-console-json-log
      - name: pilot_logs_std_tags
        value: stage=uat

Feb 08 '21 10:02 wiliiwin

The name should have three segments separated by `_`. I ran into this before: with more segments there are problems.

Feb 08 '21 10:02 WilliamGuozi

I don't understand what you mean by that.

Feb 08 '21 16:02 wiliiwin

??? I mean the naming format of the environment variable set when the container starts!!! aliyun_logs_$name, it must be three segments.

Feb 13 '21 03:02 WilliamGuozi

https://developer.aliyun.com/article/674327 I configured it by following this article from Alibaba Cloud, and the names in that article also have four segments here.
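To make the segment question concrete, here is a small sketch that splits the variable names from the configuration earlier in this thread. It assumes the names follow the shape `<prefix>_logs_<name>[_<property>]`; `split_log_env` and the accepted prefixes are hypothetical, and this is an illustration only, not log-pilot's own parsing code.

```python
def split_log_env(name, prefixes=("aliyun", "pilot")):
    """Split an env var name into (prefix, log_name, property), or return None.

    Assumed shape: <prefix>_logs_<name>[_<property>]. Anything after the third
    segment is treated as the property; this is an illustration only.
    """
    parts = name.split("_")
    if len(parts) < 3 or parts[0] not in prefixes or parts[1] != "logs" or not parts[2]:
        return None
    prop = "_".join(parts[3:]) or None
    return parts[0], parts[2], prop


for env_name in ("pilot_logs_std", "pilot_logs_std_target", "pilot_logs_std_tags"):
    print(env_name, "->", split_log_env(env_name))
```

Under that assumption, `pilot_logs_std` is the three-segment declaration of a log named `std`, and the four-segment names attach a property (`target`, `tags`) to that same log. A log name that itself contains `_` would be ambiguous under this split, which may be the case the three-segment advice is warning about.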

Feb 17 '21 03:02 wiliiwin

Has nobody else run into this problem?

Mar 02 '21 01:03 wiliiwin

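Could the cause be the number of logs being collected? For example, keep only one stdout stream and the data logs inside the container, then observe again.

Addendum:

I ran into the same behavior in my own workloads: one service produced a very large volume of logs, and its logs were collected incompletely. It is resolved now; the rough steps were:

1. Deploy a separate log-pilot dedicated to the high-volume service, collecting only that service's logs.
2. Tune log-pilot's filebeat.tpl template. My tuned template is shown here for comparison: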

    {{range .configList}}
    - type: log
      enabled: true
      paths:
          - {{ .HostDir }}/{{ .File }}
      scan_frequency: 5s
      fields_under_root: true
      {{if .Stdout}}
      docker-json: true
      {{end}}
      {{if eq .Format "json"}}
      json.keys_under_root: true
      {{end}}
      fields:
          {{range $key, $value := .Tags}}
          {{ $key }}: {{ $value }}
          {{end}}
          {{range $key, $value := $.container}}
          {{ $key }}: {{ $value }}
          {{end}}
      tail_files: false
      close_inactive: 2h
      close_eof: false
      close_removed: true
      clean_removed: true
      close_renamed: false

Jun 24 '22 03:06 bogeit

I have hit the same problem: within certain time windows, one application's logs are collected incompletely.

Aug 30 '22 08:08 landyli
