telegraf icon indicating copy to clipboard operation
telegraf copied to clipboard

tail plugin issue

Open tbs575 opened this issue 1 year ago • 7 comments

Relevant telegraf.conf

[global_tags]
  job = "yibo-in-out"

[[inputs.tail]]
  files = ["D:\\LPN5.07\\SystemLog\\*.log"]
  # Read file from beginning.
  from_beginning = false
  data_format = "grok"
  grok_patterns = ['^.*\[INFO\].*上报请求信息%{GREEDYDATA:message}$']

[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states
  report_active = false
  ## If true and the info is available then add core_id and physical_id tags
  core_tags = false

[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true
  debug = true
  logtarget = "file"
  logfile = "C:\\Program Files\\telegraf\\logs.log"
  logfile_rotation_max_size = "10MB"
  logfile_rotation_max_archives = 7
  log_with_timezone = "Asia/Shanghai"


  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 10

[[processors.rename]]
  ## Specify one sub-table per rename operation.
  [[processors.rename.replace]]
    measurement = "tail"
    dest = "parking-in-out"

[[processors.parser]]
  parse_fields = ["message"]
  drop_original = true
  json_string_fields = ["VehicleNo","ChannelName"]
  json_time_key = "EventTime"
  json_time_format = "2006-01-02 15:04:05"
  data_format = "json"
  json_timezone = "Asia/Shanghai"

[[outputs.influxdb_v2]]
  urls = ["http://10.200.101.17:8086"]
  ## Token for authentication.
  token = "xxxxxxx"
  organization = "icep"
  bucket = "yiboinfo"

Logs from Telegraf

2023-07-07T18:03:01+08:00 D! [parsers.grok::tail] Grok no match found for: "[2023:07:07 18:02:58][DEBUG]出口通道号16,发送系统时间2023年07月07日"
2023-07-07T18:03:01+08:00 D! [parsers.grok::tail] Grok no match found for: "[2023:07:07 18:02:58][ERROR]通道号[2],IP[172.16.60.11]登陆成功后不能通讯"
2023-07-07T18:03:06+08:00 D! [outputs.influxdb_v2] Wrote batch of 4 metrics in 1.0002ms
2023-07-07T18:03:06+08:00 D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2023-07-07T18:03:16+08:00 D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2023-07-07T18:03:26+08:00 D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2023-07-07T18:03:36+08:00 D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2023-07-07T18:03:46+08:00 D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2023-07-07T18:03:56+08:00 D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics

System info

telegraf 1.27, Windows server 2012 R2 Datacenter

Docker

No response

Steps to reproduce

  1. set telegraf.conf file, and start. telegraf grok record and send out to influxdb.
  2. in the beginning, working file, telegraf can get record, but hours, telegraf can not get any record. although new message coming into logs file.

Expected behavior

telegraf can analysis new message in logs file continually,

Actual behavior

telegraf is running ( from telegraf log, seemed telegraf working fine, no any issue report), but telegraf can not get new message and grok.

new message still coming from log file in D:\\LPN5.07\\SystemLog

2023:07:07 18:02:58][DEBUG]出口通道号16,发送系统时间2023年07月07日
[2023:07:07 18:02:58][INFO]出场场上报请求信息{"ParkCode":"510105037","VehicleNo":"GJ9L91","EventTime":"2023-07-07 18:02:20","StartTime":"2023-07-07 12:59:39","PaymentMoney":14.0,"TotalMoney":14.0,"DeductionMoney":0.0,"IsAutoPay":false,"CouponName":"","CouponMoney":0.0,"OperateCount":1738,"Qn":"2307071802200000008","BookingCode":"","ConsumeType":1,"PayType":2,"ChannelNo":8,"ChannelName":"3号楼出口2","CarImageID":"2307071802200000008","ISETCPay":false,"ETCPaymentMoney":0.0}
[2023:07:07 18:02:58][ERROR]通道号[2],IP[172.16.60.11]登陆成功后不能通讯
[2023:07:07 18:02:58][ERROR]通道号[4],IP[172.16.60.13]登陆成功后不能通讯
[2023:07:07 18:02:58][ERROR]通道号[1],IP[172.16.60.10]一体机重连不能通讯
[2023:07:07 18:02:58][ERROR]通道号[13],IP[10.18.212.12]一体机重连不能通讯
[2023:07:07 18:02:58][ERROR]通道号[11],IP[10.18.212.10]一体机重连不能通讯
[2023:07:07 18:02:58][ERROR]通道号[3],IP[172.16.60.12]一体机重连不能通讯
[2023:07:07 18:03:00][ERROR]通道号[2],IP[172.16.60.11]登陆成功后不能通讯

Additional info

No response

tbs575 avatar Jul 10 '23 02:07 tbs575

@tbs575,

It is not immediately obvious what might be going on here. Some questions:

  1. How large is this log file?
  2. How long was telegraf working for before it stopped working?
  3. Are you still getting debug messages about no match found for other lines that are coming in?
  4. Was the file rotated?
  5. If you restart telegraf, are the new lines parsed and sent correctly?

Thanks!

powersj avatar Jul 10 '23 19:07 powersj

@powersj

  1. application write message into log file continually, application logs rotates every day. logs file size 40+M ~ 30+M.
  2. checked telegraf log, found 3 times for stopping work, 2 times for 4 hours( that mean telegraf woking fine for 4 hours), and one time for 13 hours
  3. Still get debug message, like 2023-07-11T09:31:57+08:00 D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics. when telegraf stop working, can not get any no match message
  4. Yes, application logs was rotated every day. every 00:00, application will create new logs. from telegraf debug message, can saw telegraf found new logs files 2023-07-11T00:00:09+08:00 D! [inputs.tail] Tail added for "D:\\LPN5.07\\SystemLog\\2023-07-11.info.log"
  5. Yes, restart telegraf, will saw 'no match` message again, grok working back again.

tbs575 avatar Jul 11 '23 01:07 tbs575

Follow-up questions:

  1. When telegraf stops working do you continue to see metrics from the CPU plugin? Or do all metrics stop gathering?
  2. Do you see any "Wrote batch of 4 metrics in 1.0002ms" like messages?
  3. Can you please provide the full logs? I'd like to see the full flow of what happens.

powersj avatar Jul 11 '23 13:07 powersj

  1. all metrics stop
  2. after stop working, can not saw 'Wrote batch ...", just only 'Buffer fullness: 0 / 10000 metrics'
  3. please check logs.log

more information,

  1. I setup another telegraf on windows server 2012 R2 and similar application logs, that working fine, no any issue

  2. application logs status can not be updated ( maybe windows OS issue, not sure), although new message into application log file, but just refresh windows explorer, application log file size is same, no any increase. I just open log file by text editor, without changing, exist directly, found file size increase. this case make telegraf can not grok. but I test filebeat, filbeat did not be impact, filebeat can grok

tbs575 avatar Jul 12 '23 02:07 tbs575

@powersj looking at the logs you can see that a lot of files are open (523 are closed at the end) because the "old" log-files are never closed and are kept open... Is it possible that we run into some filehandle limit or similar?

@tbs575 do you see anything in the Windows even-log around the time when Telegraf stops?

srebhan avatar Jul 12 '23 13:07 srebhan

@srebhan telegraf stop grok new message at 2023-07-11T14:34:28+08:00 in telegraf log, and at 2023-07-12T00:00:09+08:00 found new application's log file.

check windows event-log at 2023-07-11T14:34:28+08:00, there is nothing that can be noticed.

tbs575 avatar Jul 13 '23 05:07 tbs575

I don't know if it's released or not, but at least it worth mentioning that I see something similar w/ the KNX plugin. Issue #14695

alexander-zimmermann avatar Feb 07 '24 07:02 alexander-zimmermann