telegraf icon indicating copy to clipboard operation
telegraf copied to clipboard

modbus plugin can get out of sync WRT transaction IDs

Open cmock opened this issue 2 years ago • 4 comments

Relevant telegraf.conf

[[inputs.modbus]]
name = "pv_inverter"
slave_id = 1
timeout = "5s"
controller = "tcp://192.168.66.3:502"
configuration_type = "metric"
optimization = "max_insert"

[inputs.modbus.workarounds]
pause_after_connect = "2000ms"

[[inputs.modbus.metric]]
slave_id = 1
measurement = "pv_inverter" # Huawei Sun2000
# cf https://github.com/wlcrs/huawei_solar/wiki
fields = [
  { address=32064, name="P_in", type="INT32", scale=1.0 },
  { address=32069, name="L1_U", type="UINT16", scale=0.1 },
  { address=32070, name="L2_U", type="UINT16", scale=0.1 },
  { address=32071, name="L3_U", type="UINT16", scale=0.1 },
  { address=32072, name="L1_I", type="INT32", scale=0.001 },
  { address=32074, name="L2_I", type="INT32", scale=0.001 },
  { address=32076, name="L3_I", type="INT32", scale=0.001 },
  { address=32080, name="P_out", type="INT32", scale=1.0 },
  { address=32082, name="Q_out", type="INT32", scale=1.0 },
  { address=32084, name="Power_Factor", type="INT16", scale=0.001 },
  { address=32085, name="Frequency", type="UINT16", scale=0.01 },
  { address=32086, name="Efficiency", type="UINT16", scale=0.01 },
  { address=32087, name="Temperature", type="INT16", scale=0.1 },
]

  [inputs.modbus.metric.tags]
  null="maybe"


[[inputs.modbus.metric]]
slave_id = 1
measurement = "pv_inverter" # Huawei Sun2000
# cf https://github.com/wlcrs/huawei_solar/wiki
fields = [
  { address=32000, name="state_1", type="UINT16" },
  { address=32002, name="state_2", type="UINT16" },
  { address=32003, name="state_3", type="UINT16" },
  { address=32008, name="alarm_1", type="UINT16" },
  { address=32009, name="alarm_2", type="UINT16" },
  { address=32010, name="alarm_3", type="UINT16" },
  { address=32016, name="PV1_U", type="INT16", scale=0.1 },
  { address=32017, name="PV1_I", type="INT16", scale=0.01 },
  { address=32018, name="PV2_U", type="INT16", scale=0.1 },
  { address=32019, name="PV2_I", type="INT16", scale=0.01 },
  { address=32078, name="Peak_Power_today", type="INT32", scale=1.0 },
  { address=32088, name="R_iso", type="UINT16", scale=1000.0 },
  { address=32089, name="device_status", type="UINT16" },
  { address=32090, name="fault_code", type="UINT16" },
  { address=32106, name="energy_total", type="UINT32", scale=10.0},
]

Logs from Telegraf

Nov 18 07:02:05 rockpro telegraf[1449366]: 2023-11-18T06:02:05Z E! [inputs.modbus] Error in plugin: slave 1: read tcp 192.168.57.7:56970->192.168.66.3:502: i/o timeout
Nov 18 07:02:10 rockpro telegraf[1449366]: 2023-11-18T06:02:10Z E! [inputs.modbus] Error in plugin: slave 1: modbus: response transaction id '7616' does not match request '7617'
Nov 18 07:02:20 rockpro telegraf[1449366]: 2023-11-18T06:02:20Z E! [inputs.modbus] Error in plugin: slave 1: modbus: response transaction id '7617' does not match request '7618'
Nov 18 07:02:30 rockpro telegraf[1449366]: 2023-11-18T06:02:30Z E! [inputs.modbus] Error in plugin: slave 1: modbus: response transaction id '7618' does not match request '7619'
Nov 18 07:02:40 rockpro telegraf[1449366]: 2023-11-18T06:02:40Z E! [inputs.modbus] Error in plugin: slave 1: modbus: response transaction id '7619' does not match request '7620'
Nov 18 07:02:50 rockpro telegraf[1449366]: 2023-11-18T06:02:50Z E! [inputs.modbus] Error in plugin: slave 1: modbus: response transaction id '7620' does not match request '7621'

[these lines repeat until telegraf is restarted]

System info

telegraf 1.28.5-1, Debian 12.2

Docker

No response

Steps to reproduce

The issue occurs occasionally; telegraf runs fine for weeks on end, but if this error is occuring, it does not recover by itself.

Expected behavior

I guess closing/reopening the Modbus connection after an i/o error, or at least resetting the internal counters for the transaction IDs, would be a solution.

Actual behavior

After this desynchronization the connection never recovers, until telegraf is restarted.

Additional info

The device being monitored is a Huawei Sun2000 solar inverter, whose modbus implementation is a bit flakey.

Not every "i/o timeout" message in the logs leads to this behaviour.

cmock avatar Nov 18 '23 18:11 cmock

@cmock this seems to be an error in the upstream library. Can you please report the error there and add a link to this issue!?

srebhan avatar Nov 20 '23 23:11 srebhan

done: https://github.com/grid-x/modbus/issues/78

cmock avatar Nov 21 '23 11:11 cmock