python icon indicating copy to clipboard operation
python copied to clipboard

BUG: `--request-telemetry` on a distant node times out the first time, then it works

Open juergenRe opened this issue 3 weeks ago • 4 comments

Somehow weird behavior:

  1. Start-up node x, wait a few minutes, then --request-telemetry --dest <y>: here I do not get a response and the request times out
  2. After time-out or aborting command after ca. 15s: retr the same command: now the desired result is shown.
  3. After waiting a few minutes, points 1 and 2 repeat.

Both nodes are in hop=0 distance, one is a RAK, the other is a Heltec V3 connected over WLAN (but I believe so far, that it will have the same behavior when connected via serial).

When connecting directly to node (over serial line), the request is always answered immediately.

juergenRe avatar Dec 04 '25 14:12 juergenRe

Doing some tests on this revealed this:

Reading fails: log file "TelemetryRequestFailed_afterWait-2.log"

  1. CLI will always send a packet to the radio requesting the data:
Connected to radio
Sending environment_metrics telemetry request to !c5db8a75 on channelIndex:0 (this could take a while)
DEBUG file:mesh_interface.py sendData line:544 Serializing protobuf as data: environment_metrics { }
DEBUG file:mesh_interface.py sendData line:547 len(data): 2
DEBUG file:mesh_interface.py sendData line:548 mesh_pb2.Constants.DATA_PAYLOAD_LEN: 233
DEBUG file:mesh_interface.py sendData line:571 Setting a response handler for requestId 1271786526
DEBUG file:mesh_interface.py _sendPacket line:1010 Sending packet: to: 3319499381 decoded { portnum: TELEMETRY_APP payload: "\032\000" want_response: true } id: 1271786526 hop_limit: 3 priority: RELIABLE
DEBUG file:stream_interface.py _sendToRadioImpl line:124 Sending: TS: 2025-12-08T11:00:24.524575 packet { to: 3319499381 decoded { portnum: TELEMETRY_APP payload: "\032\000" want_response: true } id: 1271786526 hop_limit: 3 priority: RELIABLE }
  1. it seems to me that the packet has arrived in the radio - the queue contains this message id:
DEBUG file:mesh_interface.py _handleFromRadio line:1308 Received from radio: queueStatus {
  free: 15
  maxlen: 16
  mesh_packet_id: 2156626112
}
  1. In case the communication fails, there is no further packet received with the ID (here: 1271786526).
  2. But later on (after 4mins) the radio may respond, but with a different ID (2671751758, timestamp 2025-12-08T11:04:48.699962). In other cases there is no response with environmental data at all.

Reading succeeds: log file "TelemetryRequestSucceeded_afterWait.log"

  1. Sending log is the same as when it fails till feedback from the queue.
  2. In case of success, the radio sends this:
DEBUG file:mesh_interface.py _handleFromRadio line:1296 in mesh_interface.py _handleFromRadio() TS: 2025-12-08T11:07:30.275885 fromRadioBytes: b'\x12d\ru\x8a\xdb\xc5\x15\x08l)\xdb"\x1c\x08C\x12\x11\r]\xa36i\x1a\n\r)\\\xbdA\x1d1\x80rD5\xc0\x80\x8b\x80H\x005Obq\xb6=`\xa36iE\x00\x00\xb8@H\x02x\x02\x82\x01 \x85\xed\xcav\x83\x96\x18\x00R\x93\x93-\x8c\xd21\x9e\x15\x12\xec\xf4\xe3!\xb8\x87\xe0[\x14iq\x82\rU\x88\x01\x01\x98\x01u'
DEBUG file:mesh_interface.py _handleFromRadio line:1308 Received from radio: packet {
  from: 3319499381
  to: 3676924936
  decoded {
    portnum: TELEMETRY_APP
    payload: "\r]\2436i\032\n\r)\\275A\0351\200rD"
    request_id: 2156626112
    bitfield: 0
  }
  id: 3060884047
  rx_time: 1765188448
  rx_snr: 5.75
  hop_limit: 2
  hop_start: 2
  public_key: "\205\355\312v\203\226\030\000R\223\223-\214\3221\236\025\022\354\364\343!\270\207\340[\024iq\202\rU"
  pki_encrypted: true
  relay_node: 117
}

Now the CLI can associate the pending request with this response using the message ID and everything terminates as expected.

From the log I believe there is some mix-up of messages in the radio itself. When the communication fails, it seems to me that there are also a lot of other telemetry packets sent from the radio and it might be that there is a race condition in answering the requests.

A possible workaround might be that the CLI resends the request after the timeout, but this is also dangerous because it might congest the communication channel.

Environment used:

FW: 2.6.11.60ec05e on both devices CLI: 2.7.3 HW: RAK4630 with BMP280; HelTec V3 connected via serial port

TelemetryRequestFailed_afterWait-2.log TelemetryRequestSucceeded_afterWait.log

juergenRe avatar Dec 08 '25 10:12 juergenRe

Are you seeing that when you send a --request-telemetry command with a specific destination, you can see the a packet which fits that criteria? Are you by chance seeing that a --request-telemetry command sent disobeys the --dest flag, and instead just sends a TELEMETRY_APP packet to broadcast from the sender?

Please let me know, because this is the behavior I am witnessing and I need to know if it's related or should have a new bug report filed.

rjDipcord avatar Dec 09 '25 18:12 rjDipcord

Yes, I send to a specific node and the CLI ask the data from this node. In the first log example you can see the request going from the local node to !c5db8a75 = 3319499381.

For all logs I have been looking into, this was always the case. So no broadcast. BTW, I noticed such "lost messages" also for other commands, especially for the --configure.

I would like to see in more detail whats going on inside the radio. Can I switch on some addional logging over a serial line or something like that? Where do I have to look for this?

juergenRe avatar Dec 10 '25 08:12 juergenRe

The issue is not that simple to reproduce as I thought. In the meantime I found more symptoms:

  1. I did some reconfiguration from network access back to bluetooth, then the device was sitting around for one day, collecting other nodes during that time. When I was retrying telemetry this morning, it worked always. But - strange enough - the node was rebooting each time the request was answered.
  2. Changing back to network access did not change this.
  3. Did a factory reset followed by a reconfiguration with network. Telemetry now does not work consistently. When time-out occurs or I abort the request, a reboot of the radio I connected to (not the node I ask the telemetry) does a reboot.

All this with the same SW settings as described above.

juergenRe avatar Dec 11 '25 08:12 juergenRe