Downlink in passive roaming does not work
- [x] The issue is present in the latest release.
- [x] I have searched the issues of this repository and believe that this is not a duplicate.
#### What happened?
When running passive roaming, the downlink XmitDataReq from the hNS never receives a response from the fNS, and thus the device's downlink frame counter is never incremented. The probable cause is that the hNS XmitDataReq HTTP client call reuses the context of the fNS's XmitDataReq or PRStartReq request; since the XmitDataReq is sent in a separate goroutine, the original context is cancelled too soon.
#### What did you expect?
The hNS should register the downlink as successful.
#### Steps to reproduce this issue
- Set up two ChirpStack Network Servers with a roaming agreement
- Join a device on the hNS
- Move the device out of hNS coverage and into fNS coverage
- Make sure the uplink requires an answer (confirmed uplink or mac-command)
#### Could you share your log output?
hNS:

```
Oct 28 12:46:40 hns.example.com chirpstack-network-server[88242]: time="2021-10-28T12:46:40.639572887Z" level=info msg="roaming: api request received" async_client=false ctx_id=d302c55a-49a2-4e6e-b07e-f4af9834f829 message_type=PRStartReq receiver_id=000001 sender_id=000000 transaction_id=2708645192
Oct 28 12:46:40 hns.example.com chirpstack-network-server[88242]: time="2021-10-28T12:46:40.646412561Z" level=info msg="sent uplink meta-data to network-controller" ctx_id=d302c55a-49a2-4e6e-b07e-f4af9834f829 dev_eui=0018b20000020450
Oct 28 12:46:40 hns.example.com chirpstack-network-server[88242]: time="2021-10-28T12:46:40.647317203Z" level=info msg="device gateway rx-info meta-data saved" ctx_id=d302c55a-49a2-4e6e-b07e-f4af9834f829 dev_eui=0018b20000020450
Oct 28 12:46:40 hns.example.com chirpstack-network-server[88242]: time="2021-10-28T12:46:40.648128014Z" level=info msg="device-session saved" ctx_id=d302c55a-49a2-4e6e-b07e-f4af9834f829 dev_addr=0363d4c1 dev_eui=0018b20000020450
Oct 28 12:46:40 hns.example.com chirpstack-network-server[88242]: time="2021-10-28T12:46:40.656330156Z" level=info msg="finished client unary call" ctx_id=d302c55a-49a2-4e6e-b07e-f4af9834f829 grpc.code=OK grpc.ctx_id=c4ed93c6-50ae-445c-94f6-85b2f1802c94 grpc.duration=8.843511ms grpc.method=HandleUplinkData grpc.service=as.ApplicationServerService span.kind=client system=grpc
Oct 28 12:46:40 hns.example.com chirpstack-network-server[88242]: time="2021-10-28T12:46:40.751556966Z" level=info msg="pending mac-command block set" cid=LinkADRReq commands=1 ctx_id=d302c55a-49a2-4e6e-b07e-f4af9834f829 dev_eui=0018b20000020450
Oct 28 12:46:40 hns.example.com chirpstack-network-server[88242]: time="2021-10-28T12:46:40.752367508Z" level=info msg="device-session saved" ctx_id=d302c55a-49a2-4e6e-b07e-f4af9834f829 dev_addr=0363d4c1 dev_eui=0018b20000020450
Oct 28 12:46:40 hns.example.com chirpstack-network-server[88242]: time="2021-10-28T12:46:40.752709202Z" level=info msg="storage: downlink-frame saved" ctx_id=d302c55a-49a2-4e6e-b07e-f4af9834f829 token=54018
Oct 28 12:46:40 hns.example.com chirpstack-network-server[88242]: time="2021-10-28T12:46:40.752766413Z" level=info msg="api/roaming: returned response" ctx_id=d302c55a-49a2-4e6e-b07e-f4af9834f829 message_type=PRStartAns receiver_id=000000 result_code=Success sender_id=000001 transaction_id=2708645192
Oct 28 12:46:40 hns.example.com chirpstack-network-server[88242]: time="2021-10-28T12:46:40.752935495Z" level=error msg="downlink/data: XmitDataReq failed" ctx_id=d302c55a-49a2-4e6e-b07e-f4af9834f829 dev_eui=0018b20000020450 error="http post error: Post \"https://fns.example.com:8002\": context canceled" net_id=000000
```

fNS:

```
Oct 28 14:46:40 fns.example.com chirpstack-network-server[1775656]: time="2021-10-28T14:46:40.325834985+02:00" level=info msg="gateway/mqtt: uplink frame received" gateway_id=0080000000010513 uplink_id=e5f547e2-4e0d-4310-8e95-597053be51a4
Oct 28 14:46:40 fns.example.com chirpstack-network-server[1775656]: time="2021-10-28T14:46:40.628540019+02:00" level=info msg="uplink: frame(s) collected" ctx_id=5321223e-84ab-4e75-827f-d21dc8aee4d1 mtype=ConfirmedDataUp uplink_ids="[e5f547e2-4e0d-4310-8e95-597053be51a4]"
Oct 28 14:46:40 fns.example.com chirpstack-network-server[1775656]: time="2021-10-28T14:46:40.628886063+02:00" level=info msg="uplink/data: devaddr does not match netid, assuming roaming device" ctx_id=5321223e-84ab-4e75-827f-d21dc8aee4d1 dev_addr=0363d4c1
Oct 28 14:46:40 fns.example.com chirpstack-network-server[1775656]: time="2021-10-28T14:46:40.629397827+02:00" level=info msg="uplink/data: starting passive-roaming sessions with matching netids" ctx_id=5321223e-84ab-4e75-827f-d21dc8aee4d1 dev_addr=0363d4c1
Oct 28 14:46:40 fns.example.com chirpstack-network-server[1775656]: time="2021-10-28T14:46:40.629782151+02:00" level=info msg="uplink/data: starting passive-roaming session" ctx_id=5321223e-84ab-4e75-827f-d21dc8aee4d1 dev_addr=0363d4c1 net_id=000001
Oct 28 14:46:40 fns.example.com chirpstack-network-server[1775656]: time="2021-10-28T14:46:40.752562177+02:00" level=info msg="lorawan/backend: finished backend api call" message_type=PRStartReq protocol_version=1.0 receiver_id=000001 result_code=Success sender_id=000000 transaction_id=2708645192
```
#### Your Environment
| Component | Version |
|---|---|
| Application Server | v3.17.3 |
| Network Server | v3.15.3 |
After changing the goroutine to send the XmitDataReq using a new context, I realized that the hNS does not trigger the DownlinkTxAck flow after receiving a Success from the fNS. The relevant spec (TS002-1.1.0 LoRaWAN Backend Interfaces, section 11.3.2) only says that the fNS should report the DLFreq and the result, so it should be safe to assume the frame has been sent when a Success is received. Whether the frame was actually received by the gateway and sent on air would certainly be nice to know, but that is outside the scope of the Backend Interfaces spec.
I will look into a solution for this part and eventually submit a PR.
Thanks for reporting. I ran into the same issue last week. This was a regression caused by the txack handling, which was added at a later point.
The above commit removes spawning the XmitDataReq within a goroutine, as I'm not sure it adds any value. It might actually be better not to run this asynchronously from the rest of the flow, as some mac-commands could otherwise be marked as requested within the device-session while the downlink was never sent.
This at least solves the queue-item that stays in the queue forever. What it doesn't implement is the logging of the downlink transmission.
Brilliant, thanks!
I will perform additional roaming tests based on this last commit.
After some tests and review, I unfortunately think the async solution is needed. If we look at the example of stateless passive roaming, we end up with a suboptimal flow:
1. fNS sends PRStartReq including the uplink
2. sNS processes the uplink and sends XmitDataReq
3. fNS processes the downlink and sends XmitDataAns
4. sNS processes the answer and sends PRStartAns
In order to follow the spec (the correct order would be 1, 4, 2, 3), the XmitDataReq needs to be sent asynchronously (or triggered after the PRStartAns has been sent) so that the PRStartReq flow can complete first. Currently there is a very high risk of the fNS timing out on the PRStartReq request, since it has to wait for the XmitDataReq to finish. The above flow results in even worse timeouts when async roaming is used, due to the extra round-trips.
Raising async_timeout would probably work around the timeouts, but it would be good to follow the spec here; other Network Servers might expect that ordering.