iroh
Too many `failed to open disco box` in logs
I'm seeing a lot of lines like these in the logs:
2024-04-17T17:09:24.159112Z WARN poll_recv: UDP recv quic packets: no node state found, skipping src=<my_node_ip>:11204 count=1 len=106 me=...
2024-04-17T17:09:24.306317Z WARN poll_recv:disco_in: failed to open disco box err=decryption failed: Error
On some nodes (perhaps those with poor network connectivity), these are the only lines I see in the logs:
2024-04-17T17:17:54.816564Z WARN poll_recv:disco_in: failed to open disco box err=decryption failed: Error me=... node=... src=Udp(myip:11204)
2024-04-17T17:17:55.454617Z INFO poll_recv: receive_udp: no node_map state found for addr, ignore src=myip:11204 me=...
2024-04-17T17:17:55.454662Z WARN poll_recv: UDP recv quic packets: no node state found, skipping src=myip:11204 count=1 len=33 me=...
2024-04-17T17:17:55.507406Z INFO poll_recv: receive_udp: no node_map state found for addr, ignore src=myip:11204 me=...
2024-04-17T17:17:55.507434Z WARN poll_recv: UDP recv quic packets: no node state found, skipping src=myip:11204 count=1 len=33 me=...
2024-04-17T17:17:56.141233Z INFO poll_recv: receive_udp: no node_map state found for addr, ignore src=myip:11204 me=...
2024-04-17T17:17:56.141281Z WARN poll_recv: UDP recv quic packets: no node state found, skipping src=myip:11204 count=1 len=30 me=...
2024-04-17T17:17:56.141302Z INFO poll_recv: receive_udp: no node_map state found for addr, ignore src=myip:11204 me=...
2024-04-17T17:17:56.141313Z WARN poll_recv: UDP recv quic packets: no node state found, skipping src=myip:11204 count=1 len=30 me=...
2024-04-17T17:17:56.257519Z INFO poll_recv: receive_udp: no node_map state found for addr, ignore src=myip:11204 me=...
2024-04-17T17:17:56.257564Z WARN poll_recv: UDP recv quic packets: no node state found, skipping src=myip:11204 count=1 len=33 me=...
2024-04-17T17:17:56.283399Z INFO poll_recv: receive_udp: no node_map state found for addr, ignore src=myip:11204 me=...
2024-04-17T17:17:56.283440Z WARN poll_recv: UDP recv quic packets: no node state found, skipping src=myip:11204 count=1 len=33 me=...
Two questions about the setup in which these errors occur:
- Do the nodes in your setup all have stable node ids, or are there nodes that could get a new node id on each restart?
- Do nodes restart a lot?
The reason I'm asking is that, after a first look, this is the first scenario in which I can see this being hit:
- Machine A: Node with id N1, port P1
- Machine B: Node with id N2, port P2
- N2 dials N1 via ticket
- N2 --discoping--> N1 via P1
- connection established
- Machine A: restart
- Machine A: Node starts, id N3, port P1
- N2 --discoping--> N1 via P1
- Node N3 receives disco packets directed at N1, triggering the CryptoBox error
- Node N2 gets no reply from N3 and keeps sending "bad" disco packets until it times out
I will still have to investigate how this changes if a relay is present.
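To illustrate why the step where N3 receives a packet meant for N1 fails with a decryption error rather than being silently accepted: a disco payload is sealed with a crypto box keyed on both the sender's and the intended receiver's keypair. Below is a minimal standalone sketch of that property using the `crypto_box` crate directly (this is not iroh's internal disco code; the node names and message are illustrative only). A box sealed for N1 cannot be opened by N3, even though N3 is listening on N1's old port.

```rust
// Standalone illustration (not iroh internals): a crypto box sealed for node N1
// cannot be opened by node N3, because the shared secret depends on both keypairs.
use crypto_box::{
    aead::{Aead, AeadCore, OsRng},
    SalsaBox, SecretKey,
};

fn main() {
    // N2 is the dialing node; N1 is the original node; N3 is the node that
    // replaced N1 on the same port after a restart with a fresh keypair.
    let n2_secret = SecretKey::generate(&mut OsRng);
    let n1_secret = SecretKey::generate(&mut OsRng);
    let n3_secret = SecretKey::generate(&mut OsRng);

    // N2 seals a disco-style message for N1 using the (N2, N1) shared secret.
    let sealing = SalsaBox::new(&n1_secret.public_key(), &n2_secret);
    let nonce = SalsaBox::generate_nonce(&mut OsRng);
    let sealed = sealing.encrypt(&nonce, b"disco ping".as_ref()).unwrap();

    // N3 receives the packet on N1's old port, but its keypair produces a
    // different shared secret, so opening the box fails -- which surfaces as
    // the "failed to open disco box err=decryption failed" warning.
    let opening = SalsaBox::new(&n2_secret.public_key(), &n3_secret);
    assert!(opening.decrypt(&nonce, sealed.as_ref()).is_err());
}
```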
Do the nodes in your setup all have stable node ids, or are some nodes that could get a new node id on each restart?
Something in the middle ground. Nodes are not getting new IDs every time, but during debugging and development I re-initialized nodes on the same servers several times. The approximate numbers are as follows: for 7 live and functional nodes there are roughly 7-10 dead nodes that existed at some point with the same or different IPs. Most of the live nodes had connected to the now-dead ones at some point.
Do nodes restart a lot?
Yes, maybe once every 6-24 hours. But they do not receive new IDs on restart.
UPD: The peer store is enabled, so I have a list of all dead peers there, and it does not seem to be pruned. Not sure if it is related, but just FYI.
This seems to have been addressed by #2123; is this still happening?
#2333 should mitigate this to a great extent. The cause is disco messages meant for dead nodes that used to run on the same port as the current one, plus propagation of stale info. The easiest way to trigger this is to remove the keypair file at any point while keeping alive the nodes that used to communicate with the old one.
From testing, this is simple to reproduce, and #2333 completely prevents it in testing. Could you give it a go?
Yes, I can confirm that the number of these lines decreased! Thank you