iroh icon indicating copy to clipboard operation
iroh copied to clipboard

Too many `failed to open disco box` in logs

Open ppodolsky opened this issue 1 year ago • 2 comments

Having a lot of lines

2024-04-17T17:09:24.159112Z  WARN poll_recv: UDP recv quic packets: no node state found, skipping src=<my_node_ip>:11204 count=1 len=106 me=...
2024-04-17T17:09:24.306317Z  WARN poll_recv:disco_in: failed to open disco box err=decryption failed: Error

On some nodes (perhaps with poor network connectivity, it is the only that I see in logs).

2024-04-17T17:17:54.816564Z  WARN poll_recv:disco_in: failed to open disco box err=decryption failed: Error me=... node=... src=Udp(myip:11204)
2024-04-17T17:17:55.454617Z  INFO poll_recv: receive_udp: no node_map state found for addr, ignore src=myip:11204 me=...
2024-04-17T17:17:55.454662Z  WARN poll_recv: UDP recv quic packets: no node state found, skipping src=myip:11204 count=1 len=33 me=...
2024-04-17T17:17:55.507406Z  INFO poll_recv: receive_udp: no node_map state found for addr, ignore src=myip:11204 me=...
2024-04-17T17:17:55.507434Z  WARN poll_recv: UDP recv quic packets: no node state found, skipping src=myip:11204 count=1 len=33 me=...
2024-04-17T17:17:56.141233Z  INFO poll_recv: receive_udp: no node_map state found for addr, ignore src=myip:11204 me=...
2024-04-17T17:17:56.141281Z  WARN poll_recv: UDP recv quic packets: no node state found, skipping src=myip:11204 count=1 len=30 me=...
2024-04-17T17:17:56.141302Z  INFO poll_recv: receive_udp: no node_map state found for addr, ignore src=myip:11204 me=...
2024-04-17T17:17:56.141313Z  WARN poll_recv: UDP recv quic packets: no node state found, skipping src=myip:11204 count=1 len=30 me=...
2024-04-17T17:17:56.257519Z  INFO poll_recv: receive_udp: no node_map state found for addr, ignore src=myip:11204 me=...
2024-04-17T17:17:56.257564Z  WARN poll_recv: UDP recv quic packets: no node state found, skipping src=myip:11204 count=1 len=33 me=...
2024-04-17T17:17:56.283399Z  INFO poll_recv: receive_udp: no node_map state found for addr, ignore src=myip:11204 me=...
2024-04-17T17:17:56.283440Z  WARN poll_recv: UDP recv quic packets: no node state found, skipping src=myip:11204 count=1 len=33 me=...

ppodolsky avatar Apr 17 '24 17:04 ppodolsky

Two questions about the setup in which these errors occur:

  • Do the nodes in your setup all have stable node ids, or are some nodes that could get a new node id on each restart?
  • Do nodes restart a lot?

Reason I'm asking is because after first observation, the first scenario I can see this being hit is like this:

  • Machine A: Node with id N1, port P1
  • Machine B: Node with id N2, port P2
  • N2 dials N1 via ticket
  • N2 --discoping--> N1 via P1
  • connection established
  • Machine A: restart
  • Machine A: Node starts, id N3, port P1
  • N2 --discoping--> N1 via P1
  • Node N3 received disco packet directed at N1, triggering the CryptoBox error
  • Node N2 does not get a reply from N3, keeps sending "bad" disco packets until timeout

I will still have to investigate how this changes if a relay is present.

Frando avatar Apr 17 '24 23:04 Frando

Do the nodes in your setup all have stable node ids, or are some nodes that could get a new node id on each restart?

Something in the middle ground. Nodes are not getting new IDs every time, but during debugging and development I had re-inited nodes on the same servers several times. The approximate numbers are following: for 7 alive and functional nodes there are approximately 7-10 dead nodes that had existed and had different or same IPs. Most alive nodes had ever connected to the already dead ones.

Do nodes restart a lot?

Yes, may be one time per 6-24 hour. But they are not receiving new IDs at restart.

UPD.: Peer store is enabled. So I have there list of all dead peers, seems it is not pruned. Not sure if it is realted, but jfyi

ppodolsky avatar Apr 18 '24 05:04 ppodolsky

This seems to have been addressed by #2123, is this still happening?

divagant-martian avatar May 20 '24 15:05 divagant-martian

#2333 should mitigate this to a great extent. The reason behind this is disco messages that are meant for dead nodes that used to run on the same port as the current one, plus propagation of stale info. Easiest way to kickstart this process is by removing the keypair file at any point while keeping alive nodes that used to communicate with the old one.

From testing, this is simple to reproduce and #2333 completely prevents this is testing. Could you give it a go?

divagant-martian avatar Jun 12 '24 16:06 divagant-martian

Yes, I can confirm that the number of these lines decreased! Thank you

ppodolsky avatar Jun 12 '24 17:06 ppodolsky