
YSFGateway & ircDDBGateway high reconnect rate on APRS-IS servers

Open hessu opened this issue 6 years ago • 6 comments

Hi,

It seems to me that the YSFGateway & ircDDBGateway have one or two issues which might end up killing some APRS-IS servers at some point, or getting some people annoyed. T2TEXAS is currently getting some 50 to 70 new APRS-IS client connections per second, all from a small bunch (~10) of these YSF/ircDDBGateways.

It seems that there is more than one instance configured with the same callsign-SSID (such as N0CALL-G), possibly due to a config issue or a restart/start gone wrong and starting a duplicate.

The APRS-IS server will (rightly) accept only one client using the same callsign-SSID, and disconnect the previous client. The previous client, which was already logged in, reconnects immediately after getting disconnected without any sort of delay, which again causes the other connection to be kicked out.

  • There should be a reconnect delay timer, preferably with exponential backoff (and a fixed upper limit), that applies when the client gets disconnected soon after a successful connection. If I read the code right, it currently only has a delay timer if the connect/login fails.
  • There might be need for stronger protection against duplicate instances (pid file with flock() or other locking mechanism).
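
To make the first point concrete, here is a minimal sketch of such a timer. It is not taken from the gateway sources: the class name `ReconnectBackoff`, its parameters, and the idea of resetting the delay only after the session has stayed up for a minimum time are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical helper, not part of YSFGateway or ircDDBGateway.
// The delay doubles on every disconnect up to a fixed maximum, and is only
// reset once a session has stayed connected for at least 'stableAfter'.
class ReconnectBackoff {
public:
    using Seconds = std::chrono::seconds;

    ReconnectBackoff(Seconds initial, Seconds maximum, Seconds stableAfter)
        : m_initial(initial), m_maximum(maximum),
          m_stableAfter(stableAfter), m_current(initial) {}

    // Call whenever the connection drops or a connect/login attempt fails.
    // 'uptime' is how long the session stayed logged in (zero if it never did).
    // Returns how long to wait before the next connect attempt.
    Seconds onDisconnect(Seconds uptime)
    {
        if (uptime >= m_stableAfter)
            m_current = m_initial;  // the last session was healthy, start over

        Seconds delay = m_current;
        m_current = std::min<Seconds>(m_current * 2, m_maximum);
        return delay;
    }

private:
    Seconds m_initial;
    Seconds m_maximum;
    Seconds m_stableAfter;
    Seconds m_current;
};
```

With `ReconnectBackoff b(Seconds(60), Seconds(600), Seconds(300))`, two duplicate instances that keep kicking each other out after a few seconds would wait 60, 120, 240, 480 and then 600 seconds between attempts, because a successful login alone does not reset the delay.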

Here's what the server log looks like (a short sample of a very fast-growing log):

2019/11/21 15:03:57.818066 aprsc-texas[513:7fec33e1d700] INFO: fd 144: Disconnecting duplicate validated client with username 'K5VPW-R'
2019/11/21 15:03:57.818097 aprsc-texas[513:7fec33e1d700] INFO: fd 236: Disconnecting duplicate validated client with username 'AF5EH-G'
2019/11/21 15:03:57.841881 aprsc-texas[513:7fec33e1d700] INFO: fd 403: Disconnecting duplicate validated client with username 'W9LL-G'
2019/11/21 15:03:57.867334 aprsc-texas[513:7fec33e1d700] INFO: fd 376: Disconnecting duplicate validated client with username 'WX5WDG-N'
2019/11/21 15:03:57.867361 aprsc-texas[513:7fec33e1d700] INFO: fd 366: Disconnecting duplicate validated client with username 'KD9JSX-N'
2019/11/21 15:03:57.870035 aprsc-texas[513:7fec33e1d700] INFO: fd 142: Disconnecting duplicate validated client with username 'N7SGT-N'
2019/11/21 15:03:57.881081 aprsc-texas[513:7fec33e1d700] INFO: fd 385: Disconnecting duplicate validated client with username 'KG5EIU-G'
2019/11/21 15:03:57.897718 aprsc-texas[513:7fec33e1d700] INFO: fd 405: Disconnecting duplicate validated client with username 'N5JFP-G'
2019/11/21 15:03:57.911954 aprsc-texas[513:7fec33e1d700] INFO: fd 182: Disconnecting duplicate validated client with username 'KG5EIU-G'
2019/11/21 15:03:57.986255 aprsc-texas[513:7fec33e1d700] INFO: fd 172: Disconnecting duplicate validated client with username 'N9PMR-G'
2019/11/21 15:03:57.986953 aprsc-texas[513:7fec33e1d700] INFO: fd 142: Disconnecting duplicate validated client with username 'N9PMR-G'


hessu, Nov 21 '19 19:11

The reconnect loop will only happen if the duplicate clients connect to the same APRS-IS server (i.e. they're not using the rotate). If they end up on different servers, it'll just lead to partial packet loss for packets sent by those two clients.

hessu, Nov 21 '19 20:11

Ping?

hessu, Mar 03 '20 09:03

Hi Hessu. I've just made a change that adds a backoff timer to the APRS reporting of the YSF data, which starts with a backoff of 1 minute, then 2, up to 10 and then stays at 10. If the system is able to log in then that is reset to 1 minute for the next reconnect attempt. I've not been able to test it, but I will now try and roll it out to the NXDN Gateway and the ircDDB Gateway which also need the same behaviour.

g4klx, Mar 03 '20 15:03

Excellent, thanks. These gateways are very popular and have contributed to a very significant growth in the number of APRS-IS clients, so their behaviour is important. I tried to check the C++ sources today but couldn't quite be sure: they do a new DNS lookup every time they reconnect, right? That is needed for DNS load balancing to work properly.

hessu, Mar 03 '20 16:03

Yes they do a new DNS lookup for each connect. It's done within the TCP socket class. I am sure if the new code contains bugs, someone will let me know very quickly :-)
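
As an illustration of the per-connect lookup described above, the following is a rough sketch only. It is not the code of the gateway's TCP socket class; the function name `resolveNow` is hypothetical, and it assumes a POSIX system with `getaddrinfo()`.

```cpp
#include <string>
#include <vector>

#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical helper: resolve 'host' again on every call, so that each
// reconnect sees whatever addresses the resolver currently returns.
// Returns the addresses as text, in resolver order; empty on failure.
std::vector<std::string> resolveNow(const std::string& host, const std::string& port)
{
    std::vector<std::string> out;

    addrinfo hints{};
    hints.ai_family   = AF_UNSPEC;    // accept IPv4 and IPv6
    hints.ai_socktype = SOCK_STREAM;  // TCP, as used for APRS-IS

    addrinfo* res = nullptr;
    if (::getaddrinfo(host.c_str(), port.c_str(), &hints, &res) != 0)
        return out;

    for (addrinfo* p = res; p != nullptr; p = p->ai_next) {
        char text[INET6_ADDRSTRLEN] = {0};
        const void* addr = nullptr;

        if (p->ai_family == AF_INET)
            addr = &reinterpret_cast<sockaddr_in*>(p->ai_addr)->sin_addr;
        else if (p->ai_family == AF_INET6)
            addr = &reinterpret_cast<sockaddr_in6*>(p->ai_addr)->sin6_addr;

        if (addr != nullptr && ::inet_ntop(p->ai_family, addr, text, sizeof(text)) != nullptr)
            out.push_back(text);
    }

    ::freeaddrinfo(res);
    return out;
}
```

A reconnect loop would call `resolveNow()` with the configured server name and port just before each connect attempt, rather than caching the first answer. Note that the operating system or a local resolver may still cache results for the record's TTL.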

g4klx, Mar 03 '20 16:03

Jonathan, I'm sorry, but I have to ask you to look into this again.

The issue is that there are two clients connecting to the server using the same callsign-SSID (like OH7LZB-G). When the second client connects, the first one is kicked out.

Both of the clients do get logged in successfully, and the disconnection happens later on when the other client logs in.

The main question is: how can we reliably prevent more than one instance of the gateway process from running at the same time on the same system with the same callsign-SSID? I just got a report from one server operator seeing this again; he sees 4 different YSF gateways connecting to a single server, at some 3-5 reconnections per second.
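
One possible approach, along the lines of the pid file with flock() mentioned earlier, is sketched below. This is an assumption-laden example, not existing gateway code: the function name `acquireInstanceLock` and the idea of deriving the lock file path from the callsign-SSID are hypothetical, and it only covers instances on the same host with the same lock path.

```cpp
#include <string>

#include <fcntl.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: take an exclusive, non-blocking lock on a pid file.
// Returns an open descriptor that holds the lock, or -1 if the file cannot
// be opened or another process already holds the lock.
// The caller must keep the descriptor open for the life of the process;
// the kernel releases the lock when the process exits or the fd is closed.
int acquireInstanceLock(const std::string& path)
{
    int fd = ::open(path.c_str(), O_RDWR | O_CREAT | O_CLOEXEC, 0644);
    if (fd < 0)
        return -1;

    if (::flock(fd, LOCK_EX | LOCK_NB) != 0) {
        ::close(fd);  // someone else is running with this lock file
        return -1;
    }

    // Record our pid for the operator's benefit. The lock itself, not the
    // file contents, is what prevents a second instance.
    if (::ftruncate(fd, 0) == 0) {
        const std::string pid = std::to_string(::getpid()) + "\n";
        ssize_t written = ::write(fd, pid.c_str(), pid.size());
        (void)written;
    }

    return fd;
}
```

At startup the gateway would call this once, for example with a path built from the configured callsign-SSID, and exit with an error message if it returns -1. Because the lock is released automatically when the process dies, a stale pid file left behind by a crash does not block the next start.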

hessu, Apr 03 '20 20:04