aprx icon indicating copy to clipboard operation
aprx copied to clipboard

aprx fails to try all addresses in a DNS round-robin reply

Open hessu opened this issue 10 years ago • 9 comments

I'm guessing this bug was introduced in commit 0e5c89e250567c.

I have a hostname with a couple of IPv4 addresses and an IPv6 address: igates.aprs.fi has address 85.188.1.122 igates.aprs.fi has address 193.185.39.21 igates.aprs.fi has IPv6 address 2001:67c:15c:1::122

I'm running aprx on an RPi which does not have IPv6 available locally. 193.185.39.21 happens to be down right now, 85.188.1.122 would be up and serve customers just fine (I can telnet igates.aprs.fi 14501 from the RPi).

aprx (current git master commit 79201f8a3a0e0) fails to try all addresses in the round-robin. It seems to try 193.185.39.21, it fails (unfortunately does not log the IP address being attempted!). Then it tries the IPv6 one, which fails (no IPv6 available currently), and it does not try 85.188.122.

2015-08-02 11:21:20.299 CLOSE APRSIS igates.aprs.fi:14580 fail on connect 2015-08-02 11:21:20.300 FAIL - Connect to igates.aprs.fi:14580 failed: socket formation failed - errno=97 - Address family not supported by protocol 2015-08-02 11:21:20.300 CLOSE APRSIS igates.aprs.fi:14580 reconnect 2015-08-02 11:23:29.259 CLOSE APRSIS igates.aprs.fi:14580 fail on connect 2015-08-02 11:23:29.259 FAIL - Connect to igates.aprs.fi:14580 failed: socket formation failed - errno=97 - Address family not supported by protocol 2015-08-02 11:23:34.261 CLOSE APRSIS igates.aprs.fi:14580 reconnect 2015-08-02 11:25:41.739 CLOSE APRSIS igates.aprs.fi:14580 fail on connect 2015-08-02 11:25:41.739 FAIL - Connect to igates.aprs.fi:14580 failed: socket formation failed - errno=97 - Address family not supported by protocol 2015-08-02 11:25:46.745 CLOSE APRSIS igates.aprs.fi:14580 reconnect

hessu avatar Aug 02 '15 12:08 hessu

Reverting commit 0e5c89e helped, now aprx gets connected to APRS-IS eventually.

hessu avatar Aug 02 '15 12:08 hessu

What happens if you configure your Pi to prefer IPv4 addresses over IPv6 ones? http://askubuntu.com/questions/32298/prefer-a-ipv4-dns-lookups-before-aaaaipv6-lookups

PhirePhly avatar Aug 02 '15 19:08 PhirePhly

Please test commit 3b83e3b4995becb07dbe0925a61a5197f3aa825c to see if this has resolved your issue.

PhirePhly avatar Aug 03 '15 01:08 PhirePhly

Any feedback on if this has resolved the issue?

PhirePhly avatar Aug 13 '15 03:08 PhirePhly

Reviewing this, it looks correct. The previous code, set to the first one, then incremented to the second one, then processed them in the loop.

patrickdk77 avatar Aug 29 '15 04:08 patrickdk77

Still not very good. It tries the first address, waiting a very long time for the connection to succeed, then it tries the same address again! It takes quite a while before it actually tries the second address.

It should rather move on to the next address right after the first one times out. If all the addresses fail, then maybe sleep a bit before retrying (exponential backoff).

2015-08-29 14:07:17.656 aprx start - 2.8.2-7-g7ddc5f5 2015-08-29 14:08:22.183 CLOSE APRSIS igates.aprs.fi:14580 time_reset! 2015-08-29 14:08:32.194 CLOSE APRSIS igates.aprs.fi:14580 reconnect 2015-08-29 14:10:39.743 CLOSE APRSIS igates.aprs.fi:14580 time_reset! 2015-08-29 14:10:49.754 CLOSE APRSIS igates.aprs.fi:14580 reconnect 2015-08-29 14:12:57.183 CLOSE APRSIS igates.aprs.fi:14580 time_reset! 2015-08-29 14:13:07.194 CLOSE APRSIS igates.aprs.fi:14580 reconnect 2015-08-29 14:15:14.613 CLOSE APRSIS igates.aprs.fi:14580 time_reset! 2015-08-29 14:15:27.949 CLOSE APRSIS igates.aprs.fi:14580 reconnect 2015-08-29 14:15:34.193 CONNECT APRSIS igates.aprs.fi:14580

hessu avatar Aug 29 '15 14:08 hessu

It's working as expected to me, confirmed from logging and tcpdump.

I am concerned about the TIME JUMPS though, not sure if those are expected, but I would not think so.

Now, on each dns lookup attempt, it does wait 120seconds till the next one. So each server line will take 2min to cycle though, and the timeout per attempt is not defined, left to the os, this should be changed. There are two loops currently, The loop through the server list, every 120seconds, and the loop through the dns lookup results from each server list.

interface_store() aif->callsign = 'APRSIS' erlang_add(APRSIS, RX, 0, 0) /etc/aprx.conf:33: MYCALL = 'N0CALL-1' '' /etc/aprx.conf:54: INFO: PASSCODE = '-1' '' /etc/aprx.conf:69: INFO: SERVER = 'igates.aprs.fi':'14580' /etc/aprx.conf:100: INFO: PIDFILE = '/var/run/aprx.pid' '' /etc/aprx.conf:107: INFO: RFLOG = '/var/log/aprx/aprx-rf.log' '' /etc/aprx.conf:114: INFO: APRXLOG = '/var/log/aprx/aprx.log' '' set 1 defined with 0 entries netresolve nrcount=0 telemetry_start() erlang_timer_init() to be called erlang_timer_init() to be called aprsis connection attempt IPv4 address: 85.188.1.122 aprsis connection failed. aprsis connection attempt IPv4 address: 193.185.39.21 aprsis connection failed. aprsis connection attempt IPv6 address: 2001:67c:15c:1::122 aprsis connection failed. MONOTONIC TIME JUMPED FORWARD BY 137.487 SECONDS. ttcallcount=9 mypid=24410 aprsis connection attempt IPv4 address: 85.188.1.122 aprsis connection failed. aprsis connection attempt IPv4 address: 193.185.39.21 aprsis connection failed. aprsis connection attempt IPv6 address: 2001:67c:15c:1::122 aprsis connection failed. MONOTONIC TIME JUMPED FORWARD BY 129.979 SECONDS. ttcallcount=12 mypid=24410 aprsis connection attempt IPv4 address: 85.188.1.122

patrickdk77 avatar Aug 29 '15 19:08 patrickdk77

Ok, I implemented a connection timeout (set to 20seconds for testing). I have no ipv6 on this testing station at the moment (haven't set it back up since the move).

All seems to be functional. The time jumps are related to the timeout, 40seconds excactly from the two 20sec ipv4 timeouts. I'll get these pushed up. But I am concerned with the monotonic time jump, have to dig into that now.

aprsis connection attempt IPv4 address: 85.188.1.122 aprsis connection failed. aprsis connection attempt IPv4 address: 193.185.39.21 aprsis connection failed. aprsis connection attempt IPv6 address: 2001:67c:15c:1::122 MONOTONIC TIME JUMPED FORWARD BY 40.032 SECONDS. ttcallcount=9 mypid=25465 In time_reset mode, no touching yet! aprsis connection attempt IPv4 address: 193.185.39.21 aprsis connection failed. aprsis connection attempt IPv4 address: 85.188.1.122 aprsis connection failed. aprsis connection attempt IPv6 address: 2001:67c:15c:1::122 MONOTONIC TIME JUMPED FORWARD BY 40.058 SECONDS. ttcallcount=16 mypid=25465

patrickdk77 avatar Aug 29 '15 22:08 patrickdk77

Ok, that monotonic time jump is the problem that is causing it.

After it finally makes a connection to a server, it then checks if how long it took to connect is >30seconds, if it is, it aborts the finally made new connection, in order to fix the >30second time jump.

So it all depends on dns ordering if it ever will connect.

    if (A->server_socket < 0)
            goto fail_out;

    freeaddrinfo(ai);
    ai = NULL;


    timetick(); // unpredictable time since system did last poll..

    if (time_reset) {
      if (debug) printf("In time_reset mode, no touching yet!\n");
      A->next_reconnect = tick.tv_sec + 10;
      return;
    }

    aprxlog("CONNECT APRSIS %s:%s",
            A->H->server_name, A->H->server_port);

Not sure on the proper way to fix it, but I feel just removing the return on time_reset will be perfectly fine. Really to me though, this while if(time_reset) block is not needed, as the code only reaches this point if a connection is made.

So I think something like this makes more sense:

    timetick(); // unpredictable time since system did last poll..

    if (time_reset) {
      if (debug) printf("In time_reset mode, no touching yet!\n");
      A->next_reconnect = tick.tv_sec + 10;
    }

    if (A->server_socket < 0)
            goto fail_out;

    freeaddrinfo(ai);
    ai = NULL;

    aprxlog("CONNECT APRSIS %s:%s",
            A->H->server_name, A->H->server_port);

patrickdk77 avatar Sep 16 '16 03:09 patrickdk77