openfortivpn icon indicating copy to clipboard operation
openfortivpn copied to clipboard

read returned 0: after idle disconnect

Open ehaupt opened this issue 3 years ago • 21 comments

Environment:

  • FreeBSD 12.1-RELEASE-p10
  • OpenfortiVPN version: 1.15.0

Problem description:

I'm able to establish a VPN connection and establish bi-directional connections through the tunnel. Running openfortivpn with -v I can see debug messages like the this:

INFO:   Tunnel is up and running.
DEBUG:  ppp ---> gateway (6 bytes)
DEBUG:  gateway ---> ppp (14 bytes)
DEBUG:  ppp ---> gateway (6 bytes)
DEBUG:  gateway ---> ppp (14 bytes)
DEBUG:  ppp ---> gateway (6 bytes)
DEBUG:  gateway ---> ppp (14 bytes)
DEBUG:  ppp ---> gateway (6 bytes)
DEBUG:  gateway ---> ppp (14 bytes)
DEBUG:  ppp ---> gateway (6 bytes)
DEBUG:  gateway ---> ppp (6 bytes)
DEBUG:  ppp ---> gateway (6 bytes)
DEBUG:  ppp ---> gateway (6 bytes)
DEBUG:  ppp ---> gateway (6 bytes)
DEBUG:  ppp ---> gateway (6 bytes)
DEBUG:  ppp ---> gateway (6 bytes)

This all works fine until I don't pass any data throug the tunnel. I suspect this is due to the server terminating the session after some idle timeout. I then receive the following looping messages until I hit ^C:

WARN:   read returned 0
WARN:   read returned 0
WARN:   read returned 0
WARN:   read returned 0
WARN:   read returned 0
WARN:   read returned 0
WARN:   read returned 0
WARN:   read returned 0
WARN:   read returned 0
...

This does not occur if I make sure there is a constant flow of data through the tunnel (eg. running top on a remote machine).

While searching the issue tracker I was able to find a similar earlier issues:

  • https://github.com/adrienverge/openfortivpn/pull/132
  • https://github.com/adrienverge/openfortivpn/issues/137

Someone suggested to look for zombie pppd processes. In my case I was unable to detect any.

Running the openfortivpn with truss -f I was able to capture the following output right before the first output of read returned 0:

...
53699: sigprocmask(SIG_BLOCK,{ SIGALRM },{ })    = 0 (0x0)
53699: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
53699: sigprocmask(SIG_BLOCK,{ SIGALRM },{ })    = 0 (0x0)
53699: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
53699: sigprocmask(SIG_BLOCK,{ SIGALRM },{ })    = 0 (0x0)
53699: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
53699: __sysctl(0x7fffffffcba0,0x2,0x7fffffffd8f0,0x7fffffffcb98,0x0,0x0) = 0 (0x0)
53699: getpid()                                  = 53699 (0xd1c3)
53699: sendto(7,"<30>1 2020-10-20T18:10:17.040870"...,125,0,NULL,0) = 125 (0x7d)
53699: sigprocmask(SIG_BLOCK,{ SIGALRM },{ })    = 0 (0x0)
53699: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
53699: sigprocmask(SIG_BLOCK,{ SIGALRM },{ })    = 0 (0x0)
53699: getitimer(0,0x0)                          = 0 (0x0)
53699: setitimer(0,{ 0.000000, 0.600000 },0x0)   = 0 (0x0)
53699: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
53699: ioctl(0,TIOCGETA,0x7fffffffdbe0)          = 0 (0x0)
53699: ioctl(0,TIOCSETA,0x7fffffffdbe0)          = 0 (0x0)
53699: __sysctl(0x7fffffffcbe0,0x2,0x7fffffffd930,0x7fffffffcbd8,0x0,0x0) = 0 (0x0)
53699: getpid()                                  = 53699 (0xd1c3)
53699: sendto(7,"<30>1 2020-10-20T18:10:17.042791"...,109,0,NULL,0) = 109 (0x6d)
53699: sigprocmask(SIG_BLOCK,{ SIGALRM },{ })    = 0 (0x0)
53699: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
53699: sigprocmask(SIG_BLOCK,{ SIGALRM },{ })    = 0 (0x0)
53699: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
53699: sigprocmask(SIG_BLOCK,{ SIGALRM },{ })    = 0 (0x0)
53699: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
53699: ioctl(0,TIOCGETA,0x7fffffffd650)          = 0 (0x0)
53699: ioctl(0,TIOCSETA,0x7fffffffd650)          = 0 (0x0)
53699: ioctl(0,TIOCFLUSH,0x7fffffffd68c)         = 0 (0x0)
53699: ioctl(0,TIOCSETAF,0x800b0d230)            = 0 (0x0)
53699: fcntl(0,F_GETFL,)                         = 6 (0x6)
53699: fcntl(0,F_SETFL,O_RDWR)                   = 0 (0x0)
53699: sigprocmask(SIG_BLOCK,{ SIGALRM },{ })    = 0 (0x0)
53699: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
53699: ioctl(0,TIOCGPGRP,0x7fffffffd6bc)         = 0 (0x0)
53699: getpgrp()                                 = 53699 (0xd1c3)
53699: close(0)                                  = 0 (0x0)
53698: select(6,{ 5 },0x0,0x0,0x0)               = 1 (0x1)
53699: sigprocmask(SIG_BLOCK,{ SIGALRM },{ })    = 0 (0x0)
53698: read(5,0x7fffdfffcf81,4095)               = 0 (0x0)
53699: setitimer(0,{ 0.000000, 0.000000 },0x0)   = 0 (0x0)
53698: write(1,"\^[[0;33mWARN:   read returned 0"...,31) = 31 (0x1f)
53699: sigprocmask(SIG_SETMASK,{ SIGHUP|SIGINT|SIGQUIT|SIGILL|SIGTRAP|SIGABRT|SIGEMT|SIGFPE|SIGKILL|SIGBUS|SIGSEGV|SIGSYS|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ SIGALRM }) = 0 (0x0)
53699: sigaction(SIGALRM,{ SIG_IGN SA_RESTART ss_t },{ 0x800a29850 SA_RESTART|SA_SIGINFO ss_t }) = 0 (0x0)
53699: sigprocmask(SIG_SETMASK,{ SIGALRM },0x0)  = 0 (0x0)
53699: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
53699: __sysctl(0x7fffffffc620,0x2,0x7fffffffd370,0x7fffffffc618,0x0,0x0) = 0 (0x0)
53698: write(1,"\^[[0;0m",6)                     = 6 (0x6)
53699: getpid()                                  = 53699 (0xd1c3)
53698: select(6,{ 5 },0x0,0x0,0x0)               = 1 (0x1)
53699: sendto(7,"<30>1 2020-10-20T18:10:17.047132"...,150,0,NULL,0) = 150 (0x96)
53698: read(5,0x7fffdfffcf81,4095)               = 0 (0x0)

ehaupt avatar Oct 21 '20 07:10 ehaupt

Let me comment shortly on this:

On FreeBSD the user space ppp client is used instead of pppd.

We have observed several times that "some traffic" is needed to keep the tunnel alive (e.g. by running a ping through the tunnel in the background).

I have tried to address the problem in #613 with an attempt to implement tcp keepalive. I'm not sure if this would have any effect at all on FreeBSD, and on Linux I had the impression that this didn't change much - and with recent commits from other pull requests we see that tcp keepalive (at least on linux) is already used on a lower layer...

mrbaseman avatar Oct 21 '20 09:10 mrbaseman

This is the code of interest: https://github.com/adrienverge/openfortivpn/blob/075b2ae/src/io.c#L209-L215

You could try to break (instead of continue). Perhaps read returning 0 after select returning 1 is not expected in a normal situation.

But then why hasn't this ever been a problem on Linux? Maybe there's a difference between Linux and FreeBSD here, perhaps select returns -1 when the connection is lost on Linux. I'd have to investigate what happens on Linux - unfortunately timeouts don't happen before 12 hours at least on my workstation and with the Fortigate device I have access to.

DimitriPapadopoulos avatar Oct 21 '20 09:10 DimitriPapadopoulos

Just read the comment from @mrbaseman. The issue you report occurs when reading from PPP :

n = read(tunnel->pppd_pty, &buf[off_w], PKT_BUF_SZ - off_w);

The difference here is probably related to the different PPP implementations between Linux and FreeBSD.

The best way to get rid of such issues is to embark PPP code directly in openfortivpn instead of spawning pppd. That won't happen in the short term though!

DimitriPapadopoulos avatar Oct 21 '20 09:10 DimitriPapadopoulos

...unfortunately timeouts don't happen before 12 hours at least on my workstation and with the Fortigate device I have access to.

but when the 12 hours timeout on the Fortigate happens (independent of traffic going through the tunnel or not), the connection is closed and the user has to re-authenticate. The scenario here is something different, that the tunnel doesn't work anymore when no traffic at all goes through the tunnel for some (shorter) time, maybe 15 minutes or so... that's why I was looking into tcp keepalive - maybe on FreeBSD it's disabled (in contrast to linux) and https://github.com/adrienverge/openfortivpn/pull/613/commits/56f074bd56241e2faa049f88c189f3d41cca7561 solves the problem at least on FreeBSD? I must admit, however, that on MacOS X, my colleague didn't notice any improvement with this commit

mrbaseman avatar Oct 21 '20 09:10 mrbaseman

@mrbaseman Mmmh... I'm pointing out there are two issues here:

  1. How and where is the TCP keepalive implemented on Linux and not in FreeBSD? Perhaps the different PPP implementations? You know better than me here.
  2. Why doesn't openfortivpn exit properly and keeps looping when it hits the timeout? Are you saying that in this scenario openfortivpn shouldn't exit?

DimitriPapadopoulos avatar Oct 21 '20 10:10 DimitriPapadopoulos

Thank you @mrbaseman and @DimitriPapadopoulos for the backgorund information.

Changing L215 to break; will terminate openfortivpn properly when the idle timeout disconnect is reached. In my case I'll have to specify --persistent=0 (default) as a re-connect would require me to enter a 2FA token.

I realize that this is not the ideal solution but IMHO far better than the process being caught in a loop.

I can't speak for Linux but for FreeBSD breaking at this point would probably be the better solution. Maybe a __ifdef __FreeBSD__ macro would be safer.

As my idle timeout happens much faster (it seems 5 minutes) I'd gladly test other approaches.

ehaupt avatar Oct 21 '20 10:10 ehaupt

I suspect break is the proper solution for Linux too - but I fear it might break openfortivpn because of other bugs/issues - so it's hard to apply such a change. Perhaps implementing it only for FreeBSD as you suggest is the best compromise in the short term. I'll nevertheless apply the change on my Linux computer and test a little bit.

@ehaupt I'm afraid don't understand why you are referring to the --persistent option:

  • Indeed in the current --persistent implementation I believe you need to re-authenticate. But what happens with --persistent=1? Does authentication fail as currently expected, or something horrible?
  • Future implementations of --persistent might skip re-authentication and just re-use the authentication cookie from the initial authentication. Might work.

DimitriPapadopoulos avatar Oct 21 '20 10:10 DimitriPapadopoulos

Sorry for the confusion. With break; and --persistent set to 1 it will try to re-connect without prompting for the 2FA code. And therefore fail to authenticate in a loop like so:

DEBUG:  waitpid: ppp exit status code 0
DEBUG:  ppp: Has detached, or otherwise the connection was successfully established and terminated at the peer's request.
INFO:   Terminated ppp.
INFO:   Closed connection to gateway.
DEBUG:  server_addr: xxx.xxx.xxx.xxx
DEBUG:  server_port: 443
DEBUG:  gateway_addr: xxx.xxx.xxx.xxx
DEBUG:  gateway_port: 443
DEBUG:  Setting cipher list to: HIGH:!aNULL:!kRSA:!PSK:!SRP:!MD5:!RC4
DEBUG:  Setting minimum protocol version to: 0x303.
DEBUG:  Gateway certificate validation succeeded.
INFO:   Logged out.
DEBUG:  Resolving gateway host ip
DEBUG:  Establishing ssl connection
DEBUG:  server_addr: xxx.xxx.xxx.xxx
DEBUG:  server_port: 443
DEBUG:  gateway_addr: xxx.xxx.xxx.xxx
DEBUG:  gateway_port: 443
DEBUG:  Setting cipher list to: HIGH:!aNULL:!kRSA:!PSK:!SRP:!MD5:!RC4
DEBUG:  Setting minimum protocol version to: 0x303.
DEBUG:  Gateway certificate validation succeeded.
INFO:   Connected to gateway.
ERROR:  Could not authenticate to gateway. Please check the password, client certificate, etc.
DEBUG:  HTTP status code 405
INFO:   Closed connection to gateway.
DEBUG:  server_addr: xxx.xxx.xxx.xxx
DEBUG:  server_port: 443
DEBUG:  gateway_addr: xxx.xxx.xxx.xxx
DEBUG:  gateway_port: 443
DEBUG:  Setting cipher list to: HIGH:!aNULL:!kRSA:!PSK:!SRP:!MD5:!RC4
DEBUG:  Setting minimum protocol version to: 0x303.
DEBUG:  Gateway certificate validation succeeded.
INFO:   Logged out.
DEBUG:  Resolving gateway host ip
DEBUG:  Establishing ssl connection
DEBUG:  server_addr: xxx.xxx.xxx.xxx
DEBUG:  server_port: 443
DEBUG:  gateway_addr: xxx.xxx.xxx.xxx
DEBUG:  gateway_port: 443
DEBUG:  Setting cipher list to: HIGH:!aNULL:!kRSA:!PSK:!SRP:!MD5:!RC4
DEBUG:  Setting minimum protocol version to: 0x303.
DEBUG:  Gateway certificate validation succeeded.
INFO:   Connected to gateway.
ERROR:  Could not authenticate to gateway. Please check the password, client certificate, etc.
DEBUG:  HTTP status code 405
INFO:   Closed connection to gateway.
DEBUG:  server_addr: xxx.xxx.xxx.xxx
DEBUG:  server_port: 443
DEBUG:  gateway_addr: xxx.xxx.xxx.xxx
DEBUG:  gateway_port: 443
DEBUG:  Setting cipher list to: HIGH:!aNULL:!kRSA:!PSK:!SRP:!MD5:!RC4
DEBUG:  Setting minimum protocol version to: 0x303.
DEBUG:  Gateway certificate validation succeeded.
INFO:   Logged out.
^C*** Error code 1

In the case of breaking when read returned 0 is received and --persistent=1 is set it should maybe prompt again for the 2FA code if previously entered code fails.

ehaupt avatar Oct 21 '20 11:10 ehaupt

@DimitriPapadopoulos

1. How and where is the TCP keepalive implemented on Linux and not in FreeBSD? Perhaps the different PPP implementations? You know better than me here.

Here I was referring to the socket options, i.e. the defaults, which the operating system or the ssl library sets on tunnel->ssl_socket. When I have tried to enable tcp keepalive I have set the SO_KEEPALIVE. It is available on FreeBSD The observation, however, was that this didn't help at least on MacOSX. So I didn't have a clear idea how I should proceed further in #613 . When a timeout has happened to me (on linux) some weeks ago, I have seen the error messages (probably coming from pppd ?), which I understand that something below already implements some sort of keepalive mechanism. Maybe instead of guessing what might be happening, we should just put in some lines of debug code which print out the result of getsockopt - and see on which platforms tcp-keepalive is already enabled and where it's not...

2. Why doesn't _openfortivpn_ exit properly and keeps looping when it hits the timeout? Are you saying that in this scenario _openfortivpn_ shouldn't exit?

This could be related to how the io_loop is implemented, the pthread stuff and semaphores... changing continue to break as suggested above is perhaps not a wise move. It means we terminate the ppp connection just because the read operation didn't return a packet. The read may be interrupted e.g. by a signal, There are short reads, even on regular files they are allowed according to POSIX, and on sockets I'd expect them to happen more frequently.

Also, how does the client know that a timeout has occurred if the server has taken that decision? We can just wait if more data is received some time later, or we decide that the connection has been lost. With --persistent we can try to reauthenticate, but with 2FA this has no advantage over a loop around openfortivpn (unless we can re-use the cookie, but I doubt that this is possible when the server has closed the session)

mrbaseman avatar Oct 22 '20 10:10 mrbaseman

This could be related to how the io_loop is implemented, the pthread stuff and semaphores... changing continue to break as suggested above is perhaps not a wise move. It means we terminate the ppp connection just because the read operation didn't return a packet. The read may be interrupted e.g. by a signal, There are short reads, even on regular files they are allowed according to POSIX, and on sockets I'd expect them to happen more frequently.

Also, how does the client know that a timeout has occurred if the server has taken that decision? We can just wait if more data is received some time later, or we decide that the connection has been lost. With --persistent we can try to reauthenticate, but with 2FA this has no advantage over a loop around openfortivpn (unless we can re-use the cookie, but I doubt that this is possible when the server has closed the session)

Agreed, but in this case:

  • first select() returns > 0 otherwise it would have hit break or continue first (lines 201-208),
  • then read() returns 0 (lines 209-215).

Is that expected from tunnel->pppd_pty? I thought not and supposed this could be specific to the case where PPP exits because of a time-out or other similar error conditions.

See for example return value of select() call:

These conditions do not include I/O errors on the file descriptors over which select() is watching. If a network condition is involved and a cable disappears, one of two things will happen:

  1. The I/O will hang. Since select() is designed to show you which I/O has actually completed (or can start), it cannot help you detect this, except if you use the timeout capability of the select().
  2. The I/O will error out. In this case, select() will indicate that reading is possible on the file descriptor (even if you have it open only for writing), at which point you should do the read, and you'll get an error on the read.

DimitriPapadopoulos avatar Oct 22 '20 11:10 DimitriPapadopoulos

Oh, I see. You are right. Unfortunately, I can't spend much time on this at the moment. Just one more thing that I have just noticed: select() provides a timeout functionality which we currently don't use (this might be interesting for the discussion because timeout was mentioned before). Currently we let all threads work until one of them (probably one of the ssl worker threads?) asks for termination (while the other ones are stuck in the process of waiting for IO, right?). Perhaps we can do this more gracefully, but I'm not sure how much work this would mean.

mrbaseman avatar Oct 22 '20 13:10 mrbaseman

The observation, however, was that this didn't help at least on MacOSX. So I didn't have a clear idea how I should proceed further in #613 . When a timeout has happened to me (on linux) some weeks ago, I have seen the error messages (probably coming from pppd ?), which I understand that something below already implements some sort of keepalive mechanism. Maybe instead of guessing what might be happening, we should just put in some lines of debug code which print out the result of getsockopt - and see on which platforms tcp-keepalive is already enabled and where it's not...

I have opened a separate issue for the addition of getsockopt() to debug: #795.

DimitriPapadopoulos avatar Oct 22 '20 14:10 DimitriPapadopoulos

@mrbaseman:

Here I was referring to the socket options, i.e. the defaults, which the operating system or the ssl library sets on tunnel->ssl_socket. When I have tried to enable tcp keepalive I have set the SO_KEEPALIVE. It is available on FreeBSD The observation, however, was that this didn't help at least on MacOSX.

Maybe it's worth giving SO_KEEPALIVE another try on FreeBSD. I'll report back.

ehaupt avatar Oct 22 '20 16:10 ehaupt

I've tried https://github.com/adrienverge/openfortivpn/commit/56f074bd56241e2faa049f88c189f3d41cca7561 on FreeBSD and saw the same result. I'm afraid, as you said, SO_KEEPALIVE didn't help on FreeBSD either.

ehaupt avatar Oct 23 '20 11:10 ehaupt

According to #795 SO_KEEPALIVE does not seem to be set by default on Linux:

SO_KEEPALIVE: 0
TCP_KEEPIDLE: 7200
TCP_KEEPINTVL: 75
TCP_KEEPCNT: 9

or FreeBSD:

SO_KEEPALIVE: 0
TCP_KEEPIDLE: 7200
TCP_KEEPINTVL: 75
TCP_KEEPCNT: 8

Yet it works fine on Linux without openfortivpn explicitly modifying these parameters. Strange.

  1. While reading the TCP(7) manual page, it occurred to me SO_SNDBUF and SO_RCVBUF might also be of interest. I'll modify #799 to print them too.
  2. I would suggest trying to set the following values, it looks like it's what FortiClient does on Linux:
SO_KEEPALIVE: 1
TCP_KEEPIDLE: 45
TCP_KEEPINTVL: 45

DimitriPapadopoulos avatar Oct 25 '20 11:10 DimitriPapadopoulos

I confirm the same issue on FreeBSD 12 and openfortivpn 1.17.1. No luck with PPP echo options (LQR or LCP) in /etc/ppp, they seem useless. SO_KEEPALIVE is 0, how should we try SO_KEEPALIVE: 1? @ehaupt does this patch test what @DimitriPapadopoulos suggests to try? Thank you in advance.

udanieli avatar Oct 27 '21 18:10 udanieli

Yes, I would give this patch a try.

You may also try OpenConnect. Fortinet support is a recent addition so you would need to compile the @master branch. If you do try OpenConnect, please do provide some feedback (here if it works or open an issue on the OpenConnect site if it doesn't work).

DimitriPapadopoulos avatar Oct 28 '21 05:10 DimitriPapadopoulos

I had a look at the source code and the patch. I submitted a PR #953 and pushed a branch with the SO_KEEPALIVE option if you want to test. TCP_KEEPIDLE and TCP_KEEPINTVL can be tweaked with sysctl, e.g.

# sysctl net.inet.tcp.keepintvl=27000

Like @ehaupt, I had no luck with SO_KEEPALIVE enabled, so I will try OpenConnect and see what happens. Well, I would like to test openfortivpn on a Linux machine first to see why it does not disconnect there.

udanieli avatar Oct 28 '21 11:10 udanieli

This is the code of interest: https://github.com/adrienverge/openfortivpn/blob/075b2ae/src/io.c#L209-L215

You could try to break (instead of continue). Perhaps read returning 0 after select returning 1 is not expected in a normal situation.

This was right on the money. read() returning 0 indicates EOF, so there is no reason to continue reading from it.

But then why hasn't this ever been a problem on Linux? Maybe there's a difference between Linux and FreeBSD here, perhaps select returns -1 when the connection is lost on Linux. I'd have to investigate what happens on Linux - unfortunately timeouts don't happen before 12 hours at least on my workstation and with the Fortigate device I have access to.

This is not a problem on Linux, because Linux returns -1 on the master fd of the pty when slave side fd has been closed. It's a bit odd, but based on cursory reading of POSIX, treating a returned 0 shouldn't cause any problems on Linux, either.

If you want to dig further into this than I did, here are some hints: https://lists.apple.com/archives/darwin-kernel/2010/Apr/msg00025.html

Koston-0xDEADBEEF avatar Jul 02 '23 16:07 Koston-0xDEADBEEF

I have this issue as well. Some of you have reached to obvious conclusions that read 0 is EOF from ppp process. TCP_KEEPALIVE wont work since its network stack only? It will never reach applications. I.e the fortgate server. Perhaps some application level keep alive instead? But since there is not support for that in fortigate.

Checking the ppp.log I get this info regarding why ppp disconnects: Aug 5 01:34:42 yoda ppp[83349]: tun0: Phase: Idle timer expired

Which is strange since I have a ping process though the tunnel.

mattiasbarthel avatar Aug 05 '23 05:08 mattiasbarthel

@mattiasbarthel Thank you for looking into it. This may mean it's a ppp issue. One more reason to try the -tun option which uses internal PPP code over a TUN device, instead of spawning pppd or ppp.

DimitriPapadopoulos avatar Aug 05 '23 10:08 DimitriPapadopoulos