libnetconf2 icon indicating copy to clipboard operation
libnetconf2 copied to clipboard

call home error: socket couldn't be created

Open peter-summer opened this issue 5 years ago • 14 comments

In my environment, call home failed because of network problem. Following log is printed repeatedly.

Jan 3 12:22:12 localhost netopeer2-server[9606]: Could not set TCP_NODELAY socket option (Bad file descriptor). Jan 3 12:22:18 localhost netopeer2-server[9606]: Could not set TCP_NODELAY socket option (Bad file descriptor). Jan 3 12:22:24 localhost netopeer2-server[9606]: Could not set TCP_NODELAY socket option (Bad file descriptor).

For a short time, following log is started to be printed repeatedly.

Jan 3 16:21:49 localhost netopeer2-server[9606]: socket couldn't be created. Jan 3 16:21:50 localhost netopeer2-server[9606]: socket couldn't be created. Jan 3 16:21:51 localhost netopeer2-server[9606]: socket couldn't be created. Jan 3 16:21:52 localhost netopeer2-server[9606]: socket couldn't be created. Jan 3 16:21:53 localhost netopeer2-server[9606]: socket couldn't be created. Jan 3 16:21:54 localhost netopeer2-server[9606]: socket couldn't be created. Jan 3 16:21:55 localhost netopeer2-server[9606]: socket couldn't be created. Jan 3 16:21:56 localhost netopeer2-server[9606]: socket couldn't be created. Jan 3 16:21:57 localhost netopeer2-server[9606]: socket couldn't be created. Jan 3 16:21:58 localhost netopeer2-server[9606]: socket couldn't be created. Jan 3 16:21:59 localhost netopeer2-server[9606]: socket couldn't be created. Jan 3 16:22:00 localhost netopeer2-server[9606]: socket couldn't be created.

Then when the network problem is solved, the call home still failed and the above "socket couldn't be created" is still printed.

Finally,I stop and restart netopeer2-server, call home successed.

peter-summer avatar Jan 06 '20 07:01 peter-summer

Hi, you are using quite an old version of libnetconf2 and the problem may have been fixed by f886568e7386078226c549c37fe2dd0f11743ad7, so please try to update it.

Regards, Michal

michalvasko avatar Jan 06 '20 08:01 michalvasko

Hi, I have updated f886568. But the problem is still existed.The reason of socket failure is "Too many open files".

Jan 19 19:39:12 localhost netopeer2-server[9922]: Could not set TCP_NODELAY socket option (Bad file descriptor).
Jan 19 19:39:13 localhost netopeer2-server[9922]: Call Home client "controller3" was not found.
Jan 19 19:39:26 localhost netopeer2-server[9922]: Could not set TCP_NODELAY socket option (Bad file descriptor).
Jan 19 19:39:32 localhost netopeer2-server[9922]: Could not set TCP_NODELAY socket option (Bad file descriptor).
Jan 19 19:39:33 localhost netopeer2-server[9922]: Call Home client "controller1" was not found.
Jan 19 19:39:41 localhost netopeer2-server[9922]: socket couldn't be created.(Too many open files)
Jan 19 19:39:42 localhost netopeer2-server[9922]: socket couldn't be created.(Too many open files)
Jan 19 19:39:43 localhost netopeer2-server[9922]: socket couldn't be created.(Too many open files)
Jan 19 19:39:44 localhost netopeer2-server[9922]: socket couldn't be created.(Too many open files)
Jan 19 19:39:45 localhost netopeer2-server[9922]: socket couldn't be created.(Too many open files)
Jan 19 19:39:46 localhost netopeer2-server[9922]: socket couldn't be created.(Too many open files)
Jan 19 19:39:47 localhost netopeer2-server[9922]: socket couldn't be created.(Too many open files)
Jan 19 19:39:48 localhost netopeer2-server[9922]: socket couldn't be created.(Too many open files)
Jan 19 19:39:49 localhost netopeer2-server[9922]: socket couldn't be created.(Too many open files)
Jan 19 19:39:50 localhost netopeer2-server[9922]: socket couldn't be created.(Too many open files)

Does it need to add ‘close(*sock_pending)’ in following code in function ‘nc_sock_connect’?

int
nc_sock_connect(const char *host, uint16_t port, int timeout, int *sock_pending, char **ip_host)
{
......

error:
    if (sock != -1) {
        close(sock);
    }
    if (sock_pending) {
        *sock_pending = -1;
    }
    return -1;
}

peter-summer avatar Jan 21 '20 02:01 peter-summer

Hi, what do you mean, the socket is now by closed, it was added in the commit you referenced. There quite a lot of errors in the output so I am not sure what exactly are you doing but with some more information I am not able to help you, the code seems fine to me and all unused sockets should be properly closed.

Regards, Michal

michalvasko avatar Jan 21 '20 08:01 michalvasko

Hi, I have exactly the same issue.

I started the call-home procedure when port 4334 was not opened on the client side and got following logs continuously printed:

[INF]: LN: Trying to connect via IPv4 to 192.168.111.3:4334.
[INF]: LN: Timed out after 5s (Operation now in progress).
[INF]: LN: Successfully connected to 192.168.111.3:4334 over IPv4.
[ERR]: LN: Could not set TCP_NODELAY socket option (Bad file descriptor).

After some time (it depends on the amount of files which could be opened by the one process), logs changed to

[ERR]: LN: Accept failed (Too many open files).
[ERR]: LN: Socket could not be created (Too many open files).

@michalvasko, you're absolutelly right that socket should not be closed if it was not even openned. But, I'm a bit suspicious of the following code:

    if (ret == 0) {
        /* there was a timeout */
        VRB("Timed out after %ds (%s).", timeout, strerror(errno));
        if (sock_pending) {
            /* no sock-close, we'll try it again */
            *sock_pending = sock;
        } else {
            close(sock);
        }
        return -1;
    }

It seems to me that it might be needed to close old pending socket before assigning new value to it:

    if (ret == 0) {
        /* there was a timeout */
        VRB("Timed out after %ds (%s).", timeout, strerror(errno));
        if (sock_pending) {
            /* no sock-close, we'll try it again */
             if (*sock_pending!= -1)                                               <============ added code
                  close(*sock_pending);                                             <============ added code
            *sock_pending = sock;
        } else {
            close(sock);
        }
        return -1;
    }

Anyway, problem definetelly exists... Shall I create new issue or we may continue here?

Thanks in advance.

jabilmplane avatar Mar 02 '21 10:03 jabilmplane

One more note.

In the netstat output there are duplicated items (client's ip address is 192.168.111.3) :

engineer@sub6:~$ netstat -a | grep 192.168.111.8
tcp        0      1 192.168.111.8:44910     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44874     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44898     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44914     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44884     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44888     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44894     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44880     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44904     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44906     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44908     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44900     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44876     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44892     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44872     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44886     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44882     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44896     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44912     192.168.111.3:4334      SYN_SENT
tcp        0      0 192.168.111.8:ssh       192.168.111.3:53440     ESTABLISHED
tcp        0      1 192.168.111.8:44890     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44902     192.168.111.3:4334      SYN_SENT
tcp        0      1 192.168.111.8:44878     192.168.111.3:4334      SYN_SENT
udp        0      0 192.168.111.8:ntp       0.0.0.0:*


jabilmplane avatar Mar 02 '21 10:03 jabilmplane

@michalvasko

I did some investigation and found that the root cause is similar to the one I mentioned. It is almost the same code but in the different place: end of the nc_sock_connect function. Please take a look at my diff:

spasichn@spasichn-VirtualBox:~/cesnet/libnetconf2/build$ git diff
diff --git a/src/session_client.c b/src/session_client.c
index 814f375..6e0cb24 100644
--- a/src/session_client.c
+++ b/src/session_client.c
@@ -1428,6 +1428,9 @@ error:
         close(sock);
     }
     if (sock_pending) {
+        if (*sock_pending != -1) {
+            close(*sock_pending);
+        }
         *sock_pending = -1;
     }
     return -1;

Seems that it works fine on my VM. Shall I create a pull request?

jabilmplane avatar Mar 02 '21 11:03 jabilmplane

First, what libnetconf2 version are you using? If it is not the latest devel, switch to it and try again. As for the fix, when error is being executed, the sock and sock_pending should be one socket I believe (looking at devel) so this fix does not really make sense. I would also like to know the exact reason for

[ERR]: LN: Could not set TCP_NODELAY socket option (Bad file descriptor).

michalvasko avatar Mar 02 '21 12:03 michalvasko

Correct, I use pretty old version. But I have checked diff between recent devel and code looks pretty similar.

jabilmplane avatar Mar 02 '21 12:03 jabilmplane

[ERR]: LN: Could not set TCP_NODELAY socket option (Bad file descriptor).

is printed when port 4334 is disabled on the client's side

jabilmplane avatar Mar 02 '21 12:03 jabilmplane

But I have checked diff between recent devel and code looks pretty similar.

Like I said, update it before we can talk about any fixes. I have heard this (and thought it myself) too many times.

when port 4334 is disabled on the client's side

But the message before says that is was successfully connected, how is that possible if the port is disabled? Actually, what do you mean by "disabled"? Is the client is not listening on it, it is impossible to connect.

michalvasko avatar Mar 02 '21 12:03 michalvasko

@michalvasko, you are right. It seems to work properly on the latest devel.

[INF]: LN: Trying to connect via IPv4 to 192.168.0.107:4334.
[INF]: LN: Timed out after 5s (Operation now in progress).
[INF]: LN: Trying to connect the pending socket 34.
[INF]: LN: Timed out after 5s (Operation now in progress).
[INF]: LN: Trying to connect the pending socket 34.
[INF]: LN: Timed out after 5s (Operation now in progress).
[INF]: LN: Trying to connect the pending socket 34.
[INF]: LN: Timed out after 5s (Operation now in progress).
[INF]: LN: Trying to connect the pending socket 34.
[INF]: LN: Timed out after 5s (Operation now in progress).
[ERR]: LN: Failed to connect socket 34 after 5 retries, closing.
[INF]: LN: Trying to connect via IPv4 to 192.168.0.107:4334.
[INF]: LN: Timed out after 5s (Operation now in progress).
[INF]: LN: Trying to connect the pending socket 34.
[INF]: LN: getsockopt() error (Connection refused).
[INF]: LN: Trying to connect via IPv4 to 192.168.0.107:4334.
[INF]: LN: getsockopt() error (Connection refused).
[INF]: LN: Trying to connect via IPv4 to 192.168.0.107:4334.
[INF]: LN: getsockopt() error (Connection refused).
[INF]: LN: Trying to connect via IPv4 to 192.168.0.107:4334.
[INF]: LN: getsockopt() error (Connection refused).
[ERR]: LN: Failed to connect socket -1 after 5 retries, closing.
[INF]: LN: Trying to connect via IPv4 to 192.168.0.107:4334.
[INF]: LN: getsockopt() error (Connection refused).

Disabled port means that it is disabled in the linux.

jabilmplane avatar Mar 02 '21 12:03 jabilmplane

Disabled port means that it is disabled in the linux.

That is not too helpful but I would guess the effect of that is that no process can listen on the port. But that did not occur in your case because the socket was connected successfully.

Or you are saying you have disabled the port right after it connected? That would make some sense and it would also explain why in your last output you first do connect and not the next time.

michalvasko avatar Mar 02 '21 15:03 michalvasko

It is possible to configure some rules using "iptables" linux tool. In my configuration, if I remove rules for port 4334, any connection to this port will be dropped. But ip address is still reachable. In this scenario, old implementation of netopeer2-server is able to open the socket but fails to set TCP_NODELAY option.

-P INPUT DROP
-P FORWARD DROP
-P OUTPUT ACCEPT
-A INPUT -p tcp -m tcp --sport 4334 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 4334 -j ACCEPT

jabilmplane avatar Mar 02 '21 15:03 jabilmplane

You could have mentioned iptables right away (I would use the term blocked instead of disabled). OK, I see.

michalvasko avatar Mar 03 '21 06:03 michalvasko