libnetconf in deadlock during handshake

Open ntadas opened this issue 8 years ago • 9 comments

Hi

I have libnetconf configured with SSH and TLS disabled, and I'm using netopeer to connect to the server.

Sometimes during the capability exchange the server enters an endless loop. The loop is inside the function nc_session_read_until. It starts with if (session->fd_input != -1) and is able to read:

<?xml version="1.0" encoding="UTF-8"?>
<hello xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
<capabilities>
<capability>urn:ietf:params:netconf:base:1.0</capability>
<capability>urn:ietf:params:netconf:base:1.1</capability>
.some specific model capabilities......
<capa

It stops at the characters shown above, and read then starts returning -1. The code in this function is:

c = read (session->fd_input, &(buf[rd]), 1);
if (c == -1) {
    if (errno == EAGAIN) {
        usleep (NC_READ_SLEEP);
        continue;
    }

So it keeps retrying the read and always continues, with no break condition.

Actually I have two issues here:
1- the infinite loop, which I think should be fixed
2- why the read fails in the middle of the capability exchange; I have no idea about this one.

Best Regards

ntadas avatar Mar 02 '16 07:03 ntadas

Hi, what is your setup again? You have libnetconf (the newest version I assume) with SSH and TLS disabled and are using netopeer-cli to connect to netopeer-server? You would not be able to compile netopeer-server with libnetconf that has both SSH and TLS disabled, so please elaborate, thank you.

Regards, Michal

michalvasko avatar Mar 02 '16 07:03 michalvasko

Hi, I'm using the latest libnetconf code, downloaded last night. I have my own server, built following the instructions on the libnetconf site (libnetconf compiled without TLS and without SSH), and I'm using a custom database. I have netopeer-cli as the client.

Most of the time I'm able to connect to the server and do get, get-config, edit-config, etc., but sometimes I hit the problem described above.

ntadas avatar Mar 02 '16 07:03 ntadas

Hi, in that case it is quite difficult for us to help you; we can only guess what the problem might be. But it loops indefinitely because it awaits more data on a non-blocking socket, and that data seems to have been lost somewhere (or never sent). I don't think I can help you more.
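To illustrate the point about the non-blocking socket, here is a minimal sketch of the outcomes a one-byte read() can have on such a descriptor. The names make_nonblocking, try_read_byte and the read_status enum are hypothetical and are not part of libnetconf. The key distinction is that EAGAIN only means "nothing to read right now", which is why a loop that retries on EAGAIN alone never ends if the rest of the message never arrives.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical result codes, used only for this illustration. */
enum read_status {
    READ_GOT_BYTE,     /* one byte was stored in *out */
    READ_NO_DATA_YET,  /* EAGAIN/EWOULDBLOCK: nothing available right now */
    READ_PEER_CLOSED,  /* read() returned 0: the writer closed its end */
    READ_FAILED        /* any other error */
};

/* Put a descriptor into non-blocking mode; returns 0 on success, -1 on error. */
int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Try to read a single byte and report which of the cases occurred. */
enum read_status try_read_byte(int fd, char *out)
{
    ssize_t c;

    /* Retry only when the call was interrupted by a signal. */
    do {
        c = read(fd, out, 1);
    } while (c == -1 && errno == EINTR);

    if (c == 1) {
        return READ_GOT_BYTE;
    }
    if (c == 0) {
        return READ_PEER_CLOSED;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        return READ_NO_DATA_YET;
    }
    return READ_FAILED;
}
```

In the reported case the descriptor would keep yielding READ_NO_DATA_YET: the peer is still connected, it just never sends the rest of the hello message.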

Regards, Michal

michalvasko avatar Mar 02 '16 08:03 michalvasko

Hi

But shouldn't this loop have a timeout? What more can I provide so that you can help with this issue?

Regards

ntadas avatar Mar 02 '16 08:03 ntadas

Where did the file descriptor come from? Since libnetconf has neither SSH nor TLS, I guess you have some standalone SSH/TLS server that forwards data through this file descriptor to your NETCONF server (libnetconf), right? You should investigate this connection.

rkrejci avatar Mar 02 '16 08:03 rkrejci

Yes, I'm using the SSH server on my Linux machine, and I have a small SSH subsystem that only connects the input to the output and vice versa, so that the client and the server can talk. For this test I'm running the server and the client on the same machine, so I'm doing an ssh to localhost and the subsystem is talking to the server via an AF_UNIX socket. I don't think the problem is in the connection, since it is a local connection (but I'll investigate this too).

Independently of what is causing this (of course I still need to find it), I think the server shouldn't stay in an infinite loop. When this happens the connection thread is blocked forever and no one else is able to connect to the server.

ntadas avatar Mar 02 '16 08:03 ntadas

I actually agree. The thing is that the timeout in the nc_session_recv_*() functions is intended for waiting for data to start arriving, and here part of the data has already come. But the client (the SSH subsystem in your case) did not send a complete NETCONF message. It is actually a kind of DoS attack, which comes from an authenticated client. The problem with any timeout here is the case of slow connections (e.g. satellite), where the delay can be quite big, so the timeout must be longer than the timeouts in the nc_session_recv_*() functions.

So, our proposed solution is to add a separate timeout, 30 seconds by default, configurable via the configure script (so it is a constant for the compiled libnetconf). The timeout is reset whenever any data is received (so the situation can repeat while receiving a single message and the total delay can be much longer, but in that case it is a problem of the connection).
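A minimal sketch of the inactivity timeout proposed above, not the actual libnetconf change. The names read_until_with_timeout, set_nonblocking, READ_TIMED_OUT and READ_INACTIVITY_TIMEOUT_MS are hypothetical, and the sketch swaps the usleep() retry for poll() so that the wait is bounded. Each poll() call starts a fresh wait, so the timer is effectively reset whenever a byte arrives.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical default: 30 s, fixed at build time as proposed in the thread. */
#define READ_INACTIVITY_TIMEOUT_MS (30 * 1000)
/* Hypothetical return code for "no data arrived within the timeout". */
#define READ_TIMED_OUT (-2)

/* Put a descriptor into non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Read byte by byte from a non-blocking fd until the text in buf ends with
 * endtag. Returns the number of bytes read, READ_TIMED_OUT if nothing arrived
 * for timeout_ms, or -1 on error, EOF, or a full buffer.
 */
ssize_t read_until_with_timeout(int fd, char *buf, size_t buflen,
                                const char *endtag, int timeout_ms)
{
    size_t rd = 0;
    size_t taglen = strlen(endtag);
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    while (rd + 1 < buflen) {
        ssize_t c = read(fd, &buf[rd], 1);

        if (c == 1) {
            rd++;
            buf[rd] = '\0';
            if (rd >= taglen && memcmp(&buf[rd - taglen], endtag, taglen) == 0) {
                return (ssize_t)rd; /* complete message received */
            }
            continue;
        }
        if (c == 0) {
            return -1; /* peer closed in the middle of a message */
        }
        if (errno == EINTR) {
            continue; /* interrupted by a signal, just retry */
        }
        if (errno != EAGAIN && errno != EWOULDBLOCK) {
            return -1; /* real read error */
        }

        /* No data right now: wait for more, but give up after timeout_ms. */
        int r = poll(&pfd, 1, timeout_ms);
        if (r == 0) {
            return READ_TIMED_OUT;
        }
        if (r < 0 && errno != EINTR) {
            return -1;
        }
    }
    return -1; /* buffer full before endtag was seen */
}
```

A caller could pass READ_INACTIVITY_TIMEOUT_MS and close the session when READ_TIMED_OUT comes back, which frees the connection thread instead of leaving it spinning on EAGAIN.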

What do you think?

rkrejci avatar Mar 02 '16 08:03 rkrejci

Seems reasonable. I'll continue investigating why the message stops before it's finished. When I find something I'll post it here. Thanks.

ntadas avatar Mar 02 '16 08:03 ntadas

I've found the cause of the missing data: it was a problem in my application. In short, a thread was blocked and we couldn't receive all the data. The infinite loop is not so critical for me now, but nevertheless I think it should be fixed if possible, to avoid other issues. Thanks for your fast support.

Regards

ntadas avatar Mar 02 '16 10:03 ntadas