callhome getting a timeout after 10 mins
Hi Michal,
I am implemented callhome feature where I am using API's of Libnetconf2. I am noticing that every 10 mins the device is getting disconnected . Can you tell me , if it can be made persistent i.e I dont want to get it disconnected. Also I want to know , in which of the files this 10 min timer is present , so that If I want to extend it , I can do .
PS: I am using latest libnetconf2 tag (v0.12-r2) , ".so" gets created with name libnetconf2.so.0.12.55
Thanks & Regards, Vishal
Hi Vishal, could you please provide some more information? What does the client/server prints, so you at least know whether the server terminates the session or the client? Also, you are hardly using the latest libnetconf2 tag, we have released version 1.1.3 with the same tag.
Regards, Michal
Hi Michal, At the server end we have set a callback which calls nc_ps_poll(nc_ps, -1, &ncs); after 10 mins nc_ps_poll returns "NC_PSPOLL_SESSION_TERM" and disconnects the callhome device
As the client we are using ODL , Logs at the ODL says : 6:55:02.764 INFO [nioEventLoopGroup-4-56] Removing Uri{_value=NodeCallHome3} from Netconf Topology. 16:55:02.764 WARN [nioEventLoopGroup-4-56] RemoteDevice{NodeCallHome3}: Session terminated Session closed 16:55:02.770 INFO [nioEventLoopGroup-4-56] Removing Uri{_value=NodeCallHome3} from Netconf Topology. 16:55:02.773 INFO [opendaylight-cluster-data-notification-dispatcher-185] Setting failed status for callhome device id:Uri{_value=NodeCallHome3}. 16:55:02.773 INFO [opendaylight-cluster-data-notification-dispatcher-181] Setting failed status for callhome device id:Uri{_value=NodeCallHome3}. 16:55:02.773 INFO [opendaylight-cluster-data-notification-dispatcher-184] Setting failed status for callhome device id:Uri{_value=NodeCallHome3}. 16:55:02.774 WARN [nioEventLoopGroup-4-56] RemoteDevice{NodeCallHome3}: Session went down java.io.IOException: End of input detected. Close the session. at org.opendaylight.netconf.nettyutil.AbstractNetconfSession.endOfInput(AbstractNetconfSession.java:94) [281:org.opendaylight.netconf.netty-util:1.6.1] at org.opendaylight.netconf.nettyutil.AbstractNetconfSession.channelInactive(AbstractNetconfSession.java:174) [281:org.opendaylight.netconf.netty-util:1.6.1] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242) [104:io.netty.transport:4.1.34.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228) [104:io.netty.transport:4.1.34.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221) [104:io.netty.transport:4.1.34.Final] at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390) [99:io.netty.codec:4.1.34.Final] at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355) [99:io.netty.codec:4.1.34.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242) [104:io.netty.transport:4.1.34.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228) [104:io.netty.transport:4.1.34.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221) [104:io.netty.transport:4.1.34.Final] at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390) [99:io.netty.codec:4.1.34.Final] at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355) [99:io.netty.codec:4.1.34.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242) [104:io.netty.transport:4.1.34.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228) [104:io.netty.transport:4.1.34.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221) [104:io.netty.transport:4.1.34.Final] at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1403) [104:io.netty.transport:4.1.34.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242) [104:io.netty.transport:4.1.34.Final] at io.netty.channel.AbstractChannelHandlerContext.access$300(AbstractChannelHandlerContext.java:38) [104:io.netty.transport:4.1.34.Final] at io.netty.channel.AbstractChannelHandlerContext$4.run(AbstractChannelHandlerContext.java:233) [104:io.netty.transport:4.1.34.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [101:io.netty.common:4.1.34.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404) [101:io.netty.common:4.1.34.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:495) [104:io.netty.transport:4.1.34.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905) [101:io.netty.common:4.1.34.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [101:io.netty.common:4.1.34.Final] at java.lang.Thread.run(Thread.java:745) [?:?]
Regards, Vishal
Hi Vishal,
okay, it would seem the server is terminating the connection, could you enable some (-v2) messages? The server should print the reason for why it is disconnecting. But the only default 10 min timeout I found in the old code was hello timeout. If that is the one elapsed, there is a problem with the client, but it is too soon to say, please increase verbosity of the server and then I can help you more.
Regards, Michal
Hi Michal,
I am not using netopeer so -v2 i cant do But I have added verbosity to the logs using the api's 'nc_verbosity(NC_VERB_DEBUG)' and a callback function added using 'nc_set_print_clb()' Following are the logs after enabling verbose mode:
After 10 mins logs:
Jan 21 04:43:45 6 2 0 netconfAgent:NoName D [19:39:59] LibNetConfOriginator.cpp:264 Jan 21 04:43:45 6 2 0 netconfAgent:NoName D LIBNETCONF : Session 1: SSH channel poll error (Received SSH_MSG_DISCONNECT: 2:User session has timed out idling after 600000 ms.). Jan 21 04:43:45 6 2 0 netconfAgent:NoName D [19:39:59] LibNetConfOriginator.cpp:126 Jan 21 04:43:45 6 2 0 netconfAgent:NoName D Got nc_poll_sess_termination.Deleting and exiting the worker thread Jan 21 04:43:46 6 2 0 netconfAgent:NoName D [19:40:00] LibNetConfOriginator.cpp:105 Jan 21 04:43:46 6 2 0 netconfAgent:NoName D Inside free_ds
After a min again logs:
Jan 21 04:44:26 6 2 0 netconfAgent:NoName D LIBNETCONF : Session 1: Call Home thread failed to wake up in a timely manner, fatal synchronization problem. Jan 21 04:44:26 6 2 0 netconfAgent:NoName D [19:40:40] LibNetConfOriginator.cpp:264 Jan 21 04:44:26 6 2 0 netconfAgent:NoName D LIBNETCONF : Pthread condition timedwait failed (Operation not permitted). Jan 21 04:44:26 6 2 0 netconfAgent:NoName D [19:40:40] LibNetConfOriginator.cpp:264 Jan 21 04:44:26 6 2 0 netconfAgent:NoName D LIBNETCONF : Call Home client "OdlClient" thread exit
Inference
It seems that, the ssh is disconnecting because keepalive is not there which should tell the ssh to be online. So is there a way in libnetconf2 to send keepalive messages to ssh which will not stop the termination of connection?
version of libssh used - libssh-0.9.0
Thanks & Regards, Vishal
Hi Vishal,
I was right, this is not a libnetconf2 timeout but an SSH one. Keep-alive should be enabled for all the sockets (both server and client) but it is on TCP-level, this seems like a specific SSH one. You will have to configure this yourself. It may even be in the configuration of your distribution, libssh applies this configuration since version 0.9.0 (files in /etc/ssh).
Regards, Michal
Hi Michal,
As per your suggestion, and surfing the internet I found out that, we can add "ClientAliveInterval"/"ServerAliveInterval" option to /etc/ssh/sshd_config file so that the the session can be alive
Reference : - https://unix.stackexchange.com/questions/3026/what-options-serveraliveinterval-and-clientaliveinterval-in-sshd-config-exac
But this din't work at all . After looking through the libssh code(0.9.0) I found out that the above two options are not present and only following are the options available:
enum ssh_bind_config_opcode_e {
32 /* Known but not allowed in Match block */
33 BIND_CFG_NOT_ALLOWED_IN_MATCH = -4,
34 /* Unknown opcode */
35 BIND_CFG_UNKNOWN = -3,
36 /* Known and not applicable to libssh */
37 BIND_CFG_NA = -2,
38 /* Known but not supported by current libssh version */
39 BIND_CFG_UNSUPPORTED = -1,
40 BIND_CFG_INCLUDE,
41 BIND_CFG_HOSTKEY,
42 BIND_CFG_LISTENADDRESS,
43 BIND_CFG_PORT,
44 BIND_CFG_LOGLEVEL,
45 BIND_CFG_CIPHERS,
46 BIND_CFG_MACS,
47 BIND_CFG_KEXALGORITHMS,
48 BIND_CFG_MATCH,
49 BIND_CFG_PUBKEY_ACCEPTED_KEY_TYPES,
50 BIND_CFG_HOSTKEY_ALGORITHMS,
51
52 BIND_CFG_MAX /* Keep this one last in the list */
53 };
Also when I enabled the logs in libssh and redirected it to print it to file following are the results I got:
Unknown option: Protocol, line: 3
Unknown option: Banner, line: 4
Unknown option: RhostsRSAAuthentication, line: 5
Unknown option: HostbasedAuthentication, line: 6
Unknown option: PasswordAuthentication, line: 7
Unknown option: PermitEmptyPasswords, line: 8
Unknown option: ChallengeResponseAuthentication, line: 9
Unknown option: AllowTcpForwarding, line: 10
Unknown option: PidFile, line: 12
Unknown option: TCPKeepAlive, line: 13
Unknown option: ClientAliveCountMax, line: 14
Unknown option: ClientAliveInterval, line: 15
Unknown option: MaxStartups, line: 16
Unknown option: ForceCommand, line: 46
Count 46
Poll callback on socket 12 (POLLOUT ), out buffer 22
Received POLLOUT in connecting state
ssh_handle_key_exchange: current state : 2
Poll callback on socket 12 (POLLOUT ), out buffer 22
Enabling POLLOUT for socket
ssh_handle_key_exchange: current state : 2
Poll callback on socket 12 (POLLIN POLLOUT ), out buffer 0
Received banner: SSH-2.0-OpenSSH_5.3
SSH client banner: SSH-2.0-OpenSSH_5.3
Analyzing banner: SSH-2.0-OpenSSH_5.3
We are talking to an OpenSSH client version: 5.3 (50300)
kex algos: ecdh-sha2-nistp256,diffie-hellman-group14-sha1,diffie-hellman-group-exchange-sha1
server host key algo: rsa-sha2-512,rsa-sha2-256,ssh-rsa
encryption client->server: aes128-ctr,aes256-ctr,aes192-ctr,[email protected],[email protected],[email protected]
encryption server->client: aes128-ctr,aes256-ctr,aes192-ctr,[email protected],[email protected],[email protected]
mac algo client->server: hmac-sha2-512,hmac-sha2-256,[email protected],[email protected],hmac-sha1,[email protected]
mac algo server->client: hmac-sha2-512,hmac-sha2-256,[email protected],[email protected],hmac-sha1,[email protected]
compression algo client->server: none
compression algo server->client: none
languages client->server:
languages server->client:
packet: wrote [type=20, len=652, padding_size=5, comp=646, payload=646]
SSH_MSG_KEXINIT sent
Enabling POLLOUT for socket
ssh_handle_key_exchange: current state : 4
Poll callback on socket 12 (POLLIN POLLOUT ), out buffer 0
packet: read type 20 [len=868,padding=8,comp=859,payload=859]
Dispatching handler for packet type 20
kex algos: diffie-hellman-group-exchange-sha1
server host key algo: [email protected],[email protected],[email protected],[email protected],ssh-rsa,ssh-dss
encryption client->server: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]
encryption server->client: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]
mac algo client->server: hmac-md5,hmac-sha1,[email protected],hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96
mac algo server->client: hmac-md5,hmac-sha1,[email protected],hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96
compression algo client->server: none,[email protected],zlib
compression algo server->client: none,[email protected],zlib
languages client->server:
languages server->client:
Negotiated diffie-hellman-group-exchange-sha1,ssh-rsa,aes128-ctr,aes128-ctr,hmac-sha1,hmac-sha1,none,none,,
Set output algorithm aes128-ctr
Set HMAC output algorithm to hmac-sha1
Set input algorithm aes128-ctr
Set HMAC input algorithm to hmac-sha1
sending control flow event
sending channel_write_wontblock callback
ssh_handle_key_exchange: current state : 6
Poll callback on socket 12 (POLLOUT ), out buffer 0
sending control flow event
sending channel_write_wontblock callback
ssh_handle_key_exchange: current state : 6
ssh_handle_key_exchange: current state : 6
ssh_handle_key_exchange: current state : 6
ssh_handle_key_exchange: current state : 6
ssh_handle_key_exchange: current state : 6
ssh_handle_key_exchange: current state : 6
ssh_handle_key_exchange: current state : 6
ssh_handle_key_exchange: current state : 6
Poll callback on socket 12 (POLLIN ), out buffer 0
packet: read type 34 [len=20,padding=6,comp=13,payload=13]
Dispatching handler for packet type 34
Every other website which I saw is saying the same thing to add those options . But those options are not available in libssh code at all. Can you pl help me on this , or if you can tell me whom should be the better person to ask this.
Thanks & Regards. Vishal
Hi Vishal, this quite interesting information for any future reference but like you said, you cannot set it in libssh so I do not think it is relevant for libnetconf2. My guess is that your NETCONF client (ODL) is the one disconnecting, which is supported by this message
Jan 21 04:43:45 6 2 0 netconfAgent:NoName D LIBNETCONF : Session 1: SSH channel poll error (Received SSH_MSG_DISCONNECT: 2:User session has timed out idling after 600000 ms.).
It says that it has failed to poll a channel because it has disconnected and it then prints out the reason that the client sent. So, you need to configure this timeout in your client somehow, it is not a libnetconf2 issue.
Regards, Michal
Hi Michal,
This is just an FYI . As you said that problem was not with libnetconf2 . libssh terminates the connection after 10 minutes if it doesnot gets response from client. Also the parameters like "ClientAliveInterval"/"ServerAliveInterval" is provided by openssh in /etc/config/sshd_config file and libssh is not using these as they term it as notsafe . So to keep the server alive one needs to send an unsolicited dummy message before the timeout to keep it alive . For this one can use ssh_send_ignore() api of libssh. I tried it and it worked for me .
For reference pl refer to the link : https://bugs.libssh.org/T212
Also to add to this , this issue of connection termination at fixed intervals(10 min) is with ODL client , for MGSoft client it works fine (without changing the code) and doesn't gets disconnected.
Thanks & Regards, Vishal
Hi Vishal, thanks for the information. So it is like I said and it seems to me the correct solution would be to change the ODL client configuration, right? I just want to make sure I understand the problem, you are probably not able to change anything in the client.
Regards, Michal
Hi Michal ,
In a way yes . I am not changing the the ODL client configuration . I don't have much info on what to change in ODL client for the same.
But still as per the RFC 8071 for netconf callhome
This states that : If a persistent connection is desired, the NETCONF/RESTCONF server, as the connection initiator, SHOULD actively test the aliveness of the connection using a keep-alive mechanism. For SSH-based connections, per Section 4 of [RFC4254], the server SHOULD send an SSH_MSG_GLOBAL_REQUEST message with a purposely nonexistent "request name" value(e.g. [email protected]) and the "want reply" value set to '1'.
So to do this it has to be a change in server side , which i think would be correct . Is there anything you can suggest me on this to implement it.
Regards, Vishal
Hi Vishal, libnetconf2 supports sending keep-alive messages but since the spec you quoted cannot be satisfied with the libraries we use (OpenSSL issue https://github.com/openssl/openssl/issues/4856, libssh does not provide the API as seen in your bug) to my knowledge, it uses TCP packets and it worked fine for most use cases. But it seems ODL has a really aggressive config and implements a separate SSH timeout, which needs SSH keep-alive so there is nothing we can do about that in libnetconf2. I suppose your solution of sending some other dummy SSH message is fine but I would prefer not to add it into the library. Also, it is the server that should send keep-alives based on the specs, not the client.
Regards, Michal
Hi, Michal!
ODL for netconf-callhome use (https://tools.ietf.org/html/rfc8071) in section-4.1 point S7 we looks:
If a persistent connection is desired, the NETCONF/RESTCONF server, as the connection initiator, SHOULD actively test the aliveness of the connection using a keep-alive mechanism. For TLS-based connections, the NETCONF/RESTCONF server SHOULD send HeartbeatRequest messages, as defined by [RFC6520]. For SSH-based connections, per Section 4 of [RFC4254], the server SHOULD send an SSH_MSG_GLOBAL_REQUEST message with a purposely nonexistent "request name" value (e.g., [email protected]) and the "want reply" value set to '1'.
I think it's netopeer2 problem (libSSH), not ODL. Сorrect me if I'm wrong.
Hi, well, I suppose, but like I tried to explain there is no way of implementing this keep-alive in libnetconf2 because libssh and OpenSSL do not provide API for this functionality.
However, note that the specification is talking about what a NETCONF server as the connection initiator should do, there is nothing about what the client should do. Obviously, it can expect to receive these messages because the server should send them, but that is still technically a proprietary functionality and I would expect it can at least be configured/turned off.
Regards, Michal
FYI: https://jira.opendaylight.org/browse/NETCONF-832 and https://jira.opendaylight.org/browse/NETCONF-681 have been fixed.
IMO: they could be related to this problem.
The first one has fixed call-home devices heart-beats so they are not disconnected after 10 minutes any more.