logs flooded by error message ERR("%s: failed to wait for a pollsession condition (%s).", func, strerror(ret));
Hi, we are using the latest Netopeer and sysrepod with the libnetconf2-0.11-r1 stack. After a certain number of NETCONF messages, the Netopeer server goes into an abnormal state, printing this error message continuously, and never recovers. We are going through the code but cannot understand what is happening or under what circumstances the server runs into this situation. Any pointers would be of great help.
ERROR: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
ERROR: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
ERROR: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
ERROR: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
ERROR: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
ERROR: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
ERROR: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
ERROR: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
ERROR: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
ERROR: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
Hi,
this was most likely fixed already and you can check that by using the latest version 0.12.51. Otherwise, you can try at least the latest release (0.12-r1) but that may not help at all. Anyway, I can actually look for a problem in the code only if you can reproduce this error in the latest version.
Regards, Michal
We applied a patch from https://github.com/CESNET/libnetconf2/pull/127/files/629dfc4a608633909ad891da8739fb0857103fb3 and will close this after a soaking period.
Closing the issue as we are no longer seeing this in our setup.
Hi, we are still seeing this issue, although it happens very rarely, about once in 2 weeks. We suspect a race between the workers. The netopeer server falls into an unrecoverable state after a few SSH disconnect errors. I am pasting the logs below. I applied the patch mentioned in the comment above, but we are still seeing the issue. Please provide some suggestions on the queue and its mechanics. Thanks & regards, -rk
Feb 23 01:53:14 netopeer2-server[3071]: Session 42: SSH channel poll error (Socket error: disconnected).
Feb 23 01:53:14 netopeer2-server[3071]: Session 41: SSH channel poll error (Socket error: disconnected).
Feb 23 01:53:14 netopeer2-server[3071]: Session 41: invalid session to write to.
Feb 23 01:53:14 netopeer2-server[3071]: Session 41: failed to write notification.
Feb 23 01:53:20 netopeer2-server[3071]: Session 44: SSH channel poll error (Socket error: disconnected).
Feb 23 01:53:20 netopeer2-server[3071]: Session 43: SSH channel poll error (Socket error: disconnected).
Feb 23 01:53:21 netopeer2-server[3071]: Session 43: invalid session to write to.
Feb 23 01:53:21 netopeer2-server[3071]: Session 43: failed to write notification.
Feb 23 01:53:27 netopeer2-server[3071]: Session 46: SSH channel poll error (Socket error: disconnected).
Feb 23 01:53:27 netopeer2-server[3071]: Session 45: SSH channel poll error (Socket error: disconnected).
Feb 23 01:53:27 netopeer2-server[3071]: Session 45: invalid session to write to.
Feb 23 01:53:27 netopeer2-server[3071]: Session 45: failed to write notification.
Feb 23 01:53:34 netopeer2-server[3071]: Session 48: SSH channel poll error (Socket error: disconnected).
Feb 23 01:53:34 netopeer2-server[3071]: Session 47: SSH channel poll error (Socket error: disconnected).
Feb 23 01:53:34 netopeer2-server[3071]: Session 47: invalid session to write to.
Feb 23 01:53:34 netopeer2-server[3071]: Session 47: failed to write notification.
Feb 23 02:06:59 netopeer2-server[3071]: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
Feb 23 02:06:59 netopeer2-server[3071]: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
Feb 23 02:06:59 netopeer2-server[3071]: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
Feb 23 02:07:01 netopeer2-server[3071]: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
Feb 23 02:07:01 netopeer2-server[3071]: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
Feb 23 02:07:01 netopeer2-server[3071]: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
Feb 23 02:07:01 netopeer2-server[3071]: nc_ps_poll: failed to wait for a pollsession condition (Connection timed out).
In the routine nc_ps_poll():

    ret = pthread_cond_timedwait(&ps->cond, &ps->lock, &ts);
    if (ret) {
        /**
         * This may happen when another thread releases the lock and broadcasts the condition
         * and this thread had already timed out. When this thread is scheduled, it returns timed out error
         * but when actually this thread was ready for condition.
         */
        if ((ETIMEDOUT == ret) && (ps->queue[ps->queue_begin] == *id)) {
            break;
        }

On an ETIMEDOUT error the mutex is not owned by the thread, and the break results in unlocking a mutex that the thread does not own. I feel ETIMEDOUT is a failure condition: the thread should be removed from the queue and an error returned to the caller. We cannot move to the latest release because the product is already deployed and upgrading the stack now is risky.
Hi,
if pthread_cond_timedwait fails with ETIMEDOUT, the mutex is reacquired before it returns (consult your manual page, I have found this in 2 different manual versions), so that is not the problem. Whether to fail in this corner case makes little difference, but I think it can stay the way it is, since the whole lock is acquired even though the timeout has elapsed.
Now to the possible causes of the issue. I would first like to know more about your use case, because the log and the timestamps have a pattern. Firstly, are you using a specific stock netopeer2/libnetconf2 revision, or have you made some changes yourself? Then, what can you tell me about the clients, and why do 2 sessions always disconnect at the same time? Finally, be aware that this issue may no longer be present in the current versions, so I may not be able to help at all.
Regards, Michal
Hi, thanks for the response. We are using 2.0.11-r1 and pulled one fix for ETIMEDOUT on top of it, but we are still seeing the issue. We do not know what triggers it; suddenly every thread starts printing the message. We will let you know if we are able to reproduce this. Regards, -rk
Hi,
what may greatly help is netopeer2-server verbose (-v3) output. Also, the session errors are caused by the clients. Do you know why they occur? It is probably not relevant, but I am asking just to be safe.
Regards, Michal