Failed to write reply (would block) & POLL_IO_TIMEOUT
Hello !
I'm using versions:
netopeer2-2.2.35 sysrepo-3.3.10 libnetconf2-3.5.5 libyang-3.7.8
It seems I sporadically encounter an error with this message: [ERR]: LN: Session 1: Failed to write reply (would block)
Note that in my case I am running netopeer2 with POLL_IO_TIMEOUT set to 300 because the host is sometimes slow. https://github.com/search?q=repo%3ACESNET%2Fnetopeer2%20POLL_IO_TIMEOUT&type=code
I noticed, and another person as well, that it seems that increasing the POLL_IO_TIMEOUT causes all the rpcs to be delayed by the POLL_IO_TIMEOUT amount.
From my understand, this should be a timeout, meaning the maximum value before raising an error, not the time between 2 rpcs right ?
Because I tried to increase the value which shouldn't have side effect, but then I have new errors due to this "lock" Issue seems located around this line, because removing it seems to help in our scenario, but it might no be safe:
https://github.com/CESNET/libnetconf2/blob/61fbe731908809f88a187f223d43718479a7e0da/src/session_server.c#L1410
I am running netopeer2 with POLL_IO_TIMEOUT set to 300 because the host is sometimes slow
That should not matter, what was the exact reason you increased it?
causes all the rpcs to be delayed
What do you mean by that, the time between a client sending an RPC and receiving a reply?
It may be tricky to set these timeouts correctly but generally you should be able to keep the default values and adjust only the -t run-time parameter of the server itself.
That should not matter, what was the exact reason you increased it?
One environment is quite slow, so the issue is not the RPC reply time itself which should be below the timer provided with -i to the netopeer2-server. (70 or 90 seconds I don't remember), but with the lock for some operation due to overall disk speed etc I guess.
The 100ms default timeout was causing some issues and it's better with 300.
What do you mean by that, the time between a client sending an RPC and receiving a reply?
I mean that for instance with 300ms timeout, there is always at least POLL_IO_TIMEOUT duration between an rpc and its rpc-reply as you can see in the extract below:
You can try running with a huge timeout and you should see it as well. It seems something is holding the lock for the configured duration instead of only when it uses it.
From my colleague explanation, issue is that due to multithreading with the processing threads (option THREAD_COUNT) and their concurrent usage of nc_pollsession
Suspected flow: First thread use and release the nc_pollsession. Second starts to lock the nc_pollsession but keep it for POLL_IO_TIMEOUT duration (not sure why), causing the first one thread to not claiming the lock during this duration.
You can try running with a huge timeout and you should see it as well.
I have tried 2000 but all the RPC I sent got processed immediately. This will be a "feature" of your platform or specifically your thread planner or something like that. You can try decreasing the number of threads since it seems the environment is too slow to take advantage of multi-threading. Depending on the platform and the scheduler, you can even use only a single thread. That should solve all your locking issues and have no adverse effects as long as you always have only a single session. More sessions should still be supported but will be less responsive.
I tried a new run where I pushed the timer to 3000 (only change since previous run), and as you can see there are multiple rpcs where there is a delay around 3000 between rpc and reply:
It doesn't seems to always happen though, but we can see a pattern here compared to previous run.
Maybe it's when there are multiple rpcs done in quick succession, or with a subscription running (like it is in our case) ?
You can try to find the best configuration for you but I am pretty sure your platform affects this behavior and that there is nothing to improve in libnetconf2. It is generally relying on libssh or libc poll(), anyway.