yarp
yarp copied to clipboard
yarpmanager gets (almost) stuck in case the connection to yarpserver is delayed
Describe the bug
We have been largely exploiting yarpmanager to run demos involving several applications running on different PCs using different OS. So far, everything is great. Some difficulty arises when yarpserver is placed in a very delayed network. In this case, clicking the wrong button might render the system unusable. To give an example, if we press the refresh all button, it is not possible to do anything else in the application until all the applications, connections, and resources have been checked. The problem is that in case of a delayed network, this operation might require minutes. The problem gets worst if you try to close the application that got stuck as the entire yarpmanager gets completely stuck
To Reproduce
One possible way to reproduce it is to set yarp conf to an IP not reachable, and then open yarpmanager without running yarpserver. Then, open the application with the largest number of applications and press on the refresh all button. Then, try to close the blocked application
Expected behavior
When attempting to run another action while another one is running, yarpmanager should simply stop the previous one. At the same time, if the user wants to close the open application, it should close without blocking the entire yarpmanager.
Screenshots
In the following I just tried to open an application and then I try to close it immediately after. In this specific case, the issue is that yarpserver is not running, but the same happens when yarpserver is in a delayed network
https://github.com/robotology/yarp/assets/18591940/aee83750-34e4-4823-a76f-50c532482cbc
Configuration (please complete the following information):
- OS: Ubuntu/Windows
- yarp version: 3.9.0
- compiler:
Additional context Add any other context about the problem here.
cc @randaz81 @traversaro @Nicogene @GiulioRomualdi
I dug a bit into the code. When pressing the "refresh all" button the following code is executed https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/mainwindow.cpp#L759-L775
which in turn calls
https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/applicationviewwidget.cpp#L1710-L1717
and then
https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/applicationviewwidget.cpp#L1353-L1406
The juicy bit is https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/applicationviewwidget.cpp#L1398-L1400
that spawns a new thread
https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/safe_manager.cpp#L467-L484
After this, safeManger appears busy until the action is done. This means that in all subsequent actions, the lines
https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/applicationviewwidget.cpp#L1355-L1357
will always return false. This appears the reason why it is not possible to do anything else on the application until the previous action is done.
Instead, when closing the tab, the following code is executed: https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/mainwindow.cpp#L795-L864
which calls https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/applicationviewwidget.cpp#L2393-L2395
and then it calls https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/safe_manager.cpp#L24-L29
This gets stuck in
https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/safe_manager.cpp#L25
since it is blocking until the separate thread finishes the run method.
Since this is happening in the GUI thread, the whole application gets stuck and the OS tags it as "Not Responding"
A possible idea
In the end, the main issue is that the run method is blocking until everything has been done https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/safe_manager.cpp#L95-L381
One possibility would be to use a flag (an atomic boolean for example) to interrupt prematurely the execution of the different for loops to make sure that no action causes the entire yarpmanager to get stuck.
This is a well known behaviour but it is exremily difficult to improve. Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time. So, if the network ping time is 100ms, each connection to verify might require 4x100ms. If you have 10 connections, these are 4 seconds of delay. No escape. The obvious solution is to keep your ping time to yarp server in the 10ms range or run the yarpserver elsewhere. Ps: fixing the network delay is your preferred option because a responsive yarp server with delayed communication is very bad too (and not equally detactable).
Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time.
Is it possible to prematurely interrupt a ping without waiting for the timeout time?
Regarding the possibility of interrupting an ongoing operation, instead, I think that this is certainly doable (between two subsequent checks).
Regarding the possibility of interrupting an ongoing operation, instead, I think that this is certainly doable (between two subsequent checks).
Yeah exactly, that's what I had in mind. In principle, also in case of connection checks, the maximum waiting time should be 2 seconds, so it might still be a reasonable time to wait.
Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time.
Is it possible to prematurely interrupt a ping without waiting for the timeout time?
No, this is not possible.
Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time.
Is it possible to prematurely interrupt a ping without waiting for the timeout time?
No, this is not possible.
I had a further look on this. The timeout value seems to be used here https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/libYARP_os/src/yarp/os/Network.cpp#L753-L766 which sets an internal value that is get in https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/libYARP_os/src/yarp/os/impl/SocketTwoWayStream.cpp#L48-L51
So in the end, the timeout value is given directly to ACE. By reading the documentation of connect, I noticed that it returns a different error when it cannot perform a connection given the timeout (namely EWOULDBLOCK). At the same time, it is possible to call the complete method, that is exactly meant to continue a connection attempt that timed out.
This means that in principle it may be possible to use a fraction of the TIMEOUT value when calling connect and then call complete several times until the user-specified timeout value is reached. In principle, this could give us control to stop this process at an earlier stage.