yarp icon indicating copy to clipboard operation
yarp copied to clipboard

yarpmanager gets (almost) stuck in case the connection to yarpserver is delayed

Open S-Dafarra opened this issue 1 year ago • 7 comments
trafficstars

Describe the bug

We have been largely exploiting yarpmanager to run demos involving several applications running on different PCs using different OS. So far, everything is great. Some difficulty arises when yarpserver is placed in a very delayed network. In this case, clicking the wrong button might render the system unusable. To give an example, if we press the refresh all button, it is not possible to do anything else in the application until all the applications, connections, and resources have been checked. The problem is that in case of a delayed network, this operation might require minutes. The problem gets worst if you try to close the application that got stuck as the entire yarpmanager gets completely stuck

To Reproduce One possible way to reproduce it is to set yarp conf to an IP not reachable, and then open yarpmanager without running yarpserver. Then, open the application with the largest number of applications and press on the refresh all button. Then, try to close the blocked application

Expected behavior When attempting to run another action while another one is running, yarpmanager should simply stop the previous one. At the same time, if the user wants to close the open application, it should close without blocking the entire yarpmanager.

Screenshots In the following I just tried to open an application and then I try to close it immediately after. In this specific case, the issue is that yarpserver is not running, but the same happens when yarpserver is in a delayed network

https://github.com/robotology/yarp/assets/18591940/aee83750-34e4-4823-a76f-50c532482cbc

Configuration (please complete the following information):

  • OS: Ubuntu/Windows
  • yarp version: 3.9.0
  • compiler:

Additional context Add any other context about the problem here.

cc @randaz81 @traversaro @Nicogene @GiulioRomualdi

S-Dafarra avatar Apr 16 '24 16:04 S-Dafarra

I dug a bit into the code. When pressing the "refresh all" button the following code is executed https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/mainwindow.cpp#L759-L775

which in turn calls

https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/applicationviewwidget.cpp#L1710-L1717

and then

https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/applicationviewwidget.cpp#L1353-L1406

The juicy bit is https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/applicationviewwidget.cpp#L1398-L1400

that spawns a new thread

https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/safe_manager.cpp#L467-L484

After this, safeManger appears busy until the action is done. This means that in all subsequent actions, the lines https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/applicationviewwidget.cpp#L1355-L1357 will always return false. This appears the reason why it is not possible to do anything else on the application until the previous action is done.

Instead, when closing the tab, the following code is executed: https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/mainwindow.cpp#L795-L864

which calls https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/applicationviewwidget.cpp#L2393-L2395

and then it calls https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/safe_manager.cpp#L24-L29

This gets stuck in https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/safe_manager.cpp#L25 since it is blocking until the separate thread finishes the run method. Since this is happening in the GUI thread, the whole application gets stuck and the OS tags it as "Not Responding"

A possible idea

In the end, the main issue is that the run method is blocking until everything has been done https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/yarpmanager/src-manager/safe_manager.cpp#L95-L381

One possibility would be to use a flag (an atomic boolean for example) to interrupt prematurely the execution of the different for loops to make sure that no action causes the entire yarpmanager to get stuck.

S-Dafarra avatar Apr 16 '24 16:04 S-Dafarra

This is a well known behaviour but it is exremily difficult to improve. Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time. So, if the network ping time is 100ms, each connection to verify might require 4x100ms. If you have 10 connections, these are 4 seconds of delay. No escape. The obvious solution is to keep your ping time to yarp server in the 10ms range or run the yarpserver elsewhere. Ps: fixing the network delay is your preferred option because a responsive yarp server with delayed communication is very bad too (and not equally detactable).

randaz81 avatar Apr 16 '24 16:04 randaz81

Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time.

Is it possible to prematurely interrupt a ping without waiting for the timeout time?

S-Dafarra avatar Apr 16 '24 16:04 S-Dafarra

Regarding the possibility of interrupting an ongoing operation, instead, I think that this is certainly doable (between two subsequent checks).

randaz81 avatar Apr 16 '24 17:04 randaz81

Regarding the possibility of interrupting an ongoing operation, instead, I think that this is certainly doable (between two subsequent checks).

Yeah exactly, that's what I had in mind. In principle, also in case of connection checks, the maximum waiting time should be 2 seconds, so it might still be a reasonable time to wait.

S-Dafarra avatar Apr 16 '24 17:04 S-Dafarra

Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time.

Is it possible to prematurely interrupt a ping without waiting for the timeout time?

No, this is not possible.

randaz81 avatar Apr 17 '24 07:04 randaz81

Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time.

Is it possible to prematurely interrupt a ping without waiting for the timeout time?

No, this is not possible.

I had a further look on this. The timeout value seems to be used here https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/libYARP_os/src/yarp/os/Network.cpp#L753-L766 which sets an internal value that is get in https://github.com/robotology/yarp/blob/e4762b0b4a7946bf9379b183379bd32353d54a26/src/libYARP_os/src/yarp/os/impl/SocketTwoWayStream.cpp#L48-L51

So in the end, the timeout value is given directly to ACE. By reading the documentation of connect, I noticed that it returns a different error when it cannot perform a connection given the timeout (namely EWOULDBLOCK). At the same time, it is possible to call the complete method, that is exactly meant to continue a connection attempt that timed out.

This means that in principle it may be possible to use a fraction of the TIMEOUT value when calling connect and then call complete several times until the user-specified timeout value is reached. In principle, this could give us control to stop this process at an earlier stage.

S-Dafarra avatar Apr 17 '24 08:04 S-Dafarra