gazebo-yarp-plugins icon indicating copy to clipboard operation
gazebo-yarp-plugins copied to clipboard

Robot not removed properly when using clock plugin

Open jeljaik opened this issue 9 years ago • 46 comments

While performing some checks on the synchronization of Gazebo and Matlab through the ySynchronizer, which arbitrates this process by sending rpc commands to either stop or continue the simulation, I noticed that when removing the robot Gazebo gets stuck while attempting to close GazeboYarpIMU with the following message:

*** GazeboYarpIMU closing ***
Closing Server Inertial...

This happens not only after I have used the ySynchronizer block but also by just placing the robot and removing it without further interaction. Totally reproducible.

jeljaik avatar Mar 27 '15 09:03 jeljaik

This error is not related to the IMU, but to any analog sensor. If you disable the IMU in the URDF, you will see that the simulation blocks trying to close jointvelocity sensors or forcetorque sensors (if any). The IMU is only the first one to be closed. I have been investigating this issue for 5 days now, and I managed to reduce the deadlock around the AnalogWrapper::Close() function. I have no idea what's down there, among RateThreads, ThreadImpl, PortCore, RpcPorts and so on, but one of the YARP components is deadlocked for sure. I can also add that even if the simulation is blocked, you can still yarp rpc /yourRobot/anyAnalogSensorRpcInterface and calibrate them, meaning that the communication thread is still running happily. @traversaro can we increase the scope and the gravity of this issue?

MirkoFerrati avatar Apr 04 '15 09:04 MirkoFerrati

Nuno and Prashanth should be investigating this issue as well.

jeljaik avatar Apr 04 '15 10:04 jeljaik

@prashanthr05 @nunoguedelha

naveenoid avatar Apr 04 '15 10:04 naveenoid

Ok, I just found out the single line that produces deadlock: https://github.com/robotology/yarp/blob/master/src/libYARP_dev/src/modules/AnalogWrapper/AnalogWrapper.cpp#L469

If you comment it out, recompile yarp, recompile GazeboYarpPlugin, everything will work (except that you do not get timestamps in the envelope of the analog sensor data). If you need this fast, you can apply this small change and stop reading.

A not so good workaround (but give me more time and it will become good) is:

  • turn m_jointsensorwrapper into a polydriver*
  • substitute this line (m_jointsensorwrapper.close()): https://github.com/robotology/gazebo-yarp-plugins/blob/master/src/gazebo_plugins/JointSensors.cc#L33

with something like this:

boost::thread* closeThread= new boost::thread(boost::bind(&close_thread,m_jointsensorsWrapper));

where close_thread is a function that takes a polydriver* and calls close.

I am not posting code because I am still writing a merge that does not leave boost::thread* objects around without deleting them.

The changes above should be applied also to IMU and ForceTorque sensors.

MirkoFerrati avatar Apr 04 '15 10:04 MirkoFerrati

@MirkoFerrati which version (checkout) of yarp are you using?

Some of the things that I think could be related:

  • Unfortunatly the clock synchronization logic in yarp is undocumented. However, the best way of learning its logic (besides looking at the original issue: https://github.com/robotology/gazebo-yarp-plugins/issues/57 is reading the source code, for example the one of the NetworkClock class: https://github.com/robotology/yarp/blob/master/src/libYARP_OS/src/NetworkClock.cpp . Note that code inspection is greatly simpliefied if you use a proper tool (IDE or editor plugin) for following the code execution flow.
  • @paulfitz did some changes in the network clock synchronization code in yarp to solve a similar bug : https://github.com/robotology/yarp/issues/331 ) perhaps this bug could be related to those changes?
  • @barbalberto recently merged some modifications of the analogWrapper, perhaps it can take a quick look if there is anything fishy going on around analogWrapper::close()
  • The original design idea of @paulfitz for clock synchronization was using a topic (persistent), but nobody had the time to properly test the use of a /clock topic over a /clock port, see https://github.com/robotology/gazebo-yarp-plugins/issues/157 (Probably not related, but it has always worried me because I never totally get all the internals of clock synchronization)

cc @nunoguedelha @prashanthr05

My personal impression is that that the gazebo physics thread is calling analogWrapper::close() during the plugin teardown. During the teardwon, something called by analogWrapper::close() is calling a yarp::os::Time::delay(), that however never exits because it waits for at least a clock tick (that triggers a NetworkClock::read() ). Unfortunately the clock is never updated, because it should be updated by the gazebo physics thread, that is the one calling the delay()!

P.s. while I was writing this reply, I quickly checked all the delay() calls in yarp codebase (easy if you use an IDE!) and I guess that there are some fishy calls in the DgramTwoWaysStream class interrupt and close methods: https://github.com/robotology/yarp/blob/master/src/libYARP_OS/src/DgramTwoWayStream.cpp (however they are relativly old, so if this behavior is a regression they are probably not related).

traversaro avatar Apr 04 '15 11:04 traversaro

@traversaro Inside analogWrapper::close() there is a call to Stamp::Update(), which in turn enters yarp::os::Time and deadlocks, I don't know it the reason is delay() or some different function ( I guess time::now() directly is the responsible). I am using an old Yarp version, along with an old GazeboYarpPlugin. I am not able of moving into master. I would appreciate if you can test this:

  • comment out line https://github.com/robotology/yarp/blob/master/src/libYARP_dev/src/modules/AnalogWrapper/AnalogWrapper.cpp#L469 and try if everything works on master

Btw, I checked the implementation of ControlBoardWrapper in order to understand why the controlboards close correctly. The timestamp is also updated in the thread, but in a different way: the time is not returned from time::now but from the encoder sensor structure (search for this timeStamp+=device.subdevices[k].encodersTimes[l]; ) , which is filled inside ControlBoardDriverEncoders in the getEncodersTimed with the value updated from the ControlBoardDriver m_lastTimestamp.update(_info.simTime.Double());

Basically, ControlBoardWrapper uses simulated timestamp, AnalogWrapper does not.

MirkoFerrati avatar Apr 04 '15 11:04 MirkoFerrati

Sorry, unable to test this at the moment. Which Stamp::update() are you referring to? Then one in analogWrapper::run() ?

traversaro avatar Apr 04 '15 11:04 traversaro

Yes, that one. I am testing my changes in a larger URDF model, there may be problems even with the ControlBoards, but they seem to raise only if another AnalogWrapper is closing. I will let you know.

MirkoFerrati avatar Apr 04 '15 11:04 MirkoFerrati

Ok, thanks. Anyway if someone has some time to allocate for serious debugging, I guess that get a clear idea of the issue we should:

  • Make sure that yarp and gazebo-yarp-plugins are compiled with debug simbols (compile in Debug or RelWithDebInfo) and preferably debug symbols for gazebo are installed (via the gazebo*-dbg packages).
  • Try to replicate this bug with only one AnalogWrapper (to reduce the number of threads)
  • Get a gdb backtrace for all the threads, so we can understand the deadlock .

traversaro avatar Apr 04 '15 12:04 traversaro

@MirkoFerrati I was thinking a bit about this, but are you sure that analogWrapper::run is called by the thread that is calling analogWrapper::close() ? At which line of which file the run method is called?

traversaro avatar Apr 04 '15 13:04 traversaro

Another fishy difference between ControlBoardWrapper and AnalogWrapper is in the close, in the AnalogWrapper the stop method is basically called two times:

    if (RateThread::isRunning())
    {
        RateThread::stop();
    }

    RateThread::stop();

traversaro avatar Apr 04 '15 13:04 traversaro

No no no, close() calls join() on the thread run(). But if run() is blocked waiting for a clock do unblock some yarp::os::Time primitives, the close() is blocked too.

Anyway, my changes to Handler.cc are working with AnalogWrapper, but not with IMU. It looks like ServerInertial.h has a very different structure from AnalogWrapper :(

Edit: the double call to stop is a typo, but nothing important, sinche the first call already deadlocks. In my version, the call is done only once and does not deadlock.

MirkoFerrati avatar Apr 04 '15 13:04 MirkoFerrati

Update: I have the ServerInertial closing fine, but I had to put an hack in its close function in order to tell our IMUDriver to stop reading from Gazebo. How can I call the close() method of a DeviceDriver from the ServerInertial that is using it as a IGenericSensor? Also: can I make some changes to the ServerInertial.h inside Yarp? What's the policy?

MirkoFerrati avatar Apr 04 '15 14:04 MirkoFerrati

Branch issue181 works as expected. I don't think we can merge the hack in master, but as soon as someone knows how to do the fix in the yarp way, it should be done! @traversaro will you take care of this? I for sure will propagate this fix in walk-man computers.

MirkoFerrati avatar Apr 04 '15 15:04 MirkoFerrati

  • Ah, I get what you meant. If the problem is relative to ThreadImpl::join, it could be related to https://github.com/robotology/yarp/issues/331 .
  • Yes, the ServerInertial device is quite different from the other wrapper + devices, It is quite old and quite counter intuitive. @barbalberto is refactoring that part of yarp, so I guess he can provide more information, also related to the IGenericSensor / IAnalogSensor duality. Anyway the policy of changes in YARP is that you can propose a pull request, where the change will be discussed and eventually merge. If you want to discuss a change before opening the pull request, feel free to open an issue in https://github.com/robotology/yarp .
  • Yes, I would avoid merging all this hacks in master. I will be in Paris so I will not directly take care of this, but I guess other people in the lab will make sure to find a proper solution. If someone really need this bug fixed as soon as possible, he can always use the issue181.

traversaro avatar Apr 04 '15 18:04 traversaro

TheadImpl::joint gives problem also here: https://github.com/robotology/yarp/issues/469

francesco-romano avatar Apr 05 '15 10:04 francesco-romano

I could reproduce this issue even without using the port /clock, without running Matlab/Simulink nor the synchronisation through the ySynchronizer. I just run the YARP server and Gazebo with the iCub model and the verbose option ( gazebo --verbose /Users/nunoguedelha/dev/icub-gazebo/world/icub.world ). When removing the robot, Gazebo gets stuck while attempting to close GazeboYarpIMU.

nunoguedelha avatar Apr 07 '15 10:04 nunoguedelha

If this can be helpful, from the documentation: void yarp::os::RateThread::stop ( )
Call this to stop the thread, this call blocks until the thread is terminated (and releaseThread() called). Actually calls join. This will deadlock if called from run(), use askToStop() instead.

Not sure this is the problem you are talking about.

lornat75 avatar Apr 07 '15 20:04 lornat75

This code:

if (RateThread::isRunning()) { RateThread::stop(); }

RateThread::stop();

seems strange though.

lornat75 avatar Apr 07 '15 20:04 lornat75

The first stop() is the right one, the second one should be commented out. The deadlock raises anyway. What's the expected behavior of a RateThread when its destructor gets called while running? Should it stop? Should it wait for join()? std::thread calls std::terminate and aborts the application.

MirkoFerrati avatar Apr 07 '15 21:04 MirkoFerrati

If I am not mistaken the destructor calls stop+join.If stops is called outside the thread I would say the problem is that the thread does not exit. Can you add a fprint(stderr) in the releaseThread() function?

lornat75 avatar Apr 08 '15 07:04 lornat75

I added couts in every destructor/stop/stopMain/join/joinImpl that was involved in the process. The stop() function basically sets a variable closing to true, and the thread loop is like while(!closing). The problem is that the thread does never reach the end of the current loop because it is blocked by a call to yarp::Time, which is blocked by gazebo not sending time because the simulation is blocked waiting for the destructor to destroy the thread. Thus, the thread will never check for the value of closing, making the stop function useless.

MirkoFerrati avatar Apr 08 '15 10:04 MirkoFerrati

Also, the difference between askToStop and Stop is that after setting closing, Stop will call Join, which is called anyway by the destructor.

MirkoFerrati avatar Apr 08 '15 10:04 MirkoFerrati

Relating the issue, when I try to simulate the robot in gazebo while running the ySynchronizer, I do not have any issue while removing the robot. The IMU and ForceTorque sensors are closed properly. The versions of codyco-superbuild (latest commit on Apr07) , yarp (latest commit on Apr08) and gazebo_yarp_plugins (latest commit on Mar23) are upto-date. The output on the terminal looks like, in the image. img_20150408_145612049

Although I have not looked deeply into the code of AnalogWrapper.cpp or the JointSensor.cc, I noticed that the changes suggested by @MirkoFerrati are not done i.e commenting the StateStamp or modifying the m_jointsensorwrapper.close().

prashanthr05 avatar Apr 08 '15 14:04 prashanthr05

@prashanthr05 I guess this is expected if you are not launching gazebo with the clock enabled (i.e. gazebo -slibgazebo_yarp_clock.so ) or you are launching gazebo without the YARP_CLOCK enviromental variable set.

@nunoguedelha that seems a quite strange behaviour. Which version of yarp and gazebo are you using?

traversaro avatar Apr 08 '15 15:04 traversaro

I've retested after updating all my repos to last master: I don't have the issue if the variable YARP_CLOCK is unset, whatever option is specified when launching gazebo (so, with or without -s libgazebo_yarp_clock.so). I have the issue if I set YARP_CLOCK=/clock, whatever option is specified when launching gazebo (so, with or without -s libgazebo_yarp_clock.so).

So the issue occurs as soon as you set YARP_CLOCK=/clock, even if the clock plugin is not enabled (and /clock + /clock/rpc ports created).

nunoguedelha avatar Apr 09 '15 14:04 nunoguedelha

@nunoguedelha This is expected, since the deadlock arises when trying to get an external clock which is not published. Notice that when a model is deleted from gazebo, the clock is no longer published, which is the same behavior as not running the libgazebo_yarp_clock plugin.

MirkoFerrati avatar Apr 12 '15 17:04 MirkoFerrati

Just to let you know, we tried as you mentioned in https://github.com/robotology/gazebo-yarp-plugins/issues/181#issuecomment-89553000 commenting out the line containing lastStateStamp.update() but the problem remains the same. We are using the master branches at the current versions of both YARP and gazebo-yarp-plugins.

jeljaik avatar Apr 13 '15 14:04 jeljaik

@jeljaik That test is outdated now. It could work on AnalogWrappers, but for sure not on ServerAnalog (which handles the IMU). If you are into testing, you can try commenting out the IMU plugin from your URDF and see what happens with the lastStateStamp.update() commented out. Anyway, since I am using an old version of YARP, there could be some more lines related to time which I missed in my version.

Another test is my branch issue181 which should solve all the problems even for latest YARP (at the cost of introducing some hacks in the code). https://github.com/robotology/gazebo-yarp-plugins/tree/issue181

Give it a try!

MirkoFerrati avatar Apr 13 '15 16:04 MirkoFerrati

Thank you I'm into properly debugging this along with Nuno. Thanks for the hints. We'll keep you posted.

On Mon, Apr 13, 2015, 18:12 MirkoFerrati [email protected] wrote:

@jeljaik https://github.com/jeljaik That test is outdated now. It could work on AnalogWrappers, but for sure not on ServerAnalog (which handles the IMU). If you are into testing, you can try commenting out the IMU plugin from your URDF and see what happens with the lastStateStamp.update() commented out. Anyway, since I am using an old version of YARP, there could be some more lines related to time which I missed in my version.

Another test is my branch issue181 which should solve all the problems even for latest YARP (at the cost of introducing some hacks in the code). https://github.com/robotology/gazebo-yarp-plugins/tree/issue181

Give it a try!

— Reply to this email directly or view it on GitHub https://github.com/robotology/gazebo-yarp-plugins/issues/181#issuecomment-92413248 .

jeljaik avatar Apr 13 '15 20:04 jeljaik