
Fetch navigation performs poorly in Melodic simulation

Open nickswalker opened this issue 5 years ago • 44 comments

Steps

With up-to-date versions of fetch_ros and fetch_gazebo:

roslaunch fetch_gazebo playground.launch

And

roslaunch fetch_gazebo_demo fetch_nav.launch

Behavior

When given a nav goal, the robot's localization drifts quickly (the drift seems to happen during rotation). The robot is never able to reach the goal.

https://youtu.be/w1y0b5aI3o8

Nothing jumps out from the standard move_base configurations so I'm not sure what's going on.

nickswalker avatar Jan 15 '19 22:01 nickswalker

Thanks, we'll take a look.

FYI: @velveteenrobot, @cjds @erelson & @narora1

moriarty avatar Jan 27 '19 02:01 moriarty

@nickswalker sorry for the delay; everyone I originally tagged has been busy.

I've just spoken with @safrimus and he'll investigate. I've also created an internal JIRA ticket in hopes of not losing track of this issue again: https://fetchrobotics.atlassian.net/browse/OPEN-31

moriarty avatar Feb 13 '19 18:02 moriarty

@nickswalker to clarify: this only happens with Melodic and Gazebo 9?

cjds avatar Feb 13 '19 19:02 cjds

Yes, I have only observed this happening in Melodic with Gazebo 9.

nickswalker avatar Feb 13 '19 19:02 nickswalker

To me, it looks like it's related to localization. Usually when you first localize the robot, its particle cloud is pretty spread out and the estimated robot position jumps around a bit while the robot figures out where it is. However, the particle cloud usually converges to the correct position, and this jumpiness stops. Take a look at the AMCL particle cloud output in RViz (PoseArray type; I can't remember what the topic is).
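For reference, in ROS 1 amcl publishes its particle cloud as a geometry_msgs/PoseArray on the particlecloud topic (the default name; a remap may change it), so a quick sanity check is:

rostopic echo -n 1 /particlecloud --noarr

plus a PoseArray display on that topic in RViz.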

-Derek


dbking77 avatar Feb 14 '19 04:02 dbking77

Here are some clips with AMCL and the localization transforms visualized:

https://www.youtube.com/watch?v=uNb0pJbObHA

https://www.youtube.com/watch?v=sk4ANbCywUk

I pulled in all the recent Fetch changes and bumped to the latest Melodic sync.

It doesn't seem like the AMCL config, the Fetch Gazebo model, or any other component that could obviously cause localization to drift this quickly changed between the Indigo and Melodic releases. But it's eminently reproducible for me: I have a couple of machines now where I can start a fresh workspace, clone everything, run the launch files, and observe this behavior.

Let me know if bags would help.
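If a bag would help, one could be recorded with something like the following (topic names assumed from a stock Fetch simulation):

rosbag record /tf /tf_static /odom /base_scan /particlecloud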

nickswalker avatar Feb 14 '19 07:02 nickswalker

@nickswalker thanks, @safrimus was also able to reproduce immediately in the simulator following your steps in the original issue.

I haven't seen this on the actual hardware; can you confirm that navigation is working on your Fetch running Melodic?

moriarty avatar Feb 14 '19 09:02 moriarty

Yes, navigation has been working fine on the real robot

nickswalker avatar Feb 14 '19 16:02 nickswalker

This issue was previously: https://github.com/fetchrobotics/fetch_ros/issues/102

moriarty avatar Feb 27 '19 21:02 moriarty

@nickswalker can you test this again? And should we close this ticket as a duplicate of #30?

I tagged and released 0.9.0 of this package for Melodic. It was "good enough" but still not perfect; we needed at least one released version in Melodic in order to set up the ros-pull-request-build jobs on the build farm.

moriarty avatar Mar 31 '19 05:03 moriarty

@nickswalker I'll add More Info Needed and Help Wanted to this ticket.

More Info Needed: because I'd like to know how it's performing now. Help Wanted: because we'll need help doing any further tuning on this.

moriarty avatar Apr 05 '19 20:04 moriarty

@moriarty We also have this issue on Ubuntu 18.04/Gazebo 9. I pulled the latest master, which is the same as 0.9.0.

Here is the video: https://youtu.be/lLUQtOjqFnM. After I recorded this, it took about 15 seconds for the robot to reach the last goal I set.

To reproduce:

roslaunch fetch_gazebo playground.launch
roslaunch fetch_gazebo_demo fetch_nav.launch
roscd fetch_navigation && rviz -d config/navigation.rviz

We also tested on 14.04 and Gazebo 2, and it works very well.

umhan35 avatar Jun 21 '19 01:06 umhan35

:disappointed: okay we'll need to look into this more

moriarty avatar Jun 21 '19 03:06 moriarty

@moriarty Sorry for bugging you...

We raised this issue because we use the fetchit codebase for a collaborative project with another university that doesn't have a Fetch robot yet. If we could use navigation in sim, they could run our nav code instead of manually moving the robot in Gazebo. We do have code that publishes to Gazebo to move the robot, but that doesn't account for the probabilistic nature of the navigation stack: the robot ends up in a different position every time.
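For reference, manually moving the robot in Gazebo can be done through the set_model_state service; a sketch, with the model name fetch assumed:

rosservice call /gazebo/set_model_state '{model_state: {model_name: fetch, pose: {position: {x: 1.0, y: 0.0, z: 0.0}}, reference_frame: world}}'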

We can't go back to 14.04... Could you assign someone else to investigate this issue if you are busy? We can also try to fix it ourselves and create a pull request after the investigation, but it would be more productive for Fetch staff to start the investigation.

Thanks.

@velveteenrobot

umhan35 avatar Jun 27 '19 16:06 umhan35

@moriarty @velveteenrobot Could you assign someone to investigate this?

umhan35 avatar Jul 26 '19 19:07 umhan35

Solved. Please take a look at the following link: https://github.com/fetchrobotics/fetch_gazebo/pull/101

AustinJia avatar Apr 14 '20 20:04 AustinJia

I was able to reproduce this issue using the code in #101 and the same steps as before. I don't think the problem is the inflation radius. Something about the simulation is going wrong, causing drift during rotation. Given this, no amount of tuning navigation parameters is going to make it localize well enough to go through doors.

nickswalker avatar Apr 17 '20 02:04 nickswalker

@nickswalker check #101 not for the code but for the comment from @mikeferguson:

So, have you always been building from source? If so, I'd recommend setting CMAKE_BUILD_TYPE=Release. The change to TF2, and associated use of tf2_sensor_msgs::PointCloudIterator is very sensitive to compilation being Release (it's about 300x faster in Release mode than Debug). I've found several times that issues with timing go away when switching to Release build.
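For anyone building from source, a Release build is one setting away (assuming catkin_make; catkin_tools shown as an alternative):

catkin_make -DCMAKE_BUILD_TYPE=Release

or:

catkin config --cmake-args -DCMAKE_BUILD_TYPE=Release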

moriarty avatar Apr 25 '20 22:04 moriarty

The changes in fetchrobotics/fetch_ros@09db2ce01300b253b7fc7fd01ce258e20f2b9b41 (file fetch_depth_layer/src/depth_layer.cpp) are likely causing the difference :( Unfortunately, switching CMAKE_BUILD_TYPE to Release did not seem to fix it.

moriarty avatar Apr 25 '20 23:04 moriarty

@@ -143,8 +144,8 @@ void FetchDepthLayer::onInitialize()
     camera_info_topic, 10, &FetchDepthLayer::cameraInfoCallback, this);
 
   depth_image_sub_.reset(new message_filters::Subscriber<sensor_msgs::Image>(private_nh, camera_depth_topic, 10));
-  depth_image_filter_ = boost::shared_ptr< tf::MessageFilter<sensor_msgs::Image> >(
-    new tf::MessageFilter<sensor_msgs::Image>(*depth_image_sub_, *tf_, global_frame_, 10));
+  depth_image_filter_ = boost::shared_ptr< tf2_ros::MessageFilter<sensor_msgs::Image> >(
+    new tf2_ros::MessageFilter<sensor_msgs::Image>(*depth_image_sub_, *tf_, global_frame_, 10, private_nh));
   depth_image_filter_->registerCallback(boost::bind(&FetchDepthLayer::depthImageCallback, this, _1));
   observation_subscribers_.push_back(depth_image_sub_);
   observation_notifiers_.push_back(depth_image_filter_);
@@ -275,16 +276,26 @@ void FetchDepthLayer::depthImageCallback(
   {
     // find ground plane in camera coordinates using tf
     // transform normal axis
-    tf::Stamped<tf::Vector3> vector(tf::Vector3(0, 0, 1), ros::Time(0), "base_link");
-    tf_->transformVector(msg->header.frame_id, vector, vector);
-    ground_plane[0] = vector.getX();
-    ground_plane[1] = vector.getY();
-    ground_plane[2] = vector.getZ();
+    geometry_msgs::Vector3Stamped vector;
+    vector.vector.x = 0;
+    vector.vector.y = 0;
+    vector.vector.z = 1;
+    vector.header.frame_id = "base_link";
+    vector.header.stamp = ros::Time();
+    tf_->transform(vector, vector, msg->header.frame_id);
+    ground_plane[0] = vector.vector.x;
+    ground_plane[1] = vector.vector.y;
+    ground_plane[2] = vector.vector.z;
 
     // find offset
-    tf::StampedTransform transform;
-    tf_->lookupTransform("base_link", msg->header.frame_id, ros::Time(0), transform);
-    ground_plane[3] = transform.getOrigin().getZ();
+    geometry_msgs::TransformStamped transform;
+    try {
+      transform = tf_->lookupTransform("base_link", msg->header.frame_id, msg->header.stamp);
+      ground_plane[3] = transform.transform.translation.z;
+    } catch (tf2::TransformException){
+      ROS_WARN("Failed to lookup transform!");
+      return;
+    }
   }
 
   // check that ground plane actually exists, so it doesn't count as marking observations
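Worth noting in that diff: the old code looked up the offset transform at ros::Time(0), i.e. the latest available transform, while the new code asks for it at msg->header.stamp, which requires TF data at exactly the image's timestamp. A minimal sketch of the difference (frame names from the diff):

// Old behavior: take whatever transform is most recent
transform = tf_->lookupTransform("base_link", msg->header.frame_id, ros::Time(0));

// New behavior: require the transform at the image's own timestamp;
// in simulation this can fail or lag if TF arrives late
transform = tf_->lookupTransform("base_link", msg->header.frame_id, msg->header.stamp);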

moriarty avatar Apr 25 '20 23:04 moriarty

I confirmed that doing a release build had no impact. I looked at reverting FetchDepthLayer to tf, but stopped when I realized it would've also required changing the upstream DepthLayer code back.

I tried bypassing localization using fake_localization (added a ground truth odometry plugin to our robot model, tweaked our navigation launch file) and this is the behavior now: https://youtu.be/bF_NOWKgx5A
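For anyone who wants to replicate that setup, one way to get ground-truth odometry in Gazebo is the p3d plugin feeding fake_localization; a minimal sketch, with frame and topic names assumed:

<gazebo>
  <plugin name="ground_truth_odom" filename="libgazebo_ros_p3d.so">
    <bodyName>base_link</bodyName>
    <topicName>base_pose_ground_truth</topicName>
    <frameName>map</frameName>
    <updateRate>30.0</updateRate>
  </plugin>
</gazebo>

fake_localization subscribes to base_pose_ground_truth by default and publishes the map->odom transform in place of amcl.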

The local costmap still streaks on rotation, so it definitely seems related to the depth layer somehow not catching the correct transform. As soon as the robot starts rotating, the extra noise in the costmap makes it impossible to navigate through doorways.

nickswalker avatar May 15 '20 20:05 nickswalker

@nickswalker - did you ever resolve this? I am still seeing it on the latest release. I'd be interested to know if you root-caused this or have any other updates.

mkhansenbot avatar Feb 02 '21 21:02 mkhansenbot

No resolution, and no updates since my previous comment

nickswalker avatar Feb 02 '21 22:02 nickswalker

OK, thanks for the update, I'm looking into it

mkhansenbot avatar Feb 03 '21 18:02 mkhansenbot

So I see the same issue when using fake_localization instead of AMCL, and the "odom->base_link" TF is moving around quite a bit. I suspect it's either a problem with the libfetch plugin or the friction of the wheels. The wheel friction was increased by #59; did you ever see the problem before then? I can try reverting that change to see if it makes a difference.
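The drift is easy to watch from the command line while the robot rotates:

rosrun tf tf_echo odom base_link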

mkhansenbot avatar Feb 03 '21 21:02 mkhansenbot

Here's what I mean by the transforms being off.

[screenshot: RViz TF display showing the odom frame offset far from the robot's actual pose]

mkhansenbot avatar Feb 03 '21 21:02 mkhansenbot

This is still an issue. Ubuntu 18.04.5, with all of my Fetch and ROS packages up to date. The odom transform actually reaches points where it is so far off that it's off the map. So something is wrong with the odometry.

cmcollander avatar Feb 06 '21 17:02 cmcollander

I'm not sure what the root cause of this is yet; it may have more than one. However, here's what I think: since I still have this problem when using fake_localization, I don't think the odometry, wheel friction, or localization are the cause, although it is strange how much the odom transform drifts. When using fake_localization, the odom drift shouldn't matter, which is why I don't think that's the problem.

I'm more concerned with the local_costmap, which seems to be getting cleared incorrectly. Maybe @mikeferguson, @dlu, @stevemacenski or someone with deeper knowledge of the costmap clearing can take a look at that. In my screenshot above, you can see that as the robot rotates, the costmap 'smears' previous and current observations together. I think that is causing the local planner to get 'trapped' and become unable to find a path forward. I observe that sometimes after the 'clear costmap' recovery it's able to move again, but not every time, as the doorways are also very narrow compared to the inflation radius of 0.7 m.
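For manual testing, the same clearing can be triggered by hand (assuming the default move_base namespace):

rosservice call /move_base/clear_costmaps "{}"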

So, I have experimented with a few parameter changes and have a few that seem to at least work around this issue. With these changes I can navigate room to room mostly fine, occasionally getting stuck temporarily before proceeding. Not perfect, but much better (at least for me).

In the fetch_navigation/config/costmap_local.yaml file, change:

- global_frame from 'odom' to 'map' - prevents the local_costmap from rotating, which seems to help with the smearing above
- update_frequency to 5.0 - clears / updates the costmap more often
- publish_frequency to 5.0 - publishes for observation in RViz
- inflater/inflation_radius to 0.1 - gives the local_planner more room to navigate through the doorways

global_frame: map

rolling_window: true
update_frequency: 5.0
publish_frequency: 5.0
inflater:
  inflation_radius: 0.1

Also, in the fetch_navigation/config/move_base.yaml file I set planner_frequency: 1.0. That tells the global planner to re-plan every second, and seems to also help the local planner get unstuck.
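That is, in move_base.yaml:

planner_frequency: 1.0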

I started digging into the local_costmap clearing code but didn't see anything that seemed to be causing the problem. I might look at this some more, but wanted to pass along my findings so far to see if others have ideas / suggestions.

mkhansenbot avatar Feb 06 '21 20:02 mkhansenbot

So I also tried switching out the Fetch depth layer for the standard navigation obstacle layer, and I don't see any noticeable improvement. I also tried changing the amcl alpha1 param to 0.5 per this comment from @mikeferguson: https://github.com/fetchrobotics/fetch_gazebo/pull/101#issuecomment-620694079, and I don't see much difference there either. I can navigate pretty well between the two tables, but navigating into the empty room is sometimes unsuccessful; the robot often gets stuck in the doorway.

One thing I may try, per the comment mentioned above, is changing to the DWA planner to see if that improves things. But right now I'm guessing a little, which isn't a good debug strategy. If anyone else has time to look into this and has ideas about what could be wrong, I'm open to collaborating.
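For reference, switching move_base to the DWA local planner is normally a one-parameter change (assuming the dwa_local_planner package is installed):

base_local_planner: dwa_local_planner/DWAPlannerROS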

mkhansenbot avatar Feb 10 '21 17:02 mkhansenbot

I also forgot to mention: I have also tried setting the conservative and aggressive reset distances to 0.0, to clear the local costmaps as cleanly as possible.

mkhansenbot avatar Feb 10 '21 17:02 mkhansenbot