studio icon indicating copy to clipboard operation
studio copied to clipboard

Reconnection to Rosbridge does not reliably re-advertise topics

Open rgov opened this issue 3 years ago • 13 comments

This comes from a report from a user of my extension, not sure what version they're using but it's fairly recent and on macOS.

My panel has a button that publishes a /bladescript/execute message. The panel is written as a React functional component like so:

function BladescriptPanel(
    { context }: { context: PanelExtensionContext }
): JSX.Element {
...
    // Set up rendering for the panel
    useLayoutEffect(() => {
        ...

        // Advertise that we publish scripts
        context.advertise?.("/bladescript/execute", "std_msgs/String");
    }, [context]);

    ...
}

Occasionally the user gets a complaint from Rosbridge when pressing the button that it was not expecting the publication of the /bladescript/execute message. My theory is that this has to do with when the connection is lost and then the connection gets re-established.

Am I misusing the API here?

image

rgov avatar Sep 28 '22 19:09 rgov

We at least try to re-advertise when reconnecting, but maybe something about it is not working properly:

https://github.com/foxglove/studio/blob/788ce4bdd8dba470f4227996bb8061f1ca392c76/packages/studio-base/src/players/RosbridgePlayer.ts#L664-L685

jtbandes avatar Sep 28 '22 20:09 jtbandes

Is there a way we can help debug this, since it's reproducing fairly reliably? We'd be happy to run a build with additional logging. I don't have time myself to wire that up though.

rgov avatar Sep 29 '22 19:09 rgov

Related issues: https://github.com/RobotWebTools/rosbridge_suite/issues/138 https://github.com/ros/ros_comm/issues/111

rgov avatar Sep 30 '22 18:09 rgov

You can mitigate this issue by setting a high enough unregister_timeout. The default value is 10 seconds, meaning that the publisher will be unregistered after 10 seconds when there are no more clients publishing to this topic.

You can set the unregister_timeout e.g. as a launch file argument (assuming ros1):

roslaunch rosbridge_server rosbridge_websocket.launch unregister_timeout:=3600

Also see https://github.com/RobotWebTools/rosbridge_suite/pull/322 where the timeout was made configurable.

achim-k avatar Oct 10 '22 16:10 achim-k

Thanks @achim-k. I don't fully understand the issue, though. Since you are familiar with it, would you please summarize it? In particular, I don't understand what it means to unregister a publisher, or who is unregistering it from where, and why. What problem does the timeout solve, rather than unregistering immediately? What is the consequence of not unregistering?

rgov avatar Oct 10 '22 16:10 rgov

In order for rosbridge clients to publish to ROS topics, the rosbridge server has to internally create a ROS publisher which is used to publish to the actual ROS topic. This is done when a client sends an advertise operation to the rosbridge server. Future messages sent from the rosbridge client (via the publish operation) will be published to the ROS topic using the previously created ROS publisher. Note that only a single ROS publisher is created per topic. E.g. when there are more then one rosbridge clients publishing to the same topic, the same ROS publisher is used on the server side.

After all clients that previously published to this topic disconnected (or called the unadvertise operation), the internal ROS publisher is unregistered after a certain timeout (unregister_timeout). This leads to the warning messages you mentioned above being logged, which is a consequence of https://github.com/ros/ros_comm/issues/111. So in order to mitigate this, your best option is to avoid unregistering of internal ROS publishers which can be done by setting the unregister_timeout to a high enough value.

achim-k avatar Oct 10 '22 17:10 achim-k

@achim-k Is there something Studio can do to re-establish the publisher when it is connected? Maybe it already does this and the warning shown on rosbridge is separate from that happening? The bug report here describes a situation where Studio has re-connected to rosbridge but not able to publish. Even with a 10 second timeout (the default) for publisher teardown we should be able to re-establish the publisher from Studio and rosbridge would publish the message.

From the end-user perspective (If I've read the issue correctly), we are in a situation where Studio has re-connected and everything looks ok but publishing doesn't work because we've not handled this in a way to get rosbridge to re-establish the publisher.

defunctzombie avatar Oct 10 '22 19:10 defunctzombie

Maybe it already does this and the warning shown on rosbridge is separate from that happening?

From Jacob's comment it looks like that Studio does everything right. But I will investigate if this is really the case.

achim-k avatar Oct 10 '22 20:10 achim-k

Thanks for investigating and for the explanation.

After all clients that previously published to this topic disconnected[,] the internal ROS publisher is unregistered after a certain timeout (unregister_timeout).

I assume the timeout gives the chance for the rosbridge client-publisher to return within a brief window after quitting (the page refresh scenario discussed in https://github.com/RobotWebTools/rosbridge_suite/issues/138), without any apparent change to the ROS topic graph that causes a subscriber to think the topic disappeared.

Then, if I understand these bugs correctly, the rosbridge has said "my last /foo client-publisher is gone, and it's been a few seconds and they never came back, I am going to cease publishing /foo". But something gets out of sync: some other node thinks rosbridge is still publishing /foo but when it actually makes the socket connection to rosbridge and asks for /foo, that's not on the internal list of published topics.

If we apply the workaround to use a long timeout, then we keep the external and internal view of what topics are published in sync by never allowing rosbridge to really unregister the topic.

rgov avatar Oct 10 '22 21:10 rgov

I assume the timeout gives the chance for the rosbridge client-publisher to return within a brief window after quitting (the page refresh scenario discussed in RobotWebTools/rosbridge_suite#138), without any apparent change to the ROS topic graph that causes a subscriber to think the topic disappeared.

Correct.

Then, if I understand these bugs correctly, the rosbridge has said "my last /foo client-publisher is gone, and it's been a few seconds and they never came back, I am going to cease publishing /foo". But something gets out of sync: some other node thinks rosbridge is still publishing /foo but when it actually makes the socket connection to rosbridge and asks for /foo, that's not on the internal list of published topics.

This is also how I understand it.

If we apply the workaround to use a long timeout, then we keep the external and internal view of what topics are published in sync by never allowing rosbridge to really unregister the topic.

Correct. I see no better alternative at the moment as https://github.com/ros/ros_comm/issues/111 is unlikely to get fixed in the future. There should be no issues keeping the ROS publishers instances alive, even when your rosbridge clients create 100+ publishers.

achim-k avatar Oct 10 '22 21:10 achim-k

We at least try to re-advertise when reconnecting, but maybe something about it is not working properly:

https://github.com/foxglove/studio/blob/788ce4bdd8dba470f4227996bb8061f1ca392c76/packages/studio-base/src/players/RosbridgePlayer.ts#L664-L685

We do however not call Roslib.Topic.advertise() so the topics are only re-advertised when a message is being published. I will submit a patch that actively calls advertise().

Edit: When doing that, I noticed some undesired behavior: We sent way too many advertise operations e.g. while setting up the publish panel configuration (see screencast). @jtbandes do you have an idea on how to solve this? Maybe adding some debounce to the publish panel?

Screencast from 12.10.2022 12:27:46.webm

achim-k avatar Oct 12 '22 15:10 achim-k

We could add a debounce if you like. But it also seems fine to send all these advertise/unadvertise operations — is it causing some problem?

jtbandes avatar Oct 12 '22 23:10 jtbandes

It is not a big problem, but these messages cause unnecessary changes to the ros graph so it would be good if we could avoid them.

In #4653, I parameterized the _setupPublishers method. This avoids the unnecessary websocket messages and also doesn't require adding a debounce.

achim-k avatar Oct 17 '22 17:10 achim-k