UR3e connection overflow, robot disconnects and ROS requires full restart.
Affected ROS2 Driver version(s)
3.3.0-1noble.20250520.224221
Used ROS distribution.
Other
Which combination of platform is the ROS driver running on.
Ubuntu Linux with lowlatency kernel
How is the UR ROS2 Driver installed.
From binary packages
Which robot platform is the driver connected to.
UR E-series robot
Robot SW / URSim version(s)
5.21.3 on robot
How is the ROS driver used.
Headless without using the teach pendant
Issue details
Summary
I am running the UR3e headless over ROS 2 Jazzy (a slightly newer distribution than listed, but all functionality seems fine).
I am controlling the robot with MoveIt 2.
During operation, on my hardware (which is an Intel NUC with a modern processor) I get this issue:
[ros2_control_node-1] [ERROR] [1750169798.297154061] [UR_Client_Library:]: Pipeline producer overflowed! <RTDE Data Pipeline>
[ros2_control_node-1] [ERROR] [1750169798.299138637] [UR_Client_Library:]: Pipeline producer overflowed! <RTDE Data Pipeline>
[ros2_control_node-1] [ERROR] [1750169798.301144542] [UR_Client_Library:]: Pipeline producer overflowed! <RTDE Data Pipeline>
At this point, on the teach pendant we get the issue:
socket_read_binary_integer: timeout
Socket timed out waiting for connection on reverse socket. The script will exit now.
External control: stopping communication and control.
I am running Ubuntu 24.04 with a real-time kernel (specifically low-latency: Linux 6.12.28-rt10, installed according to the UR client library guide).
The issue persists regardless of the value of robot_receive_timeout (0.02 by default; I have increased it up to the maximum of 0.2).
If I monitor the joint_states topic with ros2 topic hz, I see the frequency drop off as this problem occurs:
min: 0.000s max: 0.006s std dev: 0.00009s window: 10000
average rate: 499.850
min: 0.000s max: 0.006s std dev: 0.00010s window: 10000
average rate: 371.105
min: 0.000s max: 6.943s std dev: 0.06940s window: 10000
average rate: 371.136
min: 0.000s max: 6.943s std dev: 0.06940s window: 10000
average rate: 371.135
min: 0.000s max: 6.943s std dev: 0.06940s window: 10000
Steps to Reproduce
Launch UR3e (essentially Universal Robot's ROS driver, but we have custom code sending paths/planning collisions etc with MoveIt2, we are not using UR's MoveIt UI).
Run a test script of motions (sending ROS 2 actions to set joint positions). Running between two positions 100 times usually fails at some point (though it sometimes completes all 100).
Expected Behavior
No failure to maintain the ros2_control frequency; no UR disconnect / reverse-socket timeout.
Actual Behavior
Robot fails execution during motion and full ROS stack needs restart to continue.
Workaround Suggestion
No workaround found yet.
I guess we could try downgrading to Ubuntu 22.04 (ROS 2 Humble), but I don't think this is the issue.
Relevant log output
Accept Public visibility
- [x] I agree to make this context public
What does the network connection between the ROS machine and the robot look like? Is it a direct cable connection or does it go over a switch? What is the load on the ROS machine?
Direct cable connection over gigabit ethernet. We aren't running over a switch, and the ROS machine is running nothing else but the code for the UR3e.
We used stress to load all 16 cores of this CPU and whether we had 100% CPU utilisation or not, we kept running into this timeout.
Thank you for clarifying. This rules out the obvious error factors.
You say this occurs after a certain time, right? Is there any user interaction in between? I have seen in the past that my Ubuntu laptop entered a degraded state after a while when I had the driver and a long-term test running.
The first thing I would try to mitigate this is disabling any power management, as e.g. explained in this article.
It occurs intermittently, meaning it can occur straight away or after 100 cycles. We are not using Ubuntu server and the issue occurs when the hardware is in use and the screen is on, so I don't think power management is the culprit.
I did actually recompile the UR client library from source and attempt to increase the robot_receive_timeout above 0.2 to see if this helps, but it seems like this is quite an important number and the driver segfaults if I do so. (I am keen to spend a bit more time to get this to work, but it might not be possible. Even if it is, it doesn't explain why the socket times out—this seems to be the actual issue to me).
Thank you.
Maybe to give you some more context:
socket_read_binary_integer: timeout
Socket timed out waiting for connection on reverse socket. The script will exit now.
External control: stopping communication and control.
This occurs when the script running on the robot doesn't get a new command in time from the ROS side. The timeout for this isn't compiled into the library (that's only the default value), but it is set in the driver explicitly:
https://github.com/UniversalRobots/Universal_Robots_ROS2_Driver/blob/98d66d07f82733885e85e1fc2ce097ad6da0d954/ur_robot_driver/src/hardware_interface.cpp#L843
That timeout can be set through a hardware parameter in the ros2_control.xacro. Changing the default in the client library shouldn't make your driver crash.
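As a hedged illustration (not copied from the shipped files), such a hardware parameter could look roughly like this inside the ros2_control description; verify the exact parameter name and plugin against the ur.ros2_control.xacro of your driver version:

```xml
<!-- Sketch: setting the reverse-interface timeout as a ros2_control
     hardware parameter. Names/values are assumptions to illustrate the
     mechanism; check them against your driver version. -->
<ros2_control name="ur3e" type="system">
  <hardware>
    <plugin>ur_robot_driver/URPositionHardwareInterface</plugin>
    <param name="robot_receive_timeout">0.04</param>
    <!-- ...other parameters from the shipped ur.ros2_control.xacro... -->
  </hardware>
  <!-- joint definitions omitted -->
</ros2_control>
```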
On the other hand,
[ros2_control_node-1] [ERROR] [1750169798.301144542] [UR_Client_Library:]: Pipeline producer overflowed! <RTDE Data Pipeline>
indicates that the control loop running on the ROS pc is not running reliably with 500 Hz. The robot is sending data to the ROS machine with 500 Hz where data is stored into a queue. The control loop pops items from that queue. If items are not removed from the queue fast enough, the queue will get filled up until it eventually has no more capacity which is when this error is triggered.
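The queue mechanism described above can be sketched in plain Python (this is an illustration, not the driver's actual C++ code; the capacity of 8 is made up):

```python
# Producer/consumer pipeline sketch: the robot "produces" a packet each
# 2 ms cycle, the control loop consumes them. If the consumer stalls,
# the bounded queue fills up and the producer overflows.
from collections import deque

QUEUE_CAPACITY = 8  # illustrative; the real pipeline capacity differs

queue = deque()
overflows = 0

def produce(packet):
    """Called at 500 Hz with fresh RTDE data."""
    global overflows
    if len(queue) >= QUEUE_CAPACITY:
        overflows += 1   # "Pipeline producer overflowed!"
        queue.popleft()  # drop the oldest packet to make room
    queue.append(packet)

def consume():
    """Called by the control loop; must keep pace with the producer."""
    return queue.popleft() if queue else None

# Simulate a control loop that stalls for 20 producer cycles:
for cycle in range(20):
    produce(cycle)

print(overflows)  # -> 12 (every cycle after the 8-slot queue filled)
print(consume())  # -> 12 (packets 0..11 were dropped as oldest)
```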
Reasons why your control loop is not keeping up with 500 Hz include, for example:
- Not enough hardware resources available (seems not to be the case for you?)
- Scheduling / thread priority issues -> Did you enable FIFO scheduling with high priority as explained in https://docs.universal-robots.com/Universal_Robots_ROS_Documentation/doc/ur_client_library/doc/real_time.html#setup-user-privileges-to-use-real-time-scheduling?
- Some operation in the cycle takes too long
  - Any other hardware component's read() or write() operation takes time. Do you have other hardware with ros2_control in your system?
  - Any controller's update() takes too long
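As a quick sanity check on the scheduling point, the policy of a running process can be inspected from Python using Linux-only stdlib calls (a small sketch; SCHED_FIFO is what the driver/ros2_control request for the control thread when real-time permissions are available):

```python
# Inspect a process's scheduling policy (Linux-only os APIs).
import os

def describe_policy(pid: int = 0) -> str:
    """Return a readable scheduling policy name for pid (0 = this process)."""
    policy = os.sched_getscheduler(pid)
    names = {
        os.SCHED_OTHER: "SCHED_OTHER (normal time-sharing)",
        os.SCHED_FIFO: "SCHED_FIFO (real-time, first-in first-out)",
        os.SCHED_RR: "SCHED_RR (real-time, round-robin)",
    }
    return names.get(policy, f"policy {policy}")

print(describe_policy())  # an ordinary process reports SCHED_OTHER
```

The same check from a shell is `chrt -p <pid>`; the control thread should report SCHED_FIFO with a high priority.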
Both issues are symptoms from a control loop not running reliably at 500 Hz, so that might be the top priority to debug.
Edit: To further debug this, you can use ros2_control's built-in diagnostics with rqt_runtime_monitor. There you will be able to watch a couple of numbers. My test setup consists of a real base joint, with the other joints run by mock_hardware.
Hi, I can take over from my colleague above. Thanks for the help and detailed reply.
The fact that the control loop is not running at 500 Hz is worrying and was mentioned in our original post. I do not believe we are hardware-limited, but I suppose we could try more powerful hardware.
I have followed the realtime kernel setup guide in its entirety, including the section at the end to disable CPU speed scaling. When you say FIFO scheduling, this is not mentioned in the guide. Do you mean to actively set this using chrt on the process? (e.g. sudo chrt -f -p <priority> <pid>)
No other hardware is using ros2_control, so it must just be UR's stuff stalling for some reason?
We have code running on top of this (primarily MoveIt) but this should not affect the control loop, or do you advise otherwise?
I will have a look at rqt_runtime_monitor, thank you.
When you say FIFO scheduling, this is not mentioned in the guide. Do you mean to actively set this using chrt on the process? (e.g. sudo chrt -f -p <priority> <pid>)
This is done automatically by ros2_control and the driver. If you don't see a warning about this during startup, everything is set up correctly.
We have code running on top of this (primarily MoveIt) but this should not affect the control loop, or do you advise otherwise?
That depends on the load that causes. But I wouldn't expect that.
Let's have a look at the runtime diagnostics.
Yep no warnings during startup.
Here are some outputs, a warning is observed when the pipeline overflows:
Evidently the execution time scaling to such a high number is the problem, but I am as yet uncertain of its cause.
The three menus expanded:
That high std dev on the JTC is definitely worrying and could be the source of problems. Those stats unfortunately don't show how often a higher execution time occurs, but an execution time of 291 ms would definitely cause problems, as it is equivalent to ~150 control cycles (at 2 ms per cycle). Given the average of 16.6 us, it was probably indeed a single outlier.
How large are the trajectories (number of points) that you send to the controller? I remember an issue about long trajectories in the JTC (https://github.com/ros-controls/ros2_controllers/issues/1293) that showed a high spike in the cycle time once a new trajectory was received (though it was "only" 8 instead of 4 ms there).
As a workaround you could try using the passthrough_trajectory_controller instead of the scaled JTC. It acts as a drop-in replacement and offloads trajectory execution to the robot controller rather than the ROS machine. The numbers you posted seem to indicate that the problem could be somewhere in the JTC, so I would be curious whether that changes something.
I am pretty sure it is a single outlier from my inspection. Because it is task-ending, we of course do not have the opportunity to encounter it again before restarting ROS.
I will report back on trajectory length; I really appreciate all the help. Some of the trajectories take ~10 s to execute, so it's likely they are on the order of 100,000 points, as mentioned in that post.
The passthrough_trajectory_controller is an incredible suggestion which I hadn't thought to try. I tried it a bit this evening and it seemed to show some improvement. I will be running the robot all day tomorrow, so I will be able to give a more definitive answer.
I will report back. Thank you for all the time you have spent assisting us here.
We have been running with the passthrough JTC for a couple of weeks now and it is working really well for us.
I do observe a message on the teach pendant mentioning that infinite velocities are requested, but this doesn't seem to block operation. (TODO: insert message here when next observed)
That said, the initial issue is still not resolved.
I hope this workaround provides some value to those onlooking.
@urfeex, when running we get trajectories selecting 'quintic splines'. I have observed others running with cubic; is this configurable in some way?
I'm happy that the passthrough trajectory controller is a working workaround for you, but I agree that the initial topic isn't resolved and should get investigated further. I'm still in the process of setting up a long term test in order to be able to run those long-running tests without blocking my dev machine.
when running we get trajectories selecting 'quintic splines'. I have observed others running with cubic; is this configurable in some way?
When providing setpoints with positions and velocities, cubic splines will be used. When also providing accelerations, quintic splines will be used. The latter should be smoother, so they are preferable if you have accelerations available in your setpoints.
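The selection rule above can be paraphrased as a small sketch (my illustration, not the controller's actual code): the spline order follows from which fields each trajectory setpoint carries.

```python
# Spline order selection based on the fields present in a setpoint.
def spline_order(point: dict) -> str:
    """point may carry 'positions', 'velocities', 'accelerations' lists."""
    if point.get("accelerations"):
        return "quintic"   # positions + velocities + accelerations
    if point.get("velocities"):
        return "cubic"     # positions + velocities
    return "linear"        # positions only

print(spline_order({"positions": [0.0], "velocities": [0.1]}))  # -> cubic
print(spline_order({"positions": [0.0], "velocities": [0.1],
                    "accelerations": [0.0]}))                   # -> quintic
```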