ros2_control
Implicit contracts make ros2_control hard to use
Background
I really appreciate all the work that has gone into ros2_control. Recently I've been implementing the various plugins I need to make a robot arm work with ros2_control. Here I will try to summarize my experience and pain points around what you have to do to implement robot control with ros2_control.
Much of this can be summarized as: ros2_control is not easy to use correctly because there are many implicit contracts. In order to use it correctly, I had to learn a great deal about how it works internally. If we can find ways to reduce how much people have to know about ros2_control's internals in order to implement against it, robot drivers will be more robust, easier to write, and the ros2_control team will have more freedom to change the internals.
Much of this seems to stem from the shared memory used for communicating between plugins. Here are the things that surprised me about the shared memory:
- `write` starts being called after `on_init` finishes, not after `on_activate`. This means the values in the command vectors need to be valid after `on_init`; if they are not, and invalid values in the command cause `write` to fail, the system interface is stopped in a bad state.
- It wasn't clear to me who is responsible for initializing the shared memory. I now know it is the responsibility of the system interface (which owns the lifetime of the shared memory) to initialize it.
- I didn't expect `write` to be called with invalid/uninitialized values in the command. This means that in `write` I need to validate the values in the command before sending them; no part of the interface makes this clear.
- When `write` is called, there is no way for me to know which values are new without building tooling to detect changes in values or something similar.
This out-of-band method of sending data down to the system interface is, by my understanding, done because we want the flexibility to assign any hardware interface to any controller and track each one independently. I imagine another reason for the shared memory model is the project's goal of allocating as much of the program's memory as possible up front and avoiding all dynamic allocation at runtime, to make the execution time of the loop much more deterministic.
There is a misconception that is commonly accepted without testing: that copying is slow, so we should design architectures to avoid copying. The reality of what is slow and what is fast in computers is much harder to understand because of how modern CPUs work. In my experience memory locality is often much more important than copying. I am not suggesting that we should introduce dynamic memory allocations into the main loop, but instead that we replace this implicit shared-memory contract with functions that take doubles as arguments, copying the value from one part of the code to the other in a much more explicit way.
Functions for sending new values to system interface
My proposal is we make the sending of data from one layer to the next explicit by replacing the model of keeping track of pointers to doubles with functions. This is how I imagine it would work.
During initialization, the system interface's `export_command_interfaces` would populate a vector of function objects instead of a vector of pointers to double. Those functions would have a signature that looks like `CallbackReturn(double)`. Normally the author of the system interface would create lambdas here that set the internal state variables of the system interface. The primary benefit of this change is that the system interface now has a block of code that executes when a value changes, which makes it much more straightforward to track when those values get updated.
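As a sketch of what that could look like (all names here, such as `CallbackReturn`, `CommandSetter`, and the `joint1/position` key, are illustrative assumptions, not real ros2_control API):

```cpp
#include <cmath>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical sketch: export callables that validate and store commanded
// values, instead of exporting raw pointers to doubles.
enum class CallbackReturn { SUCCESS, ERROR };

using CommandSetter = std::function<CallbackReturn(double)>;

class ExampleSystemInterface {
public:
  std::unordered_map<std::string, CommandSetter> export_command_interfaces() {
    std::unordered_map<std::string, CommandSetter> interfaces;
    interfaces["joint1/position"] = [this](double value) {
      if (!std::isfinite(value)) {
        return CallbackReturn::ERROR;  // reject uninitialized/NaN commands
      }
      joint1_position_command_ = value;
      joint1_command_updated_ = true;  // we know exactly when it changed
      return CallbackReturn::SUCCESS;
    };
    return interfaces;
  }

  double joint1_position_command_ = 0.0;
  bool joint1_command_updated_ = false;
};
```

Note that the validation and the "this value is new" bookkeeping both live in one place, which is exactly the code-that-runs-on-update benefit described above.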
Data structures for enforcing different policies with not-updated data
The next thing to do is to create a data structure that represents a set of values to be used by the system interface. The reason for this is to make explicit the policy for when values get written to hardware. When these functions are called they would call an update function on this data structure, which would be used in any case where writing to hardware is an atomic operation over a fixed set of values (a good example is joint position values, which in my case have to be written together in one message). There is a question of what to do when only some of the values are new when `write` is called. I foresee several separate policies you might want to use when some of the values have not been updated:
- do nothing
- replace not-updated with what we previously sent and write
- replace not-updated with values from the latest read state and write
I think we should provide a data structure that implements these policies; it would make the code in `write` much easier to get right. Here is an example of how I imagine it could work:

```cpp
if (position_command.all_valid()) {
  robot->Write(position_command.values());
}
```
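Here is a hypothetical sketch of such a data structure implementing the three policies above (`CommandSet`, `StalePolicy`, and all method names are made up for illustration):

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Policy for slots that were not updated since the last write.
enum class StalePolicy {
  DoNothing,      // skip the write entirely if any value is stale
  ReuseLastSent,  // fill stale slots with the previously sent values
  ReuseLastRead,  // fill stale slots with the latest read state
};

class CommandSet {
public:
  CommandSet(std::size_t size, StalePolicy policy)
      : policy_(policy), values_(size), updated_(size, false),
        last_sent_(size, 0.0), last_read_(size, 0.0) {}

  void update(std::size_t index, double value) {
    values_[index] = value;
    updated_[index] = true;
  }

  void set_read_state(std::size_t index, double value) { last_read_[index] = value; }

  bool all_valid() const {
    for (bool u : updated_) if (!u) return false;
    return true;
  }

  // Returns the vector to write, or nullopt if the policy says to skip.
  std::optional<std::vector<double>> values() {
    std::vector<double> out = values_;
    for (std::size_t i = 0; i < out.size(); ++i) {
      if (!updated_[i]) {
        switch (policy_) {
          case StalePolicy::DoNothing: return std::nullopt;
          case StalePolicy::ReuseLastSent: out[i] = last_sent_[i]; break;
          case StalePolicy::ReuseLastRead: out[i] = last_read_[i]; break;
        }
      }
    }
    last_sent_ = out;
    updated_.assign(updated_.size(), false);  // values are now consumed
    return out;
  }

private:
  StalePolicy policy_;
  std::vector<double> values_;
  std::vector<bool> updated_;
  std::vector<double> last_sent_, last_read_;
};
```

With this, the policy lives in one tested place instead of being re-implemented (or forgotten) in every hardware interface's `write`.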
Async Service interface
Secondly, I think we should provide an async service call interface for things like requesting the system interface to clear a protective stop. The way I think this should work is that the hardware interface registers a lambda for these calls, which the controller can invoke, with the same signature a service callback would have.
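A minimal sketch of that registration pattern might look like this (the `ServiceRegistry` type and the `clear_protective_stop` name are assumptions for illustration; actual async dispatch and threading are omitted):

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Illustrative request/response types, mirroring a ROS service callback shape.
struct Request { };
struct Response { bool success = false; };

using ServiceCallback = std::function<void(const Request &, Response &)>;

class ServiceRegistry {
public:
  // The hardware interface registers its handlers here during init.
  void register_service(const std::string & name, ServiceCallback cb) {
    services_[name] = std::move(cb);
  }

  // A controller invokes a handler by name, like a local service call.
  Response call(const std::string & name, const Request & req) {
    Response res;
    auto it = services_.find(name);
    if (it != services_.end()) { it->second(req, res); }
    return res;
  }

private:
  std::unordered_map<std::string, ServiceCallback> services_;
};
```

The point is that the contract (name, request, response) is visible in the interface, rather than being smuggled through extra command interfaces.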
Trajectory execution succeeding when the robot hardware interface is in a bad state
This may seem like more of an annoyance than a fundamental design decision, but I think there should be a way for a controller to know when what it is commanding is not happening because the system interface it is commanding is in a bad state, so that it can fail.
> The reality of what is slow and what is fast in computers is much harder to understand because of how modern CPUs work. In my experience memory locality is often much more important than copying.

How difficult would it be to write a toy problem that could give us useful data on this?
> Secondly, I think we should provide an async service call interface for things like requesting the system interface to clear a protective stop.

I was talking with Lovro this morning about how we might recover from protective stops in the UR driver, and was thinking the GPIO controller (which receives the UR's status messages) could be chained in front of the JTC or whatever controller to provide graceful degradation at a joint or system level. There may be other use cases for an async service (or maybe a chained controller is a bad idea here), but I'm just putting it out there. This is easily an entire issue to think about on its own.
> This may seem like more of an annoyance than a fundamental design decision, but I think there should be a way for a controller to know when what it is commanding is not happening because the system interface it is commanding is in a bad state, so that it can fail.

Agreed, this would be good to have, and is another issue in its own right. It would be nice to have some per-joint "joint tolerance" that is enforced throughout trajectory execution: something forgiving enough that the arm manufacturer's MoveJ dynamics smoothing wouldn't trigger a false positive, but such that if we are parameterizing the trajectory too aggressively (e.g. asking for higher acceleration than the arm can support) or something has gone wrong in the arm itself, it can call StopJ to safely stop the arm and report trajectory execution failure.
On the shared memory topic, I created an example implementation of the command interface to make it clearer: https://compiler-explorer.com/z/154h7o89z
> Agreed, this would be good to have, and is another issue in its own right. It would be nice to have some per-joint "joint tolerance" that is enforced throughout trajectory execution: something forgiving enough that the arm manufacturer's MoveJ dynamics smoothing wouldn't trigger a false positive, but such that if we are parameterizing the trajectory too aggressively (e.g. asking for higher acceleration than the arm can support) or something has gone wrong in the arm itself, it can call StopJ to safely stop the arm and report trajectory execution failure.

Something as simple as: joint trajectory execution should return a failure when the end state of the robot after executing the trajectory is not near the end state of the trajectory. The tolerance for this could be parameterized.
> The reality of what is slow and what is fast in computers is much harder to understand because of how modern CPUs work. In my experience memory locality is often much more important than copying.
>
> How difficult would it be to write a toy problem that could give us useful data on this?

I think we'd have to implement it in some way to measure it. I'll think about this more to see if I can figure out a simple way to show this. Here is a blog post on memory locality if that helps: https://gameprogrammingpatterns.com/data-locality.html
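For what it's worth, a toy comparison could be as simple as timing updates through scattered heap pointers against updates to one contiguous vector. This is only a sketch; a real benchmark would have to control for allocator layout, cache warm-up, and the optimizer:

```cpp
#include <chrono>
#include <cstddef>
#include <memory>
#include <vector>

// Time an arbitrary callable in microseconds.
template <typename Fn>
long long time_us(Fn && fn) {
  auto t0 = std::chrono::steady_clock::now();
  fn();
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

// Shared-memory style: each command value lives in its own heap allocation
// and is written through a pointer. Returns the sum so the work can't be
// optimized away (and so a test can check correctness).
double run_scattered(std::size_t joints, int iters) {
  std::vector<std::unique_ptr<double>> targets;
  for (std::size_t i = 0; i < joints; ++i) targets.push_back(std::make_unique<double>(0.0));
  for (int it = 0; it < iters; ++it)
    for (auto & t : targets) *t += 1.0;
  double sum = 0.0;
  for (auto & t : targets) sum += *t;
  return sum;
}

// Copy style: all command values live in one contiguous vector.
double run_contiguous(std::size_t joints, int iters) {
  std::vector<double> targets(joints, 0.0);
  for (int it = 0; it < iters; ++it)
    for (auto & t : targets) t += 1.0;
  double sum = 0.0;
  for (double t : targets) sum += t;
  return sum;
}
```

Wrapping `run_scattered` and `run_contiguous` in `time_us` with a few hundred thousand iterations would give a first data point on whether the indirection costs anything measurable at robot-driver scales.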
That being said, even if for some reason shared memory were more performant, I don't think the downsides of having implicit contracts in the interface of ros2_control would ever justify the performance difference. The burden to benchmark and justify a design lies more heavily on those who advocate for implicit contracts in the interface of the library. Has ros2_control ever created a benchmark showing that its design of using shared memory for sending data between controllers and hardware interfaces is more performant than copying trivially constructible types (like double)?
Thinking more about my design, I think this whole topic of memory locality doesn't apply because I am not actually getting rid of the shared memory. Instead what I am doing is making the act of setting the value explicit by replacing the pointer with a function call.
> Something as simple as the joint trajectory execution should return a failure when the end state of the robot after executing the trajectory is not near the end state of the trajectory. The tolerance for this could be parameterized.

I believe JTC does this style of check and will fail the trajectory when JTC has commanded the last waypoint of the trajectory and both `goal_time_tolerance` and `goal_state_tolerance` are violated: https://github.com/ros-controls/ros2_controllers/blob/master/joint_trajectory_controller/src/joint_trajectory_controller.cpp#L324
The `goal_time_tolerance` is helpful, as my understanding is that some robots (such as UR) will queue up commands if the requested trajectory is too fast; so if you request just a little too fast a trajectory, this gives the robot's controller some time to finish before you fail and halt it. I think this is mostly helpful when your acceleration limit is too high, so the arm just falls behind on the initial ramp-up and final ramp-down.

While some amount of `goal_time_tolerance` is needed, I think it is more important that people are trying to tune the "magic numbers" for vel/acc limits (since we don't know what the arm manufacturer's current limits are, given whatever dynamics it is measuring internally). This means more focus on a "small" `state_tolerance` that is enforced throughout the trajectory. The reason I say this is that some arms might "queue" the commands to compensate for an aggressive trajectory, but others might not, and this inconsistency in how JTC performs could confuse developers. Ideally we could get some conservative vel/acc limits given no payload from arm manufacturers so people aren't having to dial this in themselves.
On another side note (I will work with Denis to open a separate discussion on this), I recently learned more about some of the arm controller APIs, and ideally we would detect when we are stopping the arm because a JTC tolerance is violated and either:

- Have JTC ramp the arm down to a stop using the vel/acc/jerk limits
- Call whatever the arm's equivalent of "StopJ" is

rather than what is currently done: commanding an instantaneous stop (e.g. with MoveJ). What ends up happening if your vel/acc limits are too aggressive is that the arm falls behind, and if your `goal_time_tolerance` isn't high enough to compensate, JTC commands an instantaneous stop, which can then result in the arm protective-stopping itself because of the torque required to do that. I don't think that is what we want to happen.
The JTC was really just an annoyance in the theme of "ros2_control is easy to use incorrectly, I'd like better interfaces". I'd like to focus on the implicit contracts with the user and try to as much as possible change the API so the contracts are well defined in the interface. This way, ideally there are no surprises when adding support for new hardware or writing a controller.
> I believe JTC does this style of check and will fail the trajectory when JTC has commanded the last waypoint of the trajectory and both `goal_time_tolerance` and `goal_state_tolerance` are violated: https://github.com/ros-controls/ros2_controllers/blob/master/joint_trajectory_controller/src/joint_trajectory_controller.cpp#L324
You say that, but in my experience (I am working on Galactic) this is not true. It is likely I have misconfigured it, or failed to configure it to have this behavior. As a user, the fact that it reports success when executing a trajectory while the hardware interface has stopped because it is in some sort of error state (I'm not talking about the case where the robot does move but fails to perfectly execute the trajectory) seems broken to me.
If ros2_control has stopped calling `write` on my command interfaces because I'm in an error state, any controller writing to those interfaces should also fail.
I think we went down a rabbit hole on the JTC thing. I'm not talking about the problem of dealing with the hardware not perfectly doing what the controller is asking for but a much more software problem and less of the "are we there yet" controls problem.
GIVEN: the hardware interface is stopped because it returned an error from `read` or `write` or a state transition, and a controller is commanding command interfaces from that same hardware interface
WHEN: the controller sends a command
THEN: that command should fail, and the controller should report the failure to the user of the controller
This is related to the discussion of implicit contracts. With the current design when a controller does this:
```cpp
command_interface_[idx].set_value(value);
```
There is no built-in way for it to know that the hardware interface that should then write that value to the hardware will not in fact write that value to the hardware.
Instead, if the command interface were a function with a signature like `[[nodiscard]] Status(double)`, then the controller would have to deal with a failure. In the design I posted above, the command interface is a lambda provided by the hardware interface. With that design, this behavior could be added with another lambda inside ros2_control. Here is what the code inside the controller would then look like:
```cpp
if (auto const result = command_interface_[name](value); has_error(result)) {
  // report error to user of controller, either through ROS or a chained controller
}
```
Then inside ros2_control you would have something like this:
```cpp
command_interface_[name] = NoDiscard{[&hardware_interface, &write_functions, name](auto value) {
  if (auto const state = hardware_interface.get_state(); state == OK) {
    return write_functions[name](value);  // forward the value to the hardware interface's setter
  } else {
    return state;  // surface the bad state instead of silently dropping the command
  }
}};
```
This takes the write function exported by the hardware interface and wraps it in another function that first checks the state of the hardware interface: if the state is bad it returns that, and if not it calls the write function provided by the hardware interface for that named interface. This would need to be done at the point where the hardware interface registers its command interfaces, so you can hold a reference to that hardware interface. The `NoDiscard` helper is a trick I have in my fp library to add the `[[nodiscard]]` attribute to a lambda: https://github.com/tylerjw/fp/blob/main/include/fp/no_discard.hpp
Note that if you are using `tl::expected` in your return types, you could write something like this:

```cpp
auto const check_state = NoDiscard{[&hardware_interface](auto value)
    -> tl::expected<decltype(value), Error> {
  if (auto const state = hardware_interface.get_state(); state == OK) {
    return value;
  } else {
    return tl::make_unexpected(state);
  }
}};

command_interface_[name] = [&check_state, &write_functions_](auto value)
    -> tl::expected<decltype(value), Error> {
  return check_state(value).and_then(write_functions_[name]);
};
```
That is better because it composes the state check with the function that writes in the case where the state check succeeds. You could then re-use the state check to do similar chaining on the read functions in the state interfaces in cases where you wanted reads to fail if the hardware interface was in a bad state.
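To make that composition concrete, here is a self-contained sketch that uses `std::optional` in place of `tl::expected` so it builds without third-party libraries (all names here are illustrative, not real ros2_control API):

```cpp
#include <optional>

// Minimal stand-in for a hardware interface with a lifecycle state.
enum class HwState { OK, ERROR };

struct Hardware {
  HwState state = HwState::OK;
  double command = 0.0;
  double measurement = 0.42;
};

// Gate any value through the hardware state check: nullopt means "in error".
std::optional<double> check_state(const Hardware & hw, double value) {
  if (hw.state == HwState::OK) return value;
  return std::nullopt;
}

// Write path: check the state first, then forward to the hardware setter.
std::optional<double> checked_write(Hardware & hw, double value) {
  auto ok = check_state(hw, value);
  if (!ok) return std::nullopt;  // the controller sees the failure
  hw.command = *ok;
  return *ok;
}

// Read path: the same state check guards reads from the state interface.
std::optional<double> checked_read(const Hardware & hw) {
  return check_state(hw, hw.measurement);
}
```

The same `check_state` gate is reused by both paths, so a controller commanding or reading a hardware interface in a bad state always observes the failure instead of silently operating on stale values.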