Standardize thread-safety documentation
Feature request
Feature description
Often, though not consistently, ROS 2 APIs explicitly state their thread (un)safety. These binary statements, when not qualified by a footnote (and even then), are quite unspecific. There's no mention as to:
- whether a function can be called concurrently with itself or not (re-entrancy),
- whether a function can be called concurrently with other functions in a given group or not,
- whether a function can be called concurrently on the same input and/or output argument or not.
- whether a function can be called concurrently on different (but potentially shared) input and/or output argument or not.
As a result, we have a vague qualifier for authors, reviewers, and users.
There have been some exchanges in the past, but still it creeps up every now and then. I think we should sit and unambiguously standardize how we document thread-safety.
Implementation considerations
It's worth exploring how other software standards and libraries do this.
For POSIX:
MT-Safe or Thread-Safe functions are safe to call in the presence of other threads.
Being MT-Safe does not imply a function is atomic, nor that it uses any of the memory synchronization mechanisms POSIX exposes to users. It is even possible that calling MT-Safe functions in sequence does not yield an MT-Safe combination.
Functions like memset or free are tagged as MT-Safe -- that is, no guarantees are made when using these functions' arguments concurrently with the same functions or others.
For Qt:
A thread-safe function can be called simultaneously from multiple threads, even when the invocations use shared data, because all references to the shared data are serialized.
A reentrant function can also be called simultaneously from multiple threads, but only if each invocation uses its own data.
Perhaps we should talk about reentrant functions and thread-safe data structures? Maybe we should state if reads or writes are to be expected on any given argument?
FYI @ivanpauno
Also, @ros2/team to kickstart the discussion.
whether a function can be called concurrently with itself or not,
isn't this exactly the definition of re-entrant? I tend to forget definitions, so maybe I'm wrong.
whether a function can be called concurrently with other functions in a given group or not,
This will usually only matter if the same object is passed to different functions of the group concurrently (you can also have problems if they share global state, but we probably want to avoid that).
I would also like to differentiate the case of "it's safe to call if concurrently" from "concurrent calls have some sort of memory ordering ensured". The second is a stronger guarantee than the first one.
We should take an account why we're documenting this:
- In the case of
rmwAPI, it's a requirement to the rmw implementation to ensure the specified thread safety. - In the case of
rclAPI, it's a hint for specific clients (e.g. rclcpp) so they can protect/synchronize function calls accordingly.
IMO if differentiating some of this cases will help rmw implementations and client libraries to get thread safety right, being more specific does worth it.
isn't this exactly the definition of re-entrant?
Whoops, yeah. Updated the description.
you can also have problems if they share global state, but we probably want to avoid that
We should do our best to avoid that in rcl for sure, but I will say that every requirement we put in rmw APIs constrains third-party implementations.
"it's safe to call if concurrently" from "concurrent calls have some sort of memory ordering ensured"
That's a good point. We should also define what we mean by safe. Program termination isn't the worst prospect. Races and contentions breaking caller expectations are much harder to track and fix.
IMHO if we want to make thread-safety an API requirement (and I think we have to, to some extent), we have to be specific. Also, the less unnecessary synchronization we do in upper layers the better the overall performance will be.
whether a function can be called concurrently with other functions in a given group or not,
This sounds like it could be really hard to track. I guess it would depend on these groups being very well specified?
A better angle might be to talk about thread-safe data structures rather than whether a group of functions can operate concurrently on a single instance.
Do you have an example of where you see this being used? For example the context structure when used in any of the rmw interface functions?
if we want to make thread-safety an API requirement (and I think we have to, to some extent)
+1 to making it a requirement.
A better angle might be to talk about thread-safe data structures rather than whether a group of functions can operate concurrently on a single instance.
I think that @hidmic comment was going in that direction. A function can be thread safely applied over a data structure, and another function can maybe not be applied thread safely over the same data structure.
What I meant was talking about data structures that contain their own locking or similar mechanisms to ensure they are thread safe, rather than putting the onus on the functions to declare that they use the data structure safely.
A function can be thread safely applied over a data structure
I don't understand what this means. Do you mean something like "all functions that are thread-safe over data structure X only ever read it"? That would imply that even though those functions are thread safe over X, you cannot use them at the same time as functions that write X or you would lose that thread safety.
On the other hand, if you mean "this function uses the locks to manage its access to X" then I think that you may as well just talk about whether X has associated locks or not, and make all functions that touch X thread-safe with respect to X by using the locks. Anything that doesn't use the locks would be broken rather than just "not thread safe".
What I meant was talking about data structures that contain their own locking or similar mechanisms to ensure they are thread safe, rather than putting the onus on the functions to declare that they use the data structure safely.
This makes sense to me. It'd seem to me that if we document (a) data structures' support for concurrent access (or lack thereof), and (b) functions' reentrancy and argument mutability (not always implied by the arguments' type, e.g. allocators), we would be painting a complete picture.
that would imply that even though those functions are thread safe over X, you cannot use them at the same time as functions that write X or you would lose that thread safety.
I will say that knowing when a read-write lock can be used instead of a mutex would be valuable.
Anything that doesn't use the locks would be broken rather than just "not thread safe".
Agreed.
I don't understand what this means. Do you mean something like "all functions that are thread-safe over data structure X only ever read it"? That would imply that even though those functions are thread safe over X, you cannot use them at the same time as functions that write X or you would lose that thread safety.
:+1: Reconsidering my previous comment, I agree with this. Only specifying if a data structure is thread safe or not is more logical.
@ros2/team ping on this one for any other ideas or opinions here.
https://github.com/ros2/rmw/pull/270 introduced extra paragraphs to explain thread safety of some functions, which I think is nice.
I think it would also be nice to have a note together with structures, commenting if they are expected to be thread safe or not.
e.g.: rmw_publisher_t is expected to be a thread safe object.
I think that would be enough to have thread-safety clearly documented.