autoware.core
autoware.core copied to clipboard
feat: add autoware_node and autoware_control_center
Signed-off-by: M. Fatih Cırıt [email protected]
Description
Related Discussion: https://github.com/orgs/autowarefoundation/discussions/3194 Old PR: https://github.com/autowarefoundation/autoware.core/pull/57
This PR is for creating the base class for all the future Autoware Core nodes to inherit from.
This node will be inheriting from the rclcpp_lifecycle::LifecycleNode with (LifeCycle Readme).
I'll be taking ros2 nav2 stack as reference too while combining Lifecycle and Composable node concepts.
- https://navigation.ros.org/concepts/index.html#lifecycle-nodes-and-bond
- https://navigation.ros.org/configuration/packages/configuring-lifecycle.html
- https://github.com/ros-planning/navigation2/blob/main/nav2_util/include/nav2_util/lifecycle_node.hpp
While developing this PR, I am going to explore ideas mentioned in previous conversations in the Autoware community.
Related links
Related links
- Autoware.Auto discussions: https://gitlab.com/autowarefoundation/autoware.auto/AutowareAuto/-/issues/821
- Discuss Implementing LifeCycleNode Support on Autoware.Auto Nodes - https://gitlab.com/autowarefoundation/autoware.auto/AutowareAuto/-/issues/282
- Autoware Diagnostics and Monitoring - https://gist.github.com/xmfcx/7eaae9750d6317a3a2aa23745ac99444
Communication architecture
Communication architecture
Registration Service
Service server: ACC
Service client: AutowareNode
Roles:
- Register the
AutowareNode
inACC
- Receive the UUID for the
AutowareNode
- If already registered, return error
sequenceDiagram
participant acc as Autoware Control Center
participant node as node_a
Note right of node: cli_reg
node ->>+ acc: name: node_a
Note left of acc: srv_reg
acc ->> acc: generate a UUID for the node_a
acc ->> acc: create a bond between acc and node_a
Note left of acc: srv_reg
acc -->>- node: UUID and registration success bool
Note right of node: cli_reg
Heart beat bond
~~https://github.com/ros/bond_core/blob/ros2/ros2_migration_readme.md~~ Slow
https://github.com/ros-safety/software_watchdogs
https://design.ros2.org/articles/qos_deadline_liveliness_lifespan.html
https://docs.ros.org/en/rolling/Concepts/About-Quality-of-Service-Settings.html#qos-events
Roles:
- As long as the
AutowareNode
is alive, it reports with this bond - If the
AutowareNode
dies, it triggers theon_broken
callback
Reporting service
Service server: ACC
Service client: AutowareNode
Roles:
- Reporting state of the
AutowareNode
to theACC
- Warning
- Error
Control service
Service server: AutowareNode
Service client: ACC
Roles:
- Shutting down the
AutowareNode
- Toggle
AutowareNode
between active-inactive?
Tasks
- [x] Implement registration service
- [x] Define the interface
- [x] Create the client
- [x] Create the service
- [x] Make them communicate
- [x] Generate UUID
- [x] Send UUID
- [ ] Implement control service
- [x] Implement reporting service
- [x] Implement watchdog - heartbeat
Tests performed
PR Review Items
Notes for reviewers
Pre-review checklist for the PR author
The PR author must check the checkboxes below when creating the PR.
- [x] I've confirmed the contribution guidelines.
- [x] The PR follows the pull request guidelines.
In-review checklist for the PR reviewers
The PR reviewers must check the checkboxes below before approval.
- [ ] The PR follows the pull request guidelines.
- [ ] The PR has been properly tested.
- [ ] The PR has been reviewed by the code owners.
Post-review checklist for the PR author
The PR author must check the checkboxes below before merging.
- [ ] There are no open discussions or they are tracked via tickets.
- [ ] The PR is ready for merge.
After all checkboxes are checked, anyone who has write access can merge the PR.
Codecov Report
Attention: Patch coverage is 13.97059%
with 351 lines
in your changes are missing coverage. Please review.
Please upload report for BASE (
main@9989140
). Learn more about missing BASE report.
:exclamation: Current head e4f5fca differs from pull request most recent head ed2efdb
Please upload reports for the commit ed2efdb to get more accurate results.
Additional details and impacted files
@@ Coverage Diff @@
## main #73 +/- ##
=======================================
Coverage ? 13.97%
=======================================
Files ? 10
Lines ? 408
Branches ? 189
=======================================
Hits ? 57
Misses ? 211
Partials ? 140
Flag | Coverage Δ | |
---|---|---|
differential | 13.97% <13.97%> (?) |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
This pull request has been automatically marked as stale because it has not had recent activity.
@JianKangEgon I've also added you as an assignee too, after talking with @liuXinGangChina If you have any questions, please let me know.
@JianKangEgon I've also added you as an assignee too, after talking with @liuXinGangChina If you have any questions, please let me know.
OK, thanks @xmfcx
@Tom-Li-Lee to be assigned too.
This pull request has been automatically marked as stale because it has not had recent activity.
Priority list:
- Registering the node to ACC
- De-register / Re-register
- Handle different orders of launching
- Heartbeat/liveliness for the nodes, reports in last call milliseconds
- Receive error states in ACC (through a service call) (Node can report something is wrong with itself)
- Monitored Sub
- Creates a deadline callback which can either send (info/warning/error) through the previously implemented error handling
- Lifecycle
- delayed/controlled activate
- shutdown
- activate/deactivate cycles
@xmfcx I wrote some thoughts about how to handle different orders of launching. Could you please give some feedback on it?
Each node has flag Registered and callback for registration (call to register service). Callback has service availability check (skip iteration). If the Registered is true, the timer stops. At the start up, the Registered is false. Each node has a re-register service (server) which will set Registered to false after service call.
Case ACC starts first. Nodes are not registered, so they send calls to register service. Get return from service and set Register true.
Case ACC starts after Nodes. Nodes are not registered, service calls in timer callback are skipped until ACC starts and service is available. As ACC is available, all happens as in Case ACC starts first.
Case ACC dies after All Nodes have been registered. ACC has a timeout after launching. If none node has registered during this time ACC will list all nodes with re-registered service and will send calls. So all nodes will be unregistered and all will work as in Case ACC starts first.
Each node has a re-register service (server) which will set Registered to false after service call.
I see, this makes sense :+1:. It seems we already have the unregister_node()
(should rename to deregister_node()
) function in the node_registry.hpp
. The AutowareNode instances can have deregister()
function too.
UUID for the ACC
There is also is_registered()
function which returns a boolean. (Maybe it's better if it returned a std::optional<UUID>
?)
This brings up a new topic: what do they register to? So it could make sense to have a UUID for each ACC too. So when a node registers, it will remember who it registered to. And when someone asks to an AutowareNode, it can respond with its corresponding ACC instance.
When faulty ACC dies and new ACC joins, the new ACC can list the nodes with bad ACC. It would deregister them and those nodes would register automatically to the new ACC.
Heartbeat from ACC
Additionally, ACC can have a heartbeat and all the AutowareNode instances would subscribe to this. And if it passes a deadline, they would automatically deregister themselves.
Registration async loop in the AutowareNode
Each node has flag Registered and callback for registration (call to register service). Callback has service availability check (skip iteration). If the Registered is true, the timer stops. At the start up, the Registered is false. Each node has a re-register service (server) which will set Registered to false after service call.
We shouldn't need a re-register
service. Just a register
service should be enough. I might also be missing something, what do you think?
Also it would be important to have this registration checker loop to be an async operation. Like a background worker. So that it wouldn't affect the normal operation of the AutowareNode.
Thanks for working on this task! :heart_decoration:
@xmfcx Thank you for your thoughts.
It is a really good idea to have
UUID for ACC
so it will be easy to distinguish nodes which were registered to an old ACC.
Here is another case to consider. What if node has some issues during start-up and it will not be able to register? If so ACC will not know that node should even exist. Maybe ACC should have access to list of launching nodes?
As I remember you did some work in this direction, didn't you?
We shouldn't need a re-register service. Just a register service should be enough. I might also be missing something, what do you think?
It seems so for me. At least for now I think all cases can be reduced to register
if node can receive information that ACC has died.
Do you think that heartbeat to all all AutowareNodes is a better solution than to have service call to AutowareNodes from new ACC? Re-registering should not be a frequent case. It is kinda emergency. We will not have any additional overhead but it will be significant during the re-registration process in normal situation.
In the case of ACC heartbeat we will always have overhead.
Expected nodes list in ACC
In the future, we can implement a feature like this. But for now, ACC doesn't need to know what will launch.
As I remember you did some work in this direction, didn't you?
I thought about this potential but didn't do anything so far.
Deregistration conditions (from AutowareNode perspective)
Let's fix the terminology first (I also edited my previous posts): https://english.stackexchange.com/questions/25931/unregister-vs-deregister
state=unregistered -> 'register' -> state=registered -> 'deregister' -> state=unregistered`
We could have multiple conditions for deregistration.
-
deregister
service - timeout from the ACC heartbeat deadline
We should definitely have the deregister
service. And we can keep the ACC heartbeat implementation on hold for now.
Latest updates.
Heartbeat publishing was added to Autoware node class. It works as regular ros publisher on node_name/heartbeat topic with with additional QoS settings. So subscriber will be able to receive signals if that settings will be violated.
Autoware Control Center will subscribe to Autoware node heartbeat topic during node registration and will store pointers to subscriber and qos settings for particular Autoware node in node registry along with node_name and uuid. So registration callback in Autoware Control Center node register service should be updated.
@xmfcx Could you please elaborate?
2. reports in last call milliseconds
As for now there is heartbeat topic for each instance of Autoware node. ACC subscribes to this topic after successful node registration. For now callback of the sub does nothing just print heartbeat timestamp for debugging. https://github.com/autowarefoundation/autoware.core/blob/125e31f27997db5b52803c1cf8a3e7fff19b1f7a/common/autoware_control_center/src/autoware_control_center.cpp#L209
Do you think that ACC should have a data structure to store last received time stamp from each node? So it will be possible to provide a report with list of nodes, their liveliness and last received heartbeat timestamp. It can be done with providing separate srv.
Also each of heartbeat subscribers has callback on the event of violation QoS (liveliness_callback). https://github.com/autowarefoundation/autoware.core/blob/125e31f27997db5b52803c1cf8a3e7fff19b1f7a/common/autoware_control_center/src/autoware_control_center.cpp#L191
We can use only this callback to manage liveliness of nodes in this data structure. If so we will update status of the node in this callback and store timestamp of the violation. If there is no violation we will not write to the structure.
- reports in last call milliseconds
I've meant "time passed since last heartbeat" here.
Do you think that ACC should have a data structure to store last received time stamp from each node? So it will be possible to provide a report with list of nodes, their liveliness and last received heartbeat timestamp. It can be done with providing separate srv.
The ACC(Autoware Control Center) should keep track of the last heartbeat received for each AN(AutowareNode) and periodically report the time passed since last beat received from the AN.
Maybe we could have a report message (autoware_control_center_msgs::AutowareNodeReport
) in the ACC that lists the:
- Node Identifier: Unique ID or name for each AN
- Connectivity State:
- Connected: Active communication link.
- Disconnected: Communication lost.
- Never Connected: No record of past connection.
- Health State:
- Unknown: Status unclear due to communication issues or lack of updates.
- Healthy: Operating within normal parameters.
- Warning: Operating under suboptimal conditions.
- Error: Critical issues requiring immediate attention.
- Time Since Last Heartbeat
And we would have an array of these messages in a single message called autoware_control_center_msgs::AutowareNodeReports
which is periodically published.
We can use only this callback to manage liveliness of nodes in this data structure. If so we will update status of the node in this callback and store timestamp of the violation. If there is no violation we will not write to the structure.
This is a good idea but this callback will be probably called very often once a node dies. So it might be better to avoid adding a liveliness callback to the heartbeat messages. Instead, when we receive a heartbeat, just record the stamp. And when generating a report, we can calculate the time passed since last communication for each AN.
@xmfcx Thank you for your thoughts!
Maybe we could have a report message (
autoware_control_center_msgs::AutowareNodeReport
) in the ACC that lists the:
Yes, I have the same idea. And I have started to implement it.
this callback will be probably called very often once a node dies
Actually it called only on state change as I tested it before. So It called ones then node died. But I will think how it will be better to implement it.
Fixed double calling deregister service during ACC start-up procedure. AutowareNodeReports publishing was implemented in ACC. Error state receiver (service) in ACC is a wip.
Implemented AutowareNodeError service in ACC. Implemented AutowareNodeError client in AutowareNode.
For now AutowareNode is able to send it state to ACC in form of AutowareNodeState msg with 4 main states:
- UNKNOWN
- NORMAL
- WARNING
- ERROR
Implemented monitored subscription in AutowareNode https://github.com/autowarefoundation/autoware.core/pull/73/commits/b59c925846e450742d54397ed3aa25acead5cf57 but It produce error during run time in test_node
/home/alex/autoware.core/install/test_node/lib/test_node/test_node: symbol lookup error: /home/alex/autoware.core/install/test_node/lib/libtest_node_core.so: undefined symbol: _ZN13autoware_node12AutowareNode24test_create_subscriptionIN8std_msgs3msg7String_ISaIvEEESt5_BindIFMN9test_node8TestNodeEFvSt10shared_ptrIS6_EEPS9_St12_PlaceholderILi1EEEES5_N6rclcpp12SubscriptionIS6_S5_S6_S6_NSJ_23message_memory_strategy21MessageMemoryStrategyIS6_S5_EEEESN_EESA_IT2_ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNSJ_3QoSEOT0_RKNSJ_32SubscriptionOptionsWithAllocatorIT1_EENT3_9SharedPtrE
[ros2run]: Process exited with failure 127
Test_node is not able to call function from AutowareNode. It seem connected to linking library error.
With @xmfcx we identified the root cause of the issue to be the separation of the declaration and definition of a templated function across .cpp and .hpp files.
This approach led to linkage problems. By consolidating both the declaration and definition within the .hpp file and eliminating the definition from the .cpp file, we were able to resolve the issue effectively.
I use some utility functions for work with uuid from tier4_autoware_utility package. There was a build error because of tier4_autoware_utils as it is the part of autoware.universe and it is not presented in autoware.core.
We talked with @xmfcx about the issue and decided to port the uuid functions to autoware_common as it autoware.core already depend on it. I put uuid helper functions to autoware_utils. It was done in the PR .
For now the PR is ready for review but there are some drawbacks:
- I didn't switch RCLCPP_INFO messages to RCLCPP_DEBUG for more easy testing.
- There is the boost dependency issue. It seems like there is no boost lib installed in the container.
- I think it is better to introduce Lifecycle control in a next PR after this one will be merged and we try to use it to implement some nodes to see how it works with a real package.
Instruction how-to launch autoware_control_center and autoware_node (test_node).
There is no dedicated launch file for autoware_control_center. So you need to run it with command:
ros2 run autoware_control_center autoware_control_center
After you run it you need to start single test autoware node by command:
ros2 run test_node test_node
or 4 instances with different names and namespace by launch file:
ros2 launch test_node autoware_nodes.launch
Also you can start node first.
There is startup timer (for 10 s) in ACC which will send deregistration call (only once) to all autoware_nodes presented in the system in the case if previous ACC has died and no one node will registered during this period. If there is no nodes in the system nothing will happen. ACC will just stay alive and wait until some node will registered.
ACC publish topic /autoware_control_center/autoware_node_reports
with current state of registered nodes.
There are simple publisher node (test_pub) with proper Qos profile to test monitored subscription function. By default it publish to /topic
with proper frequency of 10 Hz and it is possible to change it with param hz.
To run test_pub use command:
ros2 run test_node test_pub --ros-args -p hz:=10
Each test_node has monitored subscriber configured to receive messages on /topic
with 10 hz if this condition is violated node will call service in ACC and will report error state.
hi @lexavtanke , I just followed your instructions and tried the test node. Everything works fine and the nodes defined in "autoware_nodes.launch" are registered or de-registered when I start or shut down the launch file. However, when I want to run a single test node using
ros2 run test_node test_node
It looks okay, but after I shut down the node and restarted it, the log shows below
The left window represents the ACC node, and the right represents the test node. There is always an error log even though the ACC looks normal.
Based on this attempt, I comment out the "test_node1", "test_node2" and "test_node3" in "autoware_nodes.launch" and only remain the "test_node", then perform start & shutdown & restart. This time, both the ACC node and the test node look good. There is nothing special in the launch file, so I'm curious about the difference.
Hi @JianKangEgon, thank you for your interest in this work and your feedback.
First of all, the case then ACC kept alive and only several autoware_nodes will be relaunched was out of the scope. So error is produced because ACC already has such a name_node in the register. It is possible to fix this behaviour if there is a need for it.
As in my experience I usually relaunch a whole system during a test.
For the difference between using the launch file and ros2 run test_node test_node
there is only difference in color highlight of the terminal. It doesn't color errors and warning to yellow and red if started with the launch file. But it still prints them out to the console.
If you launch and relaunch ACC and test_node with the same launch file you restart both of them. So the register in ACC is empty after restart.
If I didn't understand you correctly could you please elaborate.
Hi @JianKangEgon, thank you for your interest in this work and your feedback.
First of all, the case then ACC kept alive and only several autoware_nodes will be relaunched was out of the scope. So error is produced because ACC already has such a name_node in the register. It is possible to fix this behaviour if there is a need for it.
As in my experience I usually relaunch a whole system during a test.
For the difference between using the launch file and
ros2 run test_node test_node
there is only difference in color highlight of the terminal. It doesn't color errors and warning to yellow and red if started with the launch file. But it still prints them out to the console.If you launch and relaunch ACC and test_node with the same launch file you restart both of them. So the register in ACC is empty after restart.
If I didn't understand you correctly could you please elaborate.
OK, thanks for your quick reply.
Further update: in this case, is it possible to display the info log instead of the error log? Since the node has been registered, the error log may cause unexpected misunderstandings.
I’ve added uuid_node last_node_state and node_report fields to AutowareNodeReport
message.
I’ve added logic to manage relaunch autoware_node
without relaunching autoware_control_center
. If ANode is relaunched and sends register request ACC updated information (uuid) in register and increase counter field (num_registered
) by one. ANode receives success code from ACC.
ACC warn log message for such case changed to info message.
I’ve added call AutowareNodeError
service in ANode constructor to send NORMAL
state to ACC after node creation.
@ahmeddesokyebrahim could you also review this?
@lexavtanke Could you fix the CI check errors?
brkay54 Hi, which one? If you are talking about uncrustify. It works a bit weird here as all errors are introduced by style(pre-commit): autofix
@lexavtanke I mean uncrustify
and DCO
. I saw the problem for uncrustify
. Thank you.
brkay54 DCO
errors belong to @xmfcx commits.
Everything look very fine.
I have only 1 question. Is there a reason we are not using doxygen comments like that
// For methods/functions
/**
* @brief Brief description of the method.
*
* Detailed description (if needed) goes here.
* @param[in] param1 Description of parameter 1.
* @param[out] param2 Description of parameter 2 (if any).
* @return Description of return value (if any).
*/
// For variables
/**
* @brief Brief description of the variable.
*
* Detailed description (if needed) goes here.
*/
This style is already used in Autoware BTW, and here is one example.
Finally, thanks a lot for the amazing efforts and great work done in this PR :+1: