autoware.core icon indicating copy to clipboard operation
autoware.core copied to clipboard

feat: add autoware_node and autoware_control_center

Open xmfcx opened this issue 2 years ago • 41 comments

Signed-off-by: M. Fatih Cırıt [email protected]

Description

Related Discussion: https://github.com/orgs/autowarefoundation/discussions/3194 Old PR: https://github.com/autowarefoundation/autoware.core/pull/57

This PR is for creating the base class for all the future Autoware Core nodes to inherit from.

This node will be inheriting from the rclcpp_lifecycle::LifecycleNode with (LifeCycle Readme).

I'll be taking ros2 nav2 stack as reference too while combining Lifecycle and Composable node concepts.

  • https://navigation.ros.org/concepts/index.html#lifecycle-nodes-and-bond
  • https://navigation.ros.org/configuration/packages/configuring-lifecycle.html
  • https://github.com/ros-planning/navigation2/blob/main/nav2_util/include/nav2_util/lifecycle_node.hpp

While developing this PR, I am going to explore ideas mentioned in previous conversations in the Autoware community.

Related links

Related links
  • Autoware.Auto discussions: https://gitlab.com/autowarefoundation/autoware.auto/AutowareAuto/-/issues/821
  • Discuss Implementing LifeCycleNode Support on Autoware.Auto Nodes - https://gitlab.com/autowarefoundation/autoware.auto/AutowareAuto/-/issues/282
  • Autoware Diagnostics and Monitoring - https://gist.github.com/xmfcx/7eaae9750d6317a3a2aa23745ac99444

Communication architecture

Communication architecture

Registration Service

Service server: ACC

Service client: AutowareNode

Roles:

  • Register the AutowareNode in ACC
  • Receive the UUID for the AutowareNode
  • If already registered, return error
sequenceDiagram
    participant acc as Autoware Control Center
    participant node as node_a

    Note right of node: cli_reg
    node ->>+ acc: name: node_a
    Note left of acc: srv_reg

    acc ->> acc: generate a UUID for the node_a
    acc ->> acc: create a bond between acc and node_a


    Note left of acc: srv_reg
    acc -->>- node: UUID and registration success bool
    Note right of node: cli_reg

Heart beat bond

~~https://github.com/ros/bond_core/blob/ros2/ros2_migration_readme.md~~ Slow

https://github.com/ros-safety/software_watchdogs

https://design.ros2.org/articles/qos_deadline_liveliness_lifespan.html

https://docs.ros.org/en/rolling/Concepts/About-Quality-of-Service-Settings.html#qos-events

Roles:

  • As long as the AutowareNode is alive, it reports with this bond
  • If the AutowareNode dies, it triggers the on_broken callback

Reporting service

Service server: ACC

Service client: AutowareNode

Roles:

  • Reporting state of the AutowareNode to the ACC
    • Warning
    • Error

Control service

Service server: AutowareNode

Service client: ACC

Roles:

  • Shutting down the AutowareNode
  • Toggle AutowareNode between active-inactive?

Tasks

  • [x] Implement registration service
    • [x] Define the interface
    • [x] Create the client
    • [x] Create the service
    • [x] Make them communicate
    • [x] Generate UUID
    • [x] Send UUID
  • [ ] Implement control service
  • [x] Implement reporting service
  • [x] Implement watchdog - heartbeat

Tests performed

PR Review Items

Notes for reviewers

Pre-review checklist for the PR author

The PR author must check the checkboxes below when creating the PR.

In-review checklist for the PR reviewers

The PR reviewers must check the checkboxes below before approval.

  • [ ] The PR follows the pull request guidelines.
  • [ ] The PR has been properly tested.
  • [ ] The PR has been reviewed by the code owners.

Post-review checklist for the PR author

The PR author must check the checkboxes below before merging.

  • [ ] There are no open discussions or they are tracked via tickets.
  • [ ] The PR is ready for merge.

After all checkboxes are checked, anyone who has write access can merge the PR.

xmfcx avatar Feb 03 '23 14:02 xmfcx

Codecov Report

Attention: Patch coverage is 13.97059% with 351 lines in your changes are missing coverage. Please review.

Please upload report for BASE (main@9989140). Learn more about missing BASE report.

:exclamation: Current head e4f5fca differs from pull request most recent head ed2efdb

Please upload reports for the commit ed2efdb to get more accurate results.

Files Patch % Lines
...are_control_center/src/autoware_control_center.cpp 19.25% 65 Missing and 44 partials :warning:
common/autoware_node/src/autoware_node.cpp 0.00% 94 Missing :warning:
...ntrol_center/test/test_autoware_control_center.cpp 14.66% 6 Missing and 58 partials :warning:
...utoware_control_center/test/test_node_registry.cpp 18.42% 0 Missing and 31 partials :warning:
common/test_node/src/test_pub.cpp 0.00% 22 Missing :warning:
...mmon/autoware_control_center/src/node_registry.cpp 56.52% 3 Missing and 7 partials :warning:
common/test_node/src/test_node.cpp 0.00% 10 Missing :warning:
...ontrol_center/src/autoware_control_center_node.cpp 0.00% 9 Missing :warning:
...utoware_control_center/autoware_control_center.hpp 0.00% 1 Missing :warning:
.../include/autoware_control_center/node_registry.hpp 0.00% 1 Missing :warning:
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #73   +/-   ##
=======================================
  Coverage        ?   13.97%           
=======================================
  Files           ?       10           
  Lines           ?      408           
  Branches        ?      189           
=======================================
  Hits            ?       57           
  Misses          ?      211           
  Partials        ?      140           
Flag Coverage Δ
differential 13.97% <13.97%> (?)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Feb 03 '23 14:02 codecov[bot]

This pull request has been automatically marked as stale because it has not had recent activity.

stale[bot] avatar May 15 '23 10:05 stale[bot]

@JianKangEgon I've also added you as an assignee too, after talking with @liuXinGangChina If you have any questions, please let me know.

xmfcx avatar May 29 '23 14:05 xmfcx

@JianKangEgon I've also added you as an assignee too, after talking with @liuXinGangChina If you have any questions, please let me know.

OK, thanks @xmfcx

JianKangEgon avatar May 31 '23 01:05 JianKangEgon

@Tom-Li-Lee to be assigned too.

xmfcx avatar Jun 13 '23 15:06 xmfcx

This pull request has been automatically marked as stale because it has not had recent activity.

stale[bot] avatar Nov 09 '23 23:11 stale[bot]

Priority list:

  1. Registering the node to ACC
    1. De-register / Re-register
    2. Handle different orders of launching
  2. Heartbeat/liveliness for the nodes, reports in last call milliseconds
  3. Receive error states in ACC (through a service call) (Node can report something is wrong with itself)
  4. Monitored Sub
    1. Creates a deadline callback which can either send (info/warning/error) through the previously implemented error handling
  5. Lifecycle
    1. delayed/controlled activate
    2. shutdown
    3. activate/deactivate cycles

xmfcx avatar Dec 12 '23 09:12 xmfcx

@xmfcx I wrote some thoughts about how to handle different orders of launching. Could you please give some feedback on it?

Each node has flag Registered and callback for registration (call to register service). Callback has service availability check (skip iteration). If the Registered is true, the timer stops. At the start up, the Registered is false. Each node has a re-register service (server) which will set Registered to false after service call.

Case ACC starts first. Nodes are not registered, so they send calls to register service. Get return from service and set Register true.

Case ACC starts after Nodes. Nodes are not registered, service calls in timer callback are skipped until ACC starts and service is available. As ACC is available, all happens as in Case ACC starts first.

Case ACC dies after All Nodes have been registered. ACC has a timeout after launching. If none node has registered during this time ACC will list all nodes with re-registered service and will send calls. So all nodes will be unregistered and all will work as in Case ACC starts first.

lexavtanke avatar Dec 20 '23 12:12 lexavtanke

Each node has a re-register service (server) which will set Registered to false after service call.

I see, this makes sense :+1:. It seems we already have the unregister_node() (should rename to deregister_node()) function in the node_registry.hpp. The AutowareNode instances can have deregister() function too.

UUID for the ACC

There is also is_registered() function which returns a boolean. (Maybe it's better if it returned a std::optional<UUID> ?)

This brings up a new topic: what do they register to? So it could make sense to have a UUID for each ACC too. So when a node registers, it will remember who it registered to. And when someone asks to an AutowareNode, it can respond with its corresponding ACC instance.

When faulty ACC dies and new ACC joins, the new ACC can list the nodes with bad ACC. It would deregister them and those nodes would register automatically to the new ACC.

Heartbeat from ACC

Additionally, ACC can have a heartbeat and all the AutowareNode instances would subscribe to this. And if it passes a deadline, they would automatically deregister themselves.

Registration async loop in the AutowareNode

Each node has flag Registered and callback for registration (call to register service). Callback has service availability check (skip iteration). If the Registered is true, the timer stops. At the start up, the Registered is false. Each node has a re-register service (server) which will set Registered to false after service call.

We shouldn't need a re-register service. Just a register service should be enough. I might also be missing something, what do you think?

Also it would be important to have this registration checker loop to be an async operation. Like a background worker. So that it wouldn't affect the normal operation of the AutowareNode.

Thanks for working on this task! :heart_decoration:

xmfcx avatar Dec 21 '23 04:12 xmfcx

@xmfcx Thank you for your thoughts.

It is a really good idea to have

UUID for ACC

so it will be easy to distinguish nodes which were registered to an old ACC.

Here is another case to consider. What if node has some issues during start-up and it will not be able to register? If so ACC will not know that node should even exist. Maybe ACC should have access to list of launching nodes?

As I remember you did some work in this direction, didn't you?

We shouldn't need a re-register service. Just a register service should be enough. I might also be missing something, what do you think?

It seems so for me. At least for now I think all cases can be reduced to register if node can receive information that ACC has died.

Do you think that heartbeat to all all AutowareNodes is a better solution than to have service call to AutowareNodes from new ACC? Re-registering should not be a frequent case. It is kinda emergency. We will not have any additional overhead but it will be significant during the re-registration process in normal situation.

In the case of ACC heartbeat we will always have overhead.

lexavtanke avatar Dec 22 '23 08:12 lexavtanke

Expected nodes list in ACC

In the future, we can implement a feature like this. But for now, ACC doesn't need to know what will launch.

As I remember you did some work in this direction, didn't you?

I thought about this potential but didn't do anything so far.

Deregistration conditions (from AutowareNode perspective)

Let's fix the terminology first (I also edited my previous posts): https://english.stackexchange.com/questions/25931/unregister-vs-deregister

state=unregistered -> 'register' -> state=registered -> 'deregister' -> state=unregistered`

We could have multiple conditions for deregistration.

  • deregister service
  • timeout from the ACC heartbeat deadline

We should definitely have the deregister service. And we can keep the ACC heartbeat implementation on hold for now.

xmfcx avatar Dec 25 '23 05:12 xmfcx

Latest updates.

Heartbeat publishing was added to Autoware node class. It works as regular ros publisher on node_name/heartbeat topic with with additional QoS settings. So subscriber will be able to receive signals if that settings will be violated.

Autoware Control Center will subscribe to Autoware node heartbeat topic during node registration and will store pointers to subscriber and qos settings for particular Autoware node in node registry along with node_name and uuid. So registration callback in Autoware Control Center node register service should be updated.

lexavtanke avatar Jan 15 '24 06:01 lexavtanke

@xmfcx Could you please elaborate?

2. reports in last call milliseconds

As for now there is heartbeat topic for each instance of Autoware node. ACC subscribes to this topic after successful node registration. For now callback of the sub does nothing just print heartbeat timestamp for debugging. https://github.com/autowarefoundation/autoware.core/blob/125e31f27997db5b52803c1cf8a3e7fff19b1f7a/common/autoware_control_center/src/autoware_control_center.cpp#L209

Do you think that ACC should have a data structure to store last received time stamp from each node? So it will be possible to provide a report with list of nodes, their liveliness and last received heartbeat timestamp. It can be done with providing separate srv.

Also each of heartbeat subscribers has callback on the event of violation QoS (liveliness_callback). https://github.com/autowarefoundation/autoware.core/blob/125e31f27997db5b52803c1cf8a3e7fff19b1f7a/common/autoware_control_center/src/autoware_control_center.cpp#L191

We can use only this callback to manage liveliness of nodes in this data structure. If so we will update status of the node in this callback and store timestamp of the violation. If there is no violation we will not write to the structure.

lexavtanke avatar Jan 17 '24 14:01 lexavtanke

  1. reports in last call milliseconds

I've meant "time passed since last heartbeat" here.

Do you think that ACC should have a data structure to store last received time stamp from each node? So it will be possible to provide a report with list of nodes, their liveliness and last received heartbeat timestamp. It can be done with providing separate srv.

The ACC(Autoware Control Center) should keep track of the last heartbeat received for each AN(AutowareNode) and periodically report the time passed since last beat received from the AN.

Maybe we could have a report message (autoware_control_center_msgs::AutowareNodeReport) in the ACC that lists the:

  • Node Identifier: Unique ID or name for each AN
  • Connectivity State:
    • Connected: Active communication link.
    • Disconnected: Communication lost.
    • Never Connected: No record of past connection.
  • Health State:
    • Unknown: Status unclear due to communication issues or lack of updates.
    • Healthy: Operating within normal parameters.
    • Warning: Operating under suboptimal conditions.
    • Error: Critical issues requiring immediate attention.
  • Time Since Last Heartbeat

And we would have an array of these messages in a single message called autoware_control_center_msgs::AutowareNodeReports which is periodically published.

We can use only this callback to manage liveliness of nodes in this data structure. If so we will update status of the node in this callback and store timestamp of the violation. If there is no violation we will not write to the structure.

This is a good idea but this callback will be probably called very often once a node dies. So it might be better to avoid adding a liveliness callback to the heartbeat messages. Instead, when we receive a heartbeat, just record the stamp. And when generating a report, we can calculate the time passed since last communication for each AN.

xmfcx avatar Jan 18 '24 12:01 xmfcx

@xmfcx Thank you for your thoughts!

Maybe we could have a report message (autoware_control_center_msgs::AutowareNodeReport) in the ACC that lists the:

Yes, I have the same idea. And I have started to implement it.

this callback will be probably called very often once a node dies

Actually it called only on state change as I tested it before. So It called ones then node died. But I will think how it will be better to implement it.

lexavtanke avatar Jan 19 '24 12:01 lexavtanke

Fixed double calling deregister service during ACC start-up procedure. AutowareNodeReports publishing was implemented in ACC. Error state receiver (service) in ACC is a wip.

lexavtanke avatar Jan 29 '24 06:01 lexavtanke

Implemented AutowareNodeError service in ACC. Implemented AutowareNodeError client in AutowareNode.

For now AutowareNode is able to send it state to ACC in form of AutowareNodeState msg with 4 main states:

  • UNKNOWN
  • NORMAL
  • WARNING
  • ERROR

lexavtanke avatar Feb 05 '24 06:02 lexavtanke

Implemented monitored subscription in AutowareNode https://github.com/autowarefoundation/autoware.core/pull/73/commits/b59c925846e450742d54397ed3aa25acead5cf57 but It produce error during run time in test_node

/home/alex/autoware.core/install/test_node/lib/test_node/test_node: symbol lookup error: /home/alex/autoware.core/install/test_node/lib/libtest_node_core.so: undefined symbol: _ZN13autoware_node12AutowareNode24test_create_subscriptionIN8std_msgs3msg7String_ISaIvEEESt5_BindIFMN9test_node8TestNodeEFvSt10shared_ptrIS6_EEPS9_St12_PlaceholderILi1EEEES5_N6rclcpp12SubscriptionIS6_S5_S6_S6_NSJ_23message_memory_strategy21MessageMemoryStrategyIS6_S5_EEEESN_EESA_IT2_ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNSJ_3QoSEOT0_RKNSJ_32SubscriptionOptionsWithAllocatorIT1_EENT3_9SharedPtrE
[ros2run]: Process exited with failure 127

Test_node is not able to call function from AutowareNode. It seem connected to linking library error.

lexavtanke avatar Feb 12 '24 06:02 lexavtanke

With @xmfcx we identified the root cause of the issue to be the separation of the declaration and definition of a templated function across .cpp and .hpp files.

This approach led to linkage problems. By consolidating both the declaration and definition within the .hpp file and eliminating the definition from the .cpp file, we were able to resolve the issue effectively.

I use some utility functions for work with uuid from tier4_autoware_utility package. There was a build error because of tier4_autoware_utils as it is the part of autoware.universe and it is not presented in autoware.core.

We talked with @xmfcx about the issue and decided to port the uuid functions to autoware_common as it autoware.core already depend on it. I put uuid helper functions to autoware_utils. It was done in the PR .

For now the PR is ready for review but there are some drawbacks:

  • I didn't switch RCLCPP_INFO messages to RCLCPP_DEBUG for more easy testing.
  • There is the boost dependency issue. It seems like there is no boost lib installed in the container.
  • I think it is better to introduce Lifecycle control in a next PR after this one will be merged and we try to use it to implement some nodes to see how it works with a real package.

lexavtanke avatar Feb 19 '24 06:02 lexavtanke

Instruction how-to launch autoware_control_center and autoware_node (test_node).

There is no dedicated launch file for autoware_control_center. So you need to run it with command: ros2 run autoware_control_center autoware_control_center

After you run it you need to start single test autoware node by command: ros2 run test_node test_node or 4 instances with different names and namespace by launch file: ros2 launch test_node autoware_nodes.launch Also you can start node first.

There is startup timer (for 10 s) in ACC which will send deregistration call (only once) to all autoware_nodes presented in the system in the case if previous ACC has died and no one node will registered during this period. If there is no nodes in the system nothing will happen. ACC will just stay alive and wait until some node will registered.

ACC publish topic /autoware_control_center/autoware_node_reports with current state of registered nodes.

There are simple publisher node (test_pub) with proper Qos profile to test monitored subscription function. By default it publish to /topic with proper frequency of 10 Hz and it is possible to change it with param hz. To run test_pub use command: ros2 run test_node test_pub --ros-args -p hz:=10

Each test_node has monitored subscriber configured to receive messages on /topic with 10 hz if this condition is violated node will call service in ACC and will report error state.

lexavtanke avatar Feb 19 '24 15:02 lexavtanke

hi @lexavtanke , I just followed your instructions and tried the test node. Everything works fine and the nodes defined in "autoware_nodes.launch" are registered or de-registered when I start or shut down the launch file. However, when I want to run a single test node using

ros2 run test_node test_node

It looks okay, but after I shut down the node and restarted it, the log shows below

图片

The left window represents the ACC node, and the right represents the test node. There is always an error log even though the ACC looks normal.

Based on this attempt, I comment out the "test_node1", "test_node2" and "test_node3" in "autoware_nodes.launch" and only remain the "test_node", then perform start & shutdown & restart. This time, both the ACC node and the test node look good. There is nothing special in the launch file, so I'm curious about the difference.

JianKangEgon avatar Feb 23 '24 01:02 JianKangEgon

Hi @JianKangEgon, thank you for your interest in this work and your feedback.

First of all, the case then ACC kept alive and only several autoware_nodes will be relaunched was out of the scope. So error is produced because ACC already has such a name_node in the register. It is possible to fix this behaviour if there is a need for it.

As in my experience I usually relaunch a whole system during a test.

For the difference between using the launch file and ros2 run test_node test_node there is only difference in color highlight of the terminal. It doesn't color errors and warning to yellow and red if started with the launch file. But it still prints them out to the console.

image

If you launch and relaunch ACC and test_node with the same launch file you restart both of them. So the register in ACC is empty after restart.

If I didn't understand you correctly could you please elaborate.

lexavtanke avatar Feb 23 '24 07:02 lexavtanke

Hi @JianKangEgon, thank you for your interest in this work and your feedback.

First of all, the case then ACC kept alive and only several autoware_nodes will be relaunched was out of the scope. So error is produced because ACC already has such a name_node in the register. It is possible to fix this behaviour if there is a need for it.

As in my experience I usually relaunch a whole system during a test.

For the difference between using the launch file and ros2 run test_node test_node there is only difference in color highlight of the terminal. It doesn't color errors and warning to yellow and red if started with the launch file. But it still prints them out to the console. image

If you launch and relaunch ACC and test_node with the same launch file you restart both of them. So the register in ACC is empty after restart.

If I didn't understand you correctly could you please elaborate.

OK, thanks for your quick reply.

Further update: in this case, is it possible to display the info log instead of the error log? Since the node has been registered, the error log may cause unexpected misunderstandings.

JianKangEgon avatar Feb 23 '24 07:02 JianKangEgon

I’ve added uuid_node last_node_state and node_report fields to AutowareNodeReport message.

I’ve added logic to manage relaunch autoware_node without relaunching autoware_control_center. If ANode is relaunched and sends register request ACC updated information (uuid) in register and increase counter field (num_registered) by one. ANode receives success code from ACC. ACC warn log message for such case changed to info message.

I’ve added call AutowareNodeError service in ANode constructor to send NORMAL state to ACC after node creation.

lexavtanke avatar Mar 04 '24 06:03 lexavtanke

@ahmeddesokyebrahim could you also review this?

xmfcx avatar Mar 19 '24 18:03 xmfcx

@lexavtanke Could you fix the CI check errors?

brkay54 avatar Mar 20 '24 10:03 brkay54

brkay54  Hi, which one? If you are talking about uncrustify. It works a bit weird here as all errors are introduced by style(pre-commit): autofix

lexavtanke avatar Mar 20 '24 15:03 lexavtanke

@lexavtanke I mean uncrustify and DCO. I saw the problem for uncrustify. Thank you.

brkay54 avatar Mar 20 '24 15:03 brkay54

brkay54 DCO errors belong to @xmfcx commits.

lexavtanke avatar Mar 20 '24 15:03 lexavtanke

Everything look very fine.

I have only 1 question. Is there a reason we are not using doxygen comments like that

// For methods/functions
/**
 * @brief Brief description of the method.
 *
 * Detailed description (if needed) goes here.
 * @param[in] param1 Description of parameter 1.
 * @param[out] param2 Description of parameter 2 (if any).
 * @return Description of return value (if any).
 */

// For variables
/**
 * @brief Brief description of the variable.
 *
 * Detailed description (if needed) goes here.
 */

This style is already used in Autoware BTW, and here is one example.

Finally, thanks a lot for the amazing efforts and great work done in this PR :+1:

ahmeddesokyebrahim avatar Apr 02 '24 04:04 ahmeddesokyebrahim