rclcpp icon indicating copy to clipboard operation
rclcpp copied to clipboard

:farmer: connext test regressions in `test_subscription_topic_statistics`

Open Crola1702 opened this issue 1 year ago • 3 comments

Bug report

Required Info:

  • Operating System:
    • Linux Noble
  • Installation type:
    • Source
  • Version or commit hash:
    • Rolling, Jammy
  • DDS implementation:
    • RTI Connext
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

  1. Run a build in Rci__nightly-connext_ubuntu_noble_amd64
  2. See test regressions

Additional information

Reference build: Rci__nightly-connext_ubuntu_noble_amd64#94

Test regressions:

Log output:

[ RUN      ] TestSubscriptionTopicStatisticsFixture.test_receive_stats_for_message_no_header
RTI Connext DDS Non-commercial license is for academic, research, evaluation and personal use only. USE FOR COMMERCIAL PURPOSES IS PROHIBITED. See RTI_LICENSE.TXT for terms. Download free tools at rti.com/ncl. License issued to Non-Commercial User [email protected] For non-production use only.
Expires on 00-jan-00 See www.rti.com for more information.
/tmp/ws/src/ros2/rclcpp/rclcpp/test/rclcpp/topic_statistics/test_subscription_topic_statistics.cpp:355: Failure
Expected equality of these values:
  kNumExpectedMessageAgeMessages
    Which is: 4
  message_age_count
    Which is: 0

/tmp/ws/src/ros2/rclcpp/rclcpp/test/rclcpp/topic_statistics/test_subscription_topic_statistics.cpp:356: Failure
Expected equality of these values:
  kNumExpectedMessagePeriodMessages
    Which is: 4
  message_period_count
    Which is: 8

[  FAILED  ] TestSubscriptionTopicStatisticsFixture.test_receive_stats_for_message_no_header (8331 ms)

...

[ RUN      ] TestSubscriptionTopicStatisticsFixture.test_receive_stats_include_window_reset
RTI Connext DDS Non-commercial license is for academic, research, evaluation and personal use only. USE FOR COMMERCIAL PURPOSES IS PROHIBITED. See RTI_LICENSE.TXT for terms. Download free tools at rti.com/ncl. License issued to Non-Commercial User [email protected] For non-production use only.
Expires on 00-jan-00 See www.rti.com for more information.
/tmp/ws/src/ros2/rclcpp/rclcpp/test/rclcpp/topic_statistics/test_subscription_topic_statistics.cpp:404: Failure
Expected: (stats_point.data) >= (message_age_offset), actual: 100.035641 vs 1000

/tmp/ws/src/ros2/rclcpp/rclcpp/test/rclcpp/topic_statistics/test_subscription_topic_statistics.cpp:404: Failure
Expected: (stats_point.data) >= (message_age_offset), actual: 99.960970000000003 vs 1000

[  FAILED  ] TestSubscriptionTopicStatisticsFixture.test_receive_stats_include_window_reset (8331 ms)

Failures are pointing to this source code: https://github.com/ros2/rclcpp/blob/rolling/rclcpp/test/rclcpp/topic_statistics/test_subscription_topic_statistics.cpp

Crola1702 avatar Jul 24 '24 13:07 Crola1702

this problem can be reproducible with current rolling today, as reported this only happens with https://github.com/ros2/rmw_connextdds.

source timestamp is assigned by RTI Connext to libstatistic allocator, and confirmed that topic statistics on the subscription publishes the data for both message_age and message_period. there seems to be no problem at this statistics publisher process.

https://github.com/ros2/rclcpp/blob/45adf6565f99042a5fdcad7b8830e69f5efb03e5/rclcpp/include/rclcpp/topic_statistics/subscription_topic_statistics.hpp#L143-L145

on the other hand, MetricsMessageCallback is only called with source name message_period.

https://github.com/ros2/rclcpp/blob/45adf6565f99042a5fdcad7b8830e69f5efb03e5/rclcpp/test/rclcpp/topic_statistics/test_topic_stats_utils.hpp#L132-L140

that fails the test since it expects the same number of messages for each source name. it just looks like source name message_age metric message is never delivered to MetricsMessageSubscriber during this test. i would like to hear from RTI developer for this behavior.

IMO, assumption that the same exact number of messages are delivered to each source name would be bad, since there is no guarantee that multiple source name messages come in order, that i think depends on the RMW implementation.

fujitatomoya avatar Jul 30 '24 23:07 fujitatomoya

i would like to hear from RTI developer for this behavior.

@asorbini any chance that you or someone else from the RTI team may be able to shed some light here?

mjcarroll avatar Aug 08 '24 19:08 mjcarroll

@clalancette You mentioned to assign this issue to you and add label more information needed.

@asorbini A friendly ping for comments about this issue.

MichaelOrlov avatar Aug 22 '24 17:08 MichaelOrlov

@trubio-rti CC: @asorbini

this comes up with priority in this week's PMC meeting. according to https://github.com/ros2/rclcpp/issues/2588#issuecomment-2259355642, it just cannot receive the data from RTI connextdds from rmw perspective. could you taka a look at this why we are missing the data? (this only happens with RTI connextdds, but cyclone nor fastdds.)

fujitatomoya avatar Sep 17 '24 15:09 fujitatomoya

@fujitatomoya: @fgallegosalido is the one working now on the Connext RMW

trubio-rti avatar Sep 17 '24 15:09 trubio-rti

@trubio-rti thanks! @fgallegosalido let us know if you need any help.

fujitatomoya avatar Sep 17 '24 16:09 fujitatomoya

@fujitatomoya we are trying to reproduce the issue. Could you clarify which Connext version is being used or is it just the one in the CI? Thanks.

fgallegosalido avatar Sep 17 '24 16:09 fgallegosalido

@fgallegosalido that is 6.0.1.25-1.

fujitatomoya avatar Sep 17 '24 17:09 fujitatomoya

I was able to reproduce it. Working on it

fgallegosalido avatar Oct 01 '24 14:10 fgallegosalido

After a long investigation, we've found out that the issue was produced by the KEEP_LAST QoS on the topic statistics publisher. Basically, as types are unkeyed in ROS2, when two samples were sent so close to each other, the second sample would arrive to the publisher queue before the first one could be sent, therefore discarding the first one.

Changing the QoS in the test to use KEEP_ALL for the statistics publisher (in the class SubscriberWithTopicStatistics), fixes both tests (haven't had a failure in 100 runs). Basically change the constructor of the class SubscriberWithTopicStatistics by setting the history QoS to keep_all:

auto options = rclcpp::SubscriptionOptions();
options.topic_stats_options.state = rclcpp::TopicStatisticsState::Enable;
options.topic_stats_options.publish_period = publish_period;
options.topic_stats_options.qos.keep_all();

fgallegosalido avatar Oct 15 '24 13:10 fgallegosalido

@fgallegosalido thanks for the analysis, i will try and confirm it right now.

fujitatomoya avatar Oct 15 '24 17:10 fujitatomoya

@fgallegosalido i would like to confirm the connextdds behavior, can you check the comments in https://github.com/ros2/rclcpp/pull/2650?

fujitatomoya avatar Oct 15 '24 19:10 fujitatomoya

@Crola1702 do we need to backport https://github.com/ros2/rclcpp/pull/2650 to downstream distributions?

fujitatomoya avatar Oct 19 '24 21:10 fujitatomoya

@Crola1702 do we need to backport https://github.com/ros2/rclcpp/pull/2650 to downstream distributions?

Yes that's right

Crola1702 avatar Oct 28 '24 19:10 Crola1702

@fujitatomoya I opened https://github.com/ros2/rclcpp/pull/2657

Crola1702 avatar Oct 28 '24 20:10 Crola1702