rosbag2 icon indicating copy to clipboard operation
rosbag2 copied to clipboard

:man_farmer: `rosbag2_transport` Test regressions with `cyclonedds` + flaky with `fastrtps`

Open Blast545 opened this issue 4 years ago • 10 comments

Description

New test regressions in the windows buildfarm jobs

nightly_win_rel Patient zero: https://ci.ros2.org/view/nightly/job/nightly_win_rel/2137/

projectroot.test_record__rmw_cyclonedds_cpp
rosbag2_transport.RecordIntegrationTestFixture.receives_latched_messages

nightly_win_deb

Patient zero according to logs: https://ci.ros2.org/view/nightly/job/nightly_win_deb/2192/ Patient zero according to CI: https://ci.ros2.org/view/nightly/job/nightly_win_deb/2193/

projectroot.test_record__rmw_cyclonedds_cpp
rosbag2_transport.RecordIntegrationTestFixture.qos_is_stored_in_metadata
rosbag2_transport.RecordIntegrationTestFixture.records_sensor_data
rosbag2_transport.RecordIntegrationTestFixture.receives_latched_messages

Expected Behavior

All test pass, build result is SUCCESS.

Actual Behavior

The mentiones tests are failing.

Additional context

I have reasons to believe it's related to #920. I'll open a PR reverting #920 #921 and #930 to see CI results.

Partial error log:

26: [----------] 6 tests from RecordIntegrationTestFixture^M
26: [ RUN      ] RecordIntegrationTestFixture.published_messages_from_multiple_topics_are_recorded^M
26: stdin is not a terminal or conso    qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqle device. Keyboard handling disabled.[INFO] [1638535392.124965000] [rosbag2_recorder]: Press SPACE for pausing/resu\
ming^M
26: [INFO] [1638535392.125188100] [rosbag2_recorder]: Listening for topics...^M
26: [INFO] [1638535392.133262200] [rosbag2_recorder]: Subscribed to topic '/string_topic'^M
26: [INFO] [1638535392.139954900] [rosbag2_recorder]: Subscribed to topic '/array_topic'^M
26: [INFO] [1638535392.149985100] [rosbag2_recorder]: All requested topics are subscribed. Stopping discovery...^M
26: [       OK ] RecordIntegrationTestFixture.published_messages_from_multiple_topics_are_recorded (1000 ms)^M
26: [ RUN      ] RecordIntegrationTestFixture.qos_is_stored_in_metadata^M
26: stdin is not a terminal or console device. Keyboard handling disabled.[INFO] [1638535393.100816600] [rosbag2_recorder]: Press SPACE for pausing/resuming^M
26: [INFO] [1638535393.100995300] [rosbag2_recorder]: Listening for topics...^M
26: [ERROR] [1638535393.101764300] [ROSBAG2_TRANSPORT]: Topic '/chatter' has more than one type associated. Only topics with one type are supported^M
...
26: [ERROR] [1638535403.095305700] [ROSBAG2_TRANSPORT]: Topic '/chatter' has more than one type associated. Only topics with one type are supported^M
26: C:\ci\ws\src\ros2\rosbag2\rosbag2_transport\test\rosbag2_transport\test_record.cpp(107): error: Value of: pub_manager.wait_for_matched(topic.c_str())^M
26:   Actual: false^M
26: Expected: true^M
26: [INFO] [1638535403.208178900] [rosbag2_recorder]: Subscribed to topic '/chatter'^M
26: [INFO] [1638535403.208383000] [rosbag2_recorder]: All requested topics are subscribed. Stopping discovery...^M
26: [  FAILED  ] RecordIntegrationTestFixture.qos_is_stored_in_metadata (10456 ms)^M

The reason of the error seems to be related to the topic chatter having another type associated to it. This doesn't make sense to me, as the chatter publisher seems to be uniquely instantiated here: https://github.com/ros2/rosbag2/blob/e9686a3c194df39d0688792879d00286cdc62a96/rosbag2_transport/test/rosbag2_transport/test_record.cpp#L97

Blast545 avatar Dec 08 '21 04:12 Blast545

This would probably me more closely related to https://github.com/ros2/rosbag2/pull/924 - which implements topic filtering, including "multi-type detection". May be more fruitful to look at that one first.

Based on https://ci.ros2.org/view/nightly/job/nightly_win_deb/2193/#showFailuresLink - it seems that "graceful shutdown" stuff might also be failing? In this case, perhaps there are some orphaned test publishers from other tests, publishing to /chatter, that are of a different type than the one intended in the test? Not sure, just a hypothesis.

It may be worth considering some convention of "universally unique topic name" (e.g. use a UUID as the namespace or something) for test topics, so that there is no possibility of confusion on that particular front.

emersonknapp avatar Dec 08 '21 17:12 emersonknapp

This would probably me more closely related to #924

@emersonknapp I think it's related to #920 mainly because the timing it was merged and the time the error starting appearing coincides. However, as the test failiure relates to a state maybe left from another test, I'll take a depeer look to see where it comes from.

it seems that "graceful shutdown" stuff might also be failing?

You mean the test failures related to graceful shutdown? Those seem to be errors that are not happening anymore, you can see the latest nightly: https://ci.ros2.org/view/nightly/job/nightly_win_deb/2204/

It may be worth considering some convention of "universally unique topic name" (e.g. use a UUID as the namespace or something) for test topics

I have mixed feelings on that front, maybe if a test requires a complete clean state it should have pre condition test checks or if a particular failure test can leave an unclean test it should test it.

Blast545 avatar Dec 14 '21 21:12 Blast545

:man_farmer: Some jobs I triggered to get info on the cause of the problem:

Windows job, forcing debug log to track others tests using the chatter topic: Build Status (The error happens, although is not logged in the job summary)

Testing if the error ocurrs when the package is tested without testing other packages: Build Status Build Status Only fails when tested with other packages.

The error only ocurrs with cycloneDDs, I wasn't able to test it only with that DDS without https://github.com/ros2/rmw_implementation/pull/201 merged.

Blast545 avatar Dec 23 '21 18:12 Blast545

I suspect @emersonknapp's hypothesis is spot on. E.g. there're plenty /chatter topic publishers in test_security tests that are terminated by launch in its current, not very nice manner.

hidmic avatar Jan 04 '22 22:01 hidmic

@hidmic What do you mean that these are not terminated in a nice manner?

Blast545 avatar Jan 05 '22 22:01 Blast545

Running a Windows CycloneDDS job to get more info: Build Status

Blast545 avatar Jan 06 '22 14:01 Blast545

What do you mean that these are not terminated in a nice manner?

@Blast545 On Windows, launch (and by extension launch_testing) stops processes by hard terminating them. There's no Unix signals equivalent there. Processes are not allowed to terminate gracefully, which sometimes causes problems. That may be the case here, but it's still a hypothesis. Haven't confirmed it yet.

hidmic avatar Jan 06 '22 14:01 hidmic

FYI this stopped appearing in the buildfarm after the default RMW was changed. See https://github.com/ros2/rmw/pull/315. @hidmic @clalancette

Blast545 avatar Feb 02 '22 14:02 Blast545

This error has appeared flakily with fastrtps on Windows, see:

https://ci.ros2.org/view/nightly/job/nightly_win_rel/2251/ https://ci.ros2.org/view/nightly/job/nightly_win_rel/2257/ https://ci.ros2.org/view/nightly/job/nightly_win_rel/2261/

https://ci.ros2.org/view/nightly/job/nightly_win_deb/2317/

It hasn't appeared in the repeated jobs though.

Blast545 avatar Apr 04 '22 22:04 Blast545

I marked you as assigned and moved this to the "in progress" section of the Humble board mainly so others don't take a look into this right now @clalancette

Blast545 avatar Apr 11 '22 17:04 Blast545

@Blast545 @clalancette Can we close this issue or it is still appear on build farm?

MichaelOrlov avatar Feb 01 '23 03:02 MichaelOrlov

@Crola1702 Can you please take a look?

Blast545 avatar Feb 01 '23 15:02 Blast545

Can we close this issue or it is still appear on build farm?

Last time it happened was on Windows release 2022-04-13. I will close this one as it's not happening anymore.

Crola1702 avatar Feb 01 '23 16:02 Crola1702