rclpy icon indicating copy to clipboard operation
rclpy copied to clipboard

:farmer: Humble TestClient.test_service_timestamps failing consistently in Connext

Open Crola1702 opened this issue 1 year ago • 6 comments
trafficstars

Bug report

Required Info:

  • Operating System:
    • Ubuntu Jammy
  • Installation type:
    • Source
  • Version or commit hash:
    • Humble
  • DDS implementation:
    • RTI Connext
  • Client library (if applicable):
    • rclpy

Steps to reproduce issue

  1. Run a build in Hci__nightly-connext_ubuntu_jammy_amd64
  2. See rclpy test regression appear

Additional information

Reference build:

  • https://build.ros2.org/job/Hci__nightly-connext_ubuntu_jammy_amd64/266/

Test regressions:

Version of the first fail (Jul 26th build#258):

  ros2/rclpy:
    type: git
    url: https://github.com/ros2/rclpy.git
    version: 993b2b6d254a923503a049c14034cde07ebc750a

Log output:

../../src/ros2/rclpy/rclpy/test/test_client.py ..F...                    [100%]

=================================== FAILURES ===================================
______________________ TestClient.test_service_timestamps ______________________

self = <test.test_client.TestClient testMethod=test_service_timestamps>

    @unittest.skipIf(
        get_rmw_implementation_identifier() == 'rmw_connextdds' and platform.system() == 'Windows',
        reason='Source timestamp not implemented for Connext on Windows')
    def test_service_timestamps(self):
        cli = self.node.create_client(GetParameters, 'get/parameters')
        srv = self.node.create_service(
            GetParameters, 'get/parameters',
            lambda request, response: response)
        try:
            self.assertTrue(cli.wait_for_service(timeout_sec=20))
            cli.call_async(GetParameters.Request())
            for i in range(5):
                with srv.handle:
                    result = srv.handle.service_take_request(srv.srv_type.Request)
                if result is not None:
                    request, header = result
>                   self.assertTrue(header is not None)
E                   AssertionError: False is not true

../../src/ros2/rclpy/rclpy/test/test_client.py:110: AssertionError

Crola1702 avatar Aug 26 '24 15:08 Crola1702

This happened before with Connext https://github.com/ros2/rclpy/pull/818, but i cannot reproduce this problem on my local environment. what is the reproduction rate for this issue with nightly CI?

fujitatomoya avatar Aug 26 '24 21:08 fujitatomoya

Sorry i did not use Connext DDS, with Connext i can now reproduce the issue with https://github.com/ros2/ros2/blob/humble/ros2.repos

fujitatomoya avatar Aug 26 '24 22:08 fujitatomoya

srv.handle.service_take_request just takes the data from RMW implementation which is Connext in this case.

according the git history, nothing from rclpy, rmw_connextdds seems related to this behavior.

root@tomoyafujita:~/ros2_ws/humble_ws# dpkg -l rti-connext-dds-6.0.1
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                  Version      Architecture Description
+++-=====================-============-============-=============================================
ii  rti-connext-dds-6.0.1 6.0.1.25-1   amd64        real-time messaging and integration platform.

@trubio-rti any thoughts?

fujitatomoya avatar Aug 26 '24 22:08 fujitatomoya

@fujitatomoya Hi, I've taken a look but I can't make much out of this issue. I'm unfortunately not working on the Connext RMW any more, sorry. I've created an internal issue and notified my colleague who's now working in the project.

trubio-rti avatar Aug 30 '24 11:08 trubio-rti

@trubio-rti have you had any updates from your colleague? Can we link them to this issue?

Crola1702 avatar Sep 09 '24 17:09 Crola1702

Hi, I'm the new person working on Connext RMW. I will look into this as soon as I have some bandwith along with some more training.

fgallegosalido avatar Sep 10 '24 13:09 fgallegosalido

I'm not able to reproduce the issue in my local environment in Humble using the Debian RTI Connext package. Can you provide me a reproducer? If you need info about what I'm doing, let me know

fgallegosalido avatar Sep 26 '24 09:09 fgallegosalido

i can reproduce with humble source build. (but cannot make it happen with rolling)

note that, reproduction rate is not great, once out of ten in my experience.

  • use https://github.com/ros2/ros2/blob/humble/ros2.repos
  • source build procedure: https://docs.ros.org/en/humble/Installation/Alternatives/Ubuntu-Development-Setup.html and https://docs.ros.org/en/humble/Installation/DDS-Implementations.html#rti-connext-version-6-0-1-amd64-only
root@tomoyafujita:~/ros2_ws/humble_ws# colcon test --event-handlers console_direct+ --retest-until-fail 100 --packages-select rclpy --ctest-args -R test_client
...
7: ../../src/ros2/rclpy/rclpy/test/test_client.py ..F...                    [100%]
7: 
7: =================================== FAILURES ===================================
7: ______________________ TestClient.test_service_timestamps ______________________
7: 
7: self = <test.test_client.TestClient testMethod=test_service_timestamps>
7: 
7:     @unittest.skipIf(
7:         get_rmw_implementation_identifier() == 'rmw_connextdds' and platform.system() == 'Windows',
7:         reason='Source timestamp not implemented for Connext on Windows')
7:     def test_service_timestamps(self):
7:         cli = self.node.create_client(GetParameters, 'get/parameters')
7:         srv = self.node.create_service(
7:             GetParameters, 'get/parameters',
7:             lambda request, response: response)
7:         try:
7:             self.assertTrue(cli.wait_for_service(timeout_sec=20))
7:             cli.call_async(GetParameters.Request())
7:             for i in range(5):
7:                 with srv.handle:
7:                     result = srv.handle.service_take_request(srv.srv_type.Request)
7:                 if result is not None:
7:                     request, header = result
7: >                   self.assertTrue(header is not None)
7: E                   AssertionError: False is not true
7: 
7: ../../src/ros2/rclpy/rclpy/test/test_client.py:110: AssertionError
7: - generated xml file: /root/ros2_ws/humble_ws/build/rclpy/test_results/rclpy/test_client.xunit.xml -
7: =========================== short test summary info ============================
7: FAILED ../../src/ros2/rclpy/rclpy/test/test_client.py::TestClient::test_service_timestamps
7: ========================= 1 failed, 5 passed in 5.36s ==========================
7: -- run_test.py: return code 1
7: -- run_test.py: verify result file '/root/ros2_ws/humble_ws/build/rclpy/test_results/rclpy/test_client.xunit.xml'
    Test #7: test_client ......................***Failed    5.97 sec

0% tests passed, 1 tests failed out of 1

Label Time Summary:
pytest    =   5.97 sec*proc (1 test)

Total Test time (real) =  65.84 sec

The following tests FAILED:
	  7 - test_client (Failed)
Errors while running CTest                    
Output from these tests are in: /root/ros2_ws/humble_ws/build/rclpy/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
--- stderr: rclpy
Errors while running CTest
Output from these tests are in: /root/ros2_ws/humble_ws/build/rclpy/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
---
Finished <<< rclpy [1min 6s]	[ with test failures ]

Summary: 1 package finished [1min 7s]
  1 package had stderr output: rclpy
  1 package had test failures: rclpy

fujitatomoya avatar Sep 26 '24 19:09 fujitatomoya

I was able to reproduce it. Working on it

fgallegosalido avatar Sep 27 '24 11:09 fgallegosalido

I think the issue is in the test itself. For instance, we are checking result is not None, but the function Service::service_take_request either returns the corresponding result, or (None, None) if the service take fails. So we are basically checking for a value that result is never going to have.

Unless I'm missing something, the fix would be as simple as changing if result is not None: for if result != (None, None): or something similar. With that change, the test passes consistently again. It's important to note that rolling would benefit from this change too, even though it doesn't seem to have the same issue.

Let me know what you think.

fgallegosalido avatar Sep 30 '24 17:09 fgallegosalido

@fgallegosalido good eye, i believe that is correct! test is picking up the result from the failure of taking request data. i will check the patch with 100 loop for this test and make the PR for rclpy. as you suggested, we can have this fix into rolling and then backport all 👍

fujitatomoya avatar Oct 01 '24 05:10 fujitatomoya