rosbag2 icon indicating copy to clipboard operation
rosbag2 copied to clipboard

Gracefully handle CTRL+C and CTR+Break events on Windows

Open MichaelOrlov opened this issue 1 year ago • 34 comments

  • Add handler for the native Windows CTRL_C_EVENT, CTRL_BREAK_EVENT and CTRL_CLOSE_EVENT in rosbag2 recorder and player.

  • Map SIGINT signal to the native Windows CTRL_C_EVENT in the stop_execution(handle, signum) version for Windows to be able correctly use start and stop execution routines from unit tests.

  • Uses CREATE_NEW_PROCESS_GROUP when creating new process in process execution helper functions to be able to properly send native Windows CTRL_C_EVENT to the newly spawned child process.

  • Creates new process suspended and resumes it after adding to the newly created job. To avoid a potential race condition where a newly created process starts a subprocess before we've called AssignProcessToJobObject();

  • Enable integration tests that were disabled for Windows due to the incorrect sending and handling of the SIGINT event.

  • Got rid of the finalize_metadata_kludge(..) helper function.

  • ~~Depends from #1301~~

  • Relates #927

  • Closes #926

  • Closes #1326

  • Closes #1270

Note: According to the https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/signal?view=msvc-170

SIGINT is not supported for any Win32 application. When a CTRL+C interrupt occurs, Win32 operating systems generate a new thread to specifically handle that interrupt.

The SIGILL and SIGTERM signals aren't generated under Windows. They're included for ANSI compatibility. Therefore, you can set signal handlers for these signals by using signal, and you can also explicitly generate these signals by calling raise.

The CTRL_CLOSE_EVENT is an analog of the SIGTERM from POSIX. Windows sends CTRL_CLOSE_EVENT to all processes attached to a console when the user closes the console (either by clicking Close on the console window's window menu, or by clicking the End Task button command from Task Manager). Although according to the https://learn.microsoft.com/en-us/windows/console/generateconsolectrlevent the GenerateConsoleCtrlEvent(..) doesn't support sending CTRL_CLOSE_EVENT. There is no way to directly send CTRL_CLOSE_EVENT to the process in the same console application. Therefore, added SIGTERM to the unsupported events in the stop_execution(process_handle, signum) API.

  • Update: Removed dependencies from #1301 and rebased on top of the latest Rolling to be able to move forward with this PR and easily backport it to the Iron and Humble.

MichaelOrlov avatar May 22 '23 00:05 MichaelOrlov

I wonder if this is something that we should put into rclcpp rather than here. That is, it seems to me that all ROS 2 binaries would benefit from this kind of fix.

I think it makes sense to put a version of this in ROS 2, but I think @MichaelOrlov wants to get this fix in for Iron's version of rosbag2, so it seems reasonable to do it here as well for now. Maybe in the longer term if we get something landed in rolling, he can deprecate it here to reduce redundancy?

mjcarroll avatar May 22 '23 12:05 mjcarroll

@clalancette, I agree with @mjcarroll and yes I would like to keep it in rosbag2 to be able to backport it to the iron branch. And will get rid of redundancy when it will be available in the rclcpp layer.

MichaelOrlov avatar May 22 '23 17:05 MichaelOrlov

And will get rid of redundancy when it will be available in the rclcpp layer.

@MichaelOrlov would you be willing to make the patch to add the helpers to rclcpp?

emersonknapp avatar May 25 '23 23:05 emersonknapp

@MichaelOrlov would you be willing to make the patch to add the helpers to rclcpp?

@emersonknapp Yes. I can try to make the changes in rcpcpp layer but a bit later on, it might need to make changes on the rclpy side as well. I think after merging this PR.

MichaelOrlov avatar May 28 '23 19:05 MichaelOrlov

  • Update: Removed dependencies from https://github.com/ros2/rosbag2/pull/1301 and rebased on top of the latest Rolling to be able to move forward with this PR and easily backport it to the Iron and Humble. @clalancette @emersonknapp Please review when you will have time.

MichaelOrlov avatar Jun 03 '23 20:06 MichaelOrlov

Gist: https://gist.githubusercontent.com/MichaelOrlov/55adb4c4d5dac5435adace5c095d45b2/raw/09cc03757951ddc3c5ec8885a4d766a84f26207f/ros2.repos BUILD args: --packages-above-and-dependencies rosbag2_py rosbag2_test_common ros2bag rosbag2_transport rosbag2_tests TEST args: --packages-above rosbag2_py rosbag2_test_common ros2bag rosbag2_transport rosbag2_tests ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/12152

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

MichaelOrlov avatar Jun 03 '23 20:06 MichaelOrlov

Returning back to draft due to some unexpected test failures on CI which are not reproduced locally and need to be investigated.

MichaelOrlov avatar Jun 05 '23 07:06 MichaelOrlov

Re-run CI after rebase and adjusting expectations in flaky ros2bag record tests ci_launcher ran: https://ci.ros2.org/job/ci_launcher/12197

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

MichaelOrlov avatar Jun 11 '23 08:06 MichaelOrlov

It seems that CI failure on windows in rosbag2_tests.TestPlayEndToEnd/PlayEndToEndTestFixture. relates to the way how we are running tests on Windows CI. It fails with the error messages

C:\ci\ws\install\include\rosbag2_test_common\rosbag2_test_common\process_execution_helpers_windows.hpp(177): error: Value of: GenerateConsoleCtrlEvent(CTRL_C_EVENT, handle.process_info.dwProcessId)
  Actual: false
Expected: true
Error code:1. Error message: Incorrect function.

The error code is very weird and I can't reproduce this failure on my local Windows + ROS2 Rolling setup. However, I found a discussion on the docker-related forum Can't call GenerateConsoleCtrlEvent system call in Windows containers](https://github.com/docker/for-win/issues/3173) which is describes exactly the same problem with the GenerateConsoleCtrlEvent system call and its Error code:1. Error message: Incorrect function. According to one of the reply

@lox have you tried Windows Server 2019? It is fixed in the ltsc2019 image. I can confirm this works in ltsc2019 images, closing!

It seems this issue has been fixed on the Windows Server 2019 @mjcarroll @clalancette What version of the Windows and docker containers are we using for running our Windows CI?

MichaelOrlov avatar Jun 12 '23 07:06 MichaelOrlov

@nuclearsandwich If I am not mistaken I recall that we had a plan to move to the Windows Server 2019 on our Windows CI builds.
Could you please comment about it if it is true? I came across with an issue that calling to GenerateConsoleCtrlEvent returns Incorrect function error when running it on Windows 10 under the docker image.

MichaelOrlov avatar Jun 13 '23 20:06 MichaelOrlov

Re-running CI after rebase since on today's ROS2 weekly synch meeting was mentioned that buildfarm CI already migrated to the Windows Server 2020.
Will see if the issue still persists. ci_launcher ran: https://ci.ros2.org/job/ci_launcher/12301 Building on the built-in node in workspace /var/lib/jenkins/jobs/ci_launcher/workspace

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

MichaelOrlov avatar Jun 27 '23 16:06 MichaelOrlov

Re-run Windows CI job after adding ctrl_c_event_can_be_send_and_received sanity test

  • Windows Build Status

MichaelOrlov avatar Jul 02 '23 07:07 MichaelOrlov

Hi @claraberendsen, You mentioned to poing you if the issue with Windows not being able to send the GenerateConsoleCtrlEvent will reproduce after re-run on the fresh Windows build. It is still failing https://ci.ros2.org/job/ci_windows/19807/ I have added the "minimum valuable reproducer" ctrl_c_event_can_be_send_and_received test to show that issue reproduces even with a simple application with an endless loop without dependencies from other packages. Please refer to my latest commit Add basic ctrl_c_event_can_be_send_and_received test for details.

The tests which are failing on the CI are passes on my local Windows setup without docker container. Here is the link to a similar issue related to Windows under the docker Can't call GenerateConsoleCtrlEvent system call in Windows containers](https://github.com/docker/for-win/issues/3173) use case.

It seems someone needs to contact Microsoft Support for this issue. Since I have no more ideas on how to fix this issue. Could you please handle this issue or forward it to someone who can handle it?

MichaelOrlov avatar Jul 02 '23 23:07 MichaelOrlov

It seems someone needs to contact Microsoft Support for this issue. Since I have no more ideas on how to fix this issue. Could you please handle this issue or forward it to someone who can handle it?

@MichaelOrlov Thanks for pinging me I'm going to take a look at this and see what I find and if needed be forward it.

claraberendsen avatar Jul 04 '23 19:07 claraberendsen

@MichaelOrlov I can confirm that we are using the new WS2022 image. In the logs of the build you referenced above you can see the change on the url for the windows image to point to /windows/server which is a new built image for WS2022 (See as reference to the previous 2019 image).

claraberendsen avatar Jul 17 '23 15:07 claraberendsen

@claraberendsen As regards

I can confirm that we are using the new WS2022 image

I know. I have seen it from logs. The problem is that issue reproduces even on the new WS2022 image.
This is why I mentioned that

It seems someone needs to contact Microsoft Support for this issue. Since I have no more ideas on how to fix this issue.

MichaelOrlov avatar Jul 17 '23 16:07 MichaelOrlov

@MichaelOrlov We do not currently have the bandwidth to investigate this further. I see you pointed that in your local machine you are not seeing this error only when running the CI because we utilize docker. I don't have clarity that the issue comes from using docker or the docker image of our setup. To verify that the issue comes from it being ran inside docker further testing should be conducted by running this on a WS2019 image and a WS2022 image. I'm pinging @clalancette to see if he can help unblock this.

claraberendsen avatar Jul 19 '23 15:07 claraberendsen

@claraberendsen I don't see a rationale in trying to run the test on the different WS2019 and a WS2022 images under the docker.
The fact that on our latest WS2022 image in the docker container, any call to GenerateConsoleCtrlEvent returns an Incorrect function error is enough.
IMO the next step would be to reach out to Microsoft and say: "We are using your licensed latest WS2022 image under the docker container version ... and any call to the GenerateConsoleCtrlEvent returns an Incorrect function error". This behavior doesn't correspond to your spec written here https://learn.microsoft.com/en-us/windows/console/generateconsolectrlevent"

MichaelOrlov avatar Jul 19 '23 18:07 MichaelOrlov

I'm pinging @clalancette to see if he can help unblock this.

I don't have any particular knowledge about what is going on with Windows. Maybe @ooeygui can help get us in touch with the right people?

clalancette avatar Jul 19 '23 18:07 clalancette

@clalancette I'm looking into it.

ooeygui avatar Jul 20 '23 19:07 ooeygui

@clalancette I'm looking into it.

Thank you, much appreciated!

clalancette avatar Jul 20 '23 19:07 clalancette

@ooeygui Friendly ping. Any thoughts here?

clalancette avatar Aug 22 '23 15:08 clalancette

@clalancette I've been trying to track this down internally - it looks like there was a fix, but it was backed out due to compatibility issues. I am trying to find an alternative method, but there does not appear to be a supported graceful solution.

ooeygui avatar Aug 28 '23 18:08 ooeygui

@clalancette I found the owner of the API. It looks like my statement about a reverted fix was incorrect, that this should work.

I'm working on a small repro to root cause.

ooeygui avatar Sep 18 '23 16:09 ooeygui

Both the owner of the Windows API and I have been attempting to reproduce this, but haven't been able to. Does this still repro for you?

ooeygui avatar Sep 21 '23 19:09 ooeygui

@ooeygui Originally this issue was not reproducible on my local setup. But as you can see CI jobs fail.

MichaelOrlov avatar Sep 21 '23 20:09 MichaelOrlov

Thanks @MichaelOrlov. @clalancette Do you know where I can find the windows container spec? I'd like to try building it locally.

ooeygui avatar Sep 21 '23 20:09 ooeygui

Do you know where I can find the windows container spec? I'd like to try building it locally.

Our Windows containers are currently running on Windows Server 2022 hosts running in AWS EC2. We're using the Docker engine for Windows in the non-virtualized mode so we can take advantage of shared mounts between the container and host system (this forces our containers and the host Windows version to be the same, unlike the virtualized container engine which does not support shared filesystem mounts between host and container) on ci.ros2.org you can see our dockerfile template here: https://github.com/ros2/ci/blob/master/windows_docker_resources/Dockerfile and you can inspect the logs to see the build arguments that a given job is run with. Within the container most of the heavy lifting for setup and configuration is managed by this chef cookbook https://github.com/ros-infrastructure/ros2-cookbooks/tree/latest/cookbooks/ros2_windows

nuclearsandwich-ai avatar Sep 21 '23 23:09 nuclearsandwich-ai

@ooeygui @nuclearsandwich-ai Any updates on this issue?

cc: @clalancette

MichaelOrlov avatar Dec 27 '23 23:12 MichaelOrlov

Hi @MichaelOrlov, I spoke with the owner of the API which exhibits this issue. We have not been able to repduce this issue, so cannot debug it. Discussing the runtime environment, we believe the root cause is the version difference between the container and the host (Windows containers behave differently than linux containers; which requires that they be the same version of the OS)

In a future version of Windows, the API has been refactored so it should not exhibit this problem, but this depends on migrating the host and clients CI to a more current version of Windows.

I'd recommend disabling the test on Windows until this migration takes place.

ooeygui avatar Jan 04 '24 03:01 ooeygui