rushstack icon indicating copy to clipboard operation
rushstack copied to clipboard

Fix more instances of the child_process.on('exit') race condition

Open octogonz opened this issue 1 month ago • 0 comments

Summary

Following up from PR https://github.com/microsoft/rushstack/pull/4711, I audited the Rush Stack code base to look for any other invocations that might be affected by the same issue.

That PR did not trigger publishing of Rush, so I've also included a rush change file to publish a Rush release.

Details

In this lengthy Zulip chat we investigated a Rush build cache failure on Windows Subsystem for Linux (WSL) that turned out to be a race condition in the way that child_process.on('exit') was handled.

Background info

The sequence of events is like this:

  • child_process.on('data') is being used to collect STDERR and STDOUT
  • child_process.on('exit') fires indicating that the process has terminated
  • child_process.on('data') continues to fire a few more times as STDERR/STDOUT finish flushing 👈👈👈
  • child_process.on('close') finally fires indicating that the flush is complete, and includes the same exit code information that was available in 'exit'

Note that child_process.on('error') can also fire, either up front to indicate spawning failed entirely (in which case there is no exit event at all), or after some data has been received to indicate other kinds of errors.

Therefore, child_process.on('exit') is redundant and does NOT normally need to be handled at all; its main value is if the parent wants to be notified as early as possible.

The bug

The race condition bug occurs when functions are handling data to collect output, but handle exit by rejecting immediately. In this case, their STDERR/STDOUT can get truncated if the flushing is not complete.

This race condition is somewhat rare. In the case of https://github.com/microsoft/rushstack/pull/4711, the affected code had existed for years and years without any issues being reported. But on Windows Subsystem for Linux, @akres encountered a situation where it caused a noticeable malfunction and was reproducible.

How it was tested

I did not try to reproduce the race condition in the modified code. However I did confirm that removing the 500ms delay does not break Rundown.

octogonz avatar May 17 '24 19:05 octogonz