rushstack
rushstack copied to clipboard
Fix more instances of the child_process.on('exit') race condition
Summary
Following up from PR https://github.com/microsoft/rushstack/pull/4711, I audited the Rush Stack code base to look for any other invocations that might be affected by the same issue.
That PR did not trigger publishing of Rush, so I've also included a rush change
file to publish a Rush release.
Details
In this lengthy Zulip chat we investigated a Rush build cache failure on Windows Subsystem for Linux (WSL) that turned out to be a race condition in the way that child_process.on('exit')
was handled.
Background info
The sequence of events is like this:
-
child_process.on('data')
is being used to collect STDERR and STDOUT -
child_process.on('exit')
fires indicating that the process has terminated -
child_process.on('data')
continues to fire a few more times as STDERR/STDOUT finish flushing 👈👈👈 -
child_process.on('close')
finally fires indicating that the flush is complete, and includes the same exit code information that was available in'exit'
Note that child_process.on('error')
can also fire, either up front to indicate spawning failed entirely (in which case there is no exit
event at all), or after some data
has been received to indicate other kinds of errors.
Therefore, child_process.on('exit')
is redundant and does NOT normally need to be handled at all; its main value is if the parent wants to be notified as early as possible.
The bug
The race condition bug occurs when functions are handling data
to collect output, but handle exit
by rejecting immediately. In this case, their STDERR/STDOUT can get truncated if the flushing is not complete.
This race condition is somewhat rare. In the case of https://github.com/microsoft/rushstack/pull/4711, the affected code had existed for years and years without any issues being reported. But on Windows Subsystem for Linux, @akres encountered a situation where it caused a noticeable malfunction and was reproducible.
How it was tested
I did not try to reproduce the race condition in the modified code. However I did confirm that removing the 500ms delay does not break Rundown.