ClusterRunner
Functional test fails due to race condition on Windows
test_shutdown_all_slaves_while_build_is_running_should_finish_build_then_kill_slaves
failed on AppVeyor. See full output below.
From looking at the logs, I think this may just be an annoying side effect of running two slave processes simultaneously on the same host on Windows. Both slaves try to run COPY build_setup.txt !MY_SUBJOB_FILE!
during build setup, and my guess is that two processes copying the same file at the same time makes Windows unhappy.
Multiple slave processes on the same host are not forbidden, but they will share some things like repo clones, which can lead to surprising behavior. But since there aren't really any use cases for running two slave processes on the same host (outside of functional testing), I think we should view this as an issue with the test and not as a bug in ClusterRunner itself.
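To make that concrete, here is a minimal, hypothetical sketch (Windows-only, requires pywin32, and not part of the test suite or the AppVeyor output) of the kind of sharing violation Windows raises. It is not necessarily exactly what cmd's COPY does under the hood, but it produces the same "file is being used by another process" error that shows up in the output below.

```python
# Hypothetical repro sketch, Windows-only, assumes pywin32 is installed.
# One handle is opened with share mode 0 (deny all sharing) to stand in for
# slave A touching build_setup.txt; the shutil.copy stands in for slave B
# copying the same file, which then fails with WinError 32.
import shutil
from pathlib import Path

import win32file  # pywin32

SOURCE_FILE = 'build_setup.txt'  # stand-in for the file both slaves copy
Path(SOURCE_FILE).write_text('build setup contents\n')

# "Slave A": hold the source file open with no sharing allowed.
exclusive_handle = win32file.CreateFile(
    SOURCE_FILE,
    win32file.GENERIC_READ,
    0,                        # dwShareMode=0: no other handle may be opened
    None,                     # default security attributes
    win32file.OPEN_EXISTING,
    0,                        # no special flags or attributes
    None,                     # no template file
)

try:
    # "Slave B": a concurrent attempt to copy the same file now fails with
    # PermissionError (winerror 32, "The process cannot access the file
    # because it is being used by another process").
    shutil.copy(SOURCE_FILE, 'subjob_file_1.txt')
except PermissionError as error:
    print('Copy failed as expected:', error)
finally:
    exclusive_handle.Close()
```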
[2017-07-05 21:20:08.780] 1592 WARNING MasterTornadoTh tornado.access 404 PUT /v1/slave/1 (127.0.0.1) 0.00ms
[2017-07-05 21:20:08.827] 1348 NOTICE Bld1-Sub0 directory Command exited with non-zero exit code.
Command: set ATOM_ID=0&& set BUILD_EXECUTOR_INDEX=0&& set PROJECT_DIR=C:\Users\appveyor\AppData\Local\Temp\1\tmpqjsa93r7&& set ARTIFACT_DIR=C:\Users\appveyor\AppData\Local\Temp\1\tmpa26_fv33\artifacts\1\artifact_0_0&& set MACHINE_EXECUTOR_INDEX=0&& set EXECUTOR_INDEX=0&& set SUBJOB_NUMBER=1&& echo Doing subjob !SUBJOB_NUMBER!. && ping 127.0.0.1 -n 2 >nul && set MY_SUBJOB_FILE=!PROJECT_DIR!\subjob_file_!SUBJOB_NUMBER!.txt && COPY build_setup.txt !MY_SUBJOB_FILE! >nul && echo subjob !SUBJOB_NUMBER!.>> !MY_SUBJOB_FILE!
Exit code: 1
Console output: Doing subjob 1.
The process cannot access the file because it is being used by another process.
======================================================================
FAIL: test_shutdown_all_slaves_while_build_is_running_should_finish_build_then_kill_slaves (test.functional.master.test_shutdown.TestShutdown)
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\projects\clusterrunner\test\functional\master\test_shutdown.py", line 62, in test_shutdown_all_slaves_while_build_is_running_should_finish_build_then_kill_slaves
self.assert_build_has_successful_status(build_id=build_id)
File "C:\projects\clusterrunner\test\framework\functional\base_functional_test_case.py", line 106, in assert_build_has_successful_status
self.assert_build_status_contains_expected_data(build_id, expected_successful_build_params)
File "C:\projects\clusterrunner\test\framework\functional\base_functional_test_case.py", line 92, in assert_build_status_contains_expected_data
'Build status API response should contain the expected status data.')
nose.proxy.AssertionError: Mismatched values: 'result', expected: 'NO_FAILURES', actual: 'FAILURE' : Build status API response should contain the expected status data.
-------------------- >> begin captured stdout << ---------------------
[2017-07-05 21:20:38.067] 1880 NOTICE MainThread functional_test Gracefully killing process with pid 1592...
[2017-07-05 21:20:38.067] 1880 NOTICE MainThread functional_test Gracefully killing process with pid 1348...
[2017-07-05 21:20:38.067] 1880 NOTICE MainThread functional_test Gracefully killing process with pid 1516...
--------------------- >> end captured stdout << ----------------------
Yeah, dealing with exclusive file handles on shared resources is something that users of ClusterRunner should be cognizant of when deploying with multiple executors.
If we wanted to avoid this in our functional tests, we could do something hacky like adding a sleep ($EXECUTOR_INDEX * 5)
to the test so that we stagger when the executors try to grab the file, something like the sketch below.
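Rough sketch of that idea (not the actual functional-test config; STAGGER is a made-up variable, and the rest mirrors the env vars in the failing command above):

```python
# Rough sketch only, not the real job config. The idea: prefix the copy with a
# delay proportional to EXECUTOR_INDEX so concurrent executors grab
# build_setup.txt at different times. "ping 127.0.0.1 -n N" waits roughly N-1
# seconds, which is the same sleep trick the failing test command already uses.
STAGGER_SECONDS_PER_EXECUTOR = 5

staggered_copy_command = (
    'set /a STAGGER=!EXECUTOR_INDEX! * {seconds} + 1'
    ' && ping 127.0.0.1 -n !STAGGER! >nul'
    ' && COPY build_setup.txt !MY_SUBJOB_FILE! >nul'
).format(seconds=STAGGER_SECONDS_PER_EXECUTOR)
```

The delayed-expansion !VAR! syntax matches the existing command, so executor 0 would start its copy immediately and each later executor would wait another 5 seconds first.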
That's a good point -- this could still be an issue even with only a single slave on a host because of multiple executors.