Build failure masked as a RUN_ERROR
tgoetsch@ch-fe1:/usr/projects/hpctools/tgoetsch/repos/pav2-lanl2-> cat /usr/projects/hpctest/pavilion/2.4/working_dir/test_runs/914/job/info
{"id": "7647116", "sys_name": "chicoma"}
tgoetsch@ch-fe1:/usr/projects/hpctools/tgoetsch/repos/pav2-lanl2-> cat /usr/projects/hpctest/pavilion/2.4/working_dir/test_runs/914/status
1698862272.972506 STATUS_CREATED Created status file.
1698862272.984902 CREATED Test directory and status file created.
1698862272.990294 BUILD_CREATED Builder created.
1698862272.995562 CREATED Test directory setup complete.
1698862278.854854 BUILD_WAIT Waiting on lock for build 4804c9b55cc8e944.
1698862278.859590 BUILDING Starting build 4804c9b55cc8e944.
1698862278.930399 BUILDING Extracting tarfile /usr/projects/hpctest/test_src/ior.tgz for build /usr/projects/hpctest/pavilion/2.4/working_dir/builds/4804c9b55cc8e944
1698862279.017952 BUILD_ERROR Error setting up build directory '/usr/projects/hpctest/pavilion/2.4/working_dir/builds/4804c9b55cc8e944': Error extracting file '/usr/projects/hpctest/test_src/ior.tgz'\n Could not extract tarfile '/usr/projects/hpctest/test_src/ior.tgz' into '/usr/projects/hpctest/pavilion/2.4/working_dir/builds/4804c9b55cc8e944': [Errno 2] No such file or directory: '/usr/projects/hpctest/pavilion/2.4/working_dir/builds/4804c9b55cc8e944/./doc/sphinx/userDoc/tutorial.rst'
1698862369.929424 SCHEDULED Test kicked off (individually) under slurm scheduler with 500 nodes.
1698862386.513605 PREPPING_RUN Converting run template into run script.
1698862386.514956 RUNNING Starting the run script.
1698862386.518336 RUN_ERROR Unknown error while running test. Refer to the kickoff log.
tgoetsch@ch-fe1:/usr/projects/hpctools/tgoetsch/repos/pav2-lanl2-> cat /usr/projects/hpctest/pavilion/2.4/working_dir/test_runs/914/job/kickoff.log
Yeah that's annoying I'll look into it.
Hey Dan, I think the cause is the lack of atomic file creation on Chicoma. So presumably the only issue on Pavilion's part is that the error was misreported.
This comes from Line 427 in the build method in lib/pavilion/builder.py:
    if not self._build(self.path, cancel_event, test_id, tracker):
In the _build method from line 516 in builder.py:
    try:
        self._setup_build_dir(build_dir, tracker)
    except TestBuilderError as err:
        tracker.error(
            note=("Error setting up build directory '{}': {}"
                  .format(build_dir, err)))
        return False
This fails, returns False, and writes the error messages seen in the status file (they come from extract.py:134, builder.py:713, and builder.py:520). Returning False triggers the fail-path logic but does not trigger a cancel event, so the show continues. I don't understand why it doesn't trigger a cancel event; there seems to be some underlying logic that keeps the test run from cancelling on a build error. You built these structures @pflarr, so what was your thought process, and what types of BUILD_ERROR will cause the runs to cancel?
That it doesn't trigger a cancel is almost certainly a bug. This has only been popping up with the atomic write issue on the Shasta filesystems though, so it went undetected for quite a while.
Right. That's because Cray Shasta systems are the only systems where builder._setup_build_dir fails. So you never see this BUILD_ERROR in other contexts. Easy fix.
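To check whether other runs hit the same masking, a rough sketch like the following could scan a status file for states that were recorded after a BUILD_ERROR. It assumes only the `TIMESTAMP STATE note` line layout visible in the status file from the original report; the function names and the `POST_BUILD_STATES` set are hypothetical and not part of Pavilion, and the sample text is an abbreviated copy of that status file.

```python
# Sketch only: the line format is inferred from the status file in the
# report; none of these names exist in Pavilion itself.

# States that should not follow a build failure (assumed set, taken from
# the states that appear in the report).
POST_BUILD_STATES = {"SCHEDULED", "PREPPING_RUN", "RUNNING", "RUN_ERROR"}


def parse_status_lines(text):
    """Yield (timestamp, state, note) for each 'TIMESTAMP STATE note' line."""
    for line in text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        try:
            timestamp = float(parts[0])
        except ValueError:
            # Not a status line (for example, the tail of a multi-line note).
            continue
        note = parts[2] if len(parts) == 3 else ""
        yield timestamp, parts[1], note


def states_after_build_error(text):
    """Return the run-phase states recorded after the first BUILD_ERROR."""
    seen_build_error = False
    later = []
    for _, state, _ in parse_status_lines(text):
        if state == "BUILD_ERROR":
            seen_build_error = True
        elif seen_build_error and state in POST_BUILD_STATES:
            later.append(state)
    return later


# Abbreviated copy of the status file from the report (notes shortened).
SAMPLE = """\
1698862278.859590 BUILDING Starting build 4804c9b55cc8e944.
1698862279.017952 BUILD_ERROR Error setting up build directory
1698862369.929424 SCHEDULED Test kicked off (individually) under slurm scheduler with 500 nodes.
1698862386.514956 RUNNING Starting the run script.
1698862386.518336 RUN_ERROR Unknown error while running test. Refer to the kickoff log.
"""

print(states_after_build_error(SAMPLE))
```

A non-empty result means the run carried on past a failed build, which is the pattern in run 914.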
Ok, actually. Looking closer at it. I think it's fixed already. See the passage below from builder.py:TestBuilder.build.
    with lockfile.LockFilePoker(lock):
        # Attempt to perform the actual build, this shouldn't
        # raise an exception unless something goes terribly
        # wrong.
        # This will also set the test status for
        # non-catastrophic cases.
        if not self._build(self.path, cancel_event, test_id, tracker):
            try:
                self.path.rename(self.fail_path)
            except FileNotFoundError as err:
                tracker.error(
                    "Failed to move build {} from {} to "
                    "failure path {}"
                    .format(self.name, self.path,
                            self.fail_path), err)
                try:
                    self.fail_path.mkdir()
                except OSError as err2:
                    tracker.error(
                        "Could not create fail directory for "
                        "build {} at {}"
                        .format(self.name, self.fail_path, err2))
            if cancel_event is not None:
                cancel_event.set()
            return False
If self._build returns False, which it does in the original case (where the status file shows 'Error setting up build directory'), and cancel_event is not None (it's a threading.Event), then cancel_event should get set. Perhaps the version you were using had something missing there, but it should work as far as I can tell. If you can recreate it with the current master, let me know and I'll poke at it.
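For anyone trying to recreate this, here is a stripped-down model of the pattern in the excerpt above. It is not Pavilion code: `build` and `start_tests` are hypothetical stand-ins, and the only real API used is `threading.Event`. It just illustrates that setting the event on a failed build only stops the run if whatever kicks off the tests checks the event afterwards.

```python
import threading


def build(setup_ok, cancel_event=None):
    """Hypothetical stand-in for a builder's build step.

    Returns False on failure and, as in the excerpt above, sets
    cancel_event when one was passed in.
    """
    if not setup_ok:
        if cancel_event is not None:
            cancel_event.set()
        return False
    return True


def start_tests(setup_results):
    """Hypothetical driver: build every test, then schedule only if no
    build signalled cancellation through the shared event."""
    cancel_event = threading.Event()
    built = [build(ok, cancel_event) for ok in setup_results]
    if cancel_event.is_set():
        return "cancelled", built
    return "scheduled", built


print(start_tests([True, False]))  # one failed build cancels the series
print(start_tests([True, True]))   # all builds fine, tests get scheduled
```

If the build is called with `cancel_event=None`, or the driver never checks `is_set()`, the failed build is recorded but the run still gets scheduled, which would look like the status file in the original report.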