pavilion2 Build failure masked as a RUN_ERROR

    tgoetsch@ch-fe1:/usr/projects/hpctools/tgoetsch/repos/pav2-lanl2-> cat /usr/projects/hpctest/pavilion/2.4/working_dir/test_runs/914/job/info
    {"id": "7647116", "sys_name": "chicoma"}
    tgoetsch@ch-fe1:/usr/projects/hpctools/tgoetsch/repos/pav2-lanl2-> cat /usr/projects/hpctest/pavilion/2.4/working_dir/test_runs/914/status
    1698862272.972506 STATUS_CREATED Created status file.
    1698862272.984902 CREATED Test directory and status file created.
    1698862272.990294 BUILD_CREATED Builder created.
    1698862272.995562 CREATED Test directory setup complete.
    1698862278.854854 BUILD_WAIT Waiting on lock for build 4804c9b55cc8e944.
    1698862278.859590 BUILDING Starting build 4804c9b55cc8e944.
    1698862278.930399 BUILDING Extracting tarfile /usr/projects/hpctest/test_src/ior.tgz for build /usr/projects/hpctest/pavilion/2.4/working_dir/builds/4804c9b55cc8e944
    1698862279.017952 BUILD_ERROR Error setting up build directory '/usr/projects/hpctest/pavilion/2.4/working_dir/builds/4804c9b55cc8e944': Error extracting file '/usr/projects/hpctest/test_src/ior.tgz'\n Could not extract tarfile '/usr/projects/hpctest/test_src/ior.tgz' into '/usr/projects/hpctest/pavilion/2.4/working_dir/builds/4804c9b55cc8e944': [Errno 2] No such file or directory: '/usr/projects/hpctest/pavilion/2.4/working_dir/builds/4804c9b55cc8e944/./doc/sphinx/userDoc/tutorial.rst'
    1698862369.929424 SCHEDULED Test kicked off (individually) under slurm scheduler with 500 nodes.
    1698862386.513605 PREPPING_RUN Converting run template into run script.
    1698862386.514956 RUNNING Starting the run script.
    1698862386.518336 RUN_ERROR Unknown error while running test. Refer to the kickoff log.
    tgoetsch@ch-fe1:/usr/projects/hpctools/tgoetsch/repos/pav2-lanl2-> cat /usr/projects/hpctest/pavilion/2.4/working_dir/test_runs/914/job/kickoff.log

Nov 01 '23 18:11 tygoetsch

Yeah, that's annoying. I'll look into it.

Nov 09 '23 16:11 dmageeLANL

Hey Dan, I think the cause is the lack of atomic file creation on Chicoma. So presumably the only issue on Pavilion's part is that the error was misreported.
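For context, "atomic file creation" usually refers to creating a file with `O_CREAT | O_EXCL`, so that exactly one of several competing processes succeeds. The sketch below shows that general pattern only. It is not taken from Pavilion's lockfile module, `try_acquire` is a hypothetical name, and this thread does not establish that this is the exact mechanism at fault on Chicoma.

```python
import os
import tempfile


def try_acquire(lock_path):
    """Return True if this call created lock_path, False if it already existed.

    Hypothetical helper: relies on O_EXCL making the create-if-absent
    step a single atomic operation on the underlying filesystem.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


with tempfile.TemporaryDirectory() as tmp:
    lock_path = os.path.join(tmp, "build.lock")
    first = try_acquire(lock_path)   # creates the file
    second = try_acquire(lock_path)  # file already exists
    print(first, second)
```

If a filesystem does not honor that exclusivity guarantee, two builders could both believe they hold the lock and then race on the same build directory.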

Nov 13 '23 16:11 tygoetsch

This comes from Line 427 in the build method in lib/pavilion/builder.py:

    if not self._build(self.path, cancel_event, test_id, tracker):

In the _build method from line 516 in builder.py:

        try:
            self._setup_build_dir(build_dir, tracker)
        except TestBuilderError as err:
            tracker.error(
                note=("Error setting up build directory '{}': {}"
                      .format(build_dir, err)))
            return False

This call fails and returns False, and the status file picks up the error messages raised at extract.py:134, builder.py:713, and builder.py:520. The False return triggers the fail-path logic but apparently does not set a cancel event, so the run proceeds anyway. I don't understand why no cancel event is triggered; it looks like some underlying logic keeps a build error from cancelling the test run. You built these structures @pflarr, so what was your thought process, and which kinds of BUILD_ERROR will cause runs to cancel?
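One way to spot this kind of masking after the fact is to scan a test's status file for a BUILD_ERROR that is followed by later scheduling or run states. A rough sketch, assuming each status entry is a single line of the form `<timestamp> <STATE> <note>` as in the status output at the top of this issue; `find_masked_build_errors` is a hypothetical helper and not part of Pavilion, and the first sample note is abbreviated.

```python
def find_masked_build_errors(status_lines):
    """Return the states recorded after the first BUILD_ERROR.

    An empty list means either no BUILD_ERROR occurred or nothing
    was recorded after it.
    """
    later_states = []
    seen_build_error = False
    for line in status_lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        state = parts[1]
        if seen_build_error:
            later_states.append(state)
        elif state == "BUILD_ERROR":
            seen_build_error = True
    return later_states


sample = [
    "1698862279.017952 BUILD_ERROR Error setting up build directory",
    "1698862369.929424 SCHEDULED Test kicked off (individually) under slurm scheduler with 500 nodes.",
    "1698862386.513605 PREPPING_RUN Converting run template into run script.",
    "1698862386.514956 RUNNING Starting the run script.",
    "1698862386.518336 RUN_ERROR Unknown error while running test. Refer to the kickoff log.",
]
masked = find_masked_build_errors(sample)
print(masked)
```

A non-empty result for a run like test 914 would indicate that the test kept going after its build had already failed.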

Nov 13 '23 21:11 dmageeLANL

That it doesn't trigger a cancel is almost certainly a bug. This has only been popping up with the atomic write issue on the Shasta filesystems though, so it went undetected for quite a while.

Nov 14 '23 17:11 Paul-Ferrell

Right. That's because Cray Shasta systems are the only systems where builder._setup_build_dir fails. So you never see this BUILD_ERROR in other contexts. Easy fix.

Nov 14 '23 17:11 dmageeLANL

Ok, actually. Looking closer at it. I think it's fixed already. See the passage below from builder.py:TestBuilder.build.

                    with lockfile.LockFilePoker(lock):
                        # Attempt to perform the actual build, this shouldn't
                        # raise an exception unless something goes terribly
                        # wrong.
                        # This will also set the test status for
                        # non-catastrophic cases.
                        if not self._build(self.path, cancel_event, test_id, tracker):

                            try:
                                self.path.rename(self.fail_path)
                            except FileNotFoundError as err:
                                tracker.error(
                                    "Failed to move build {} from {} to "
                                    "failure path {}"
                                    .format(self.name, self.path,
                                            self.fail_path), err)
                                try:
                                    self.fail_path.mkdir()
                                except OSError as err2:
                                    tracker.error(
                                        "Could not create fail directory for "
                                        "build {} at {}"
                                        .format(self.name, self.fail_path, err2))
                            if cancel_event is not None:
                                cancel_event.set()

                            return False

If self._build returns False, which it does in the original case (where the status file shows 'Error setting up build directory'), and cancel_event is not None (it's a threading.Event), then cancel_event should get set. Perhaps the version you were using was missing something there, but it should work as far as I can tell. If you can recreate it with the current master, let me know and I'll poke at it.
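The control flow described here can be sketched in isolation. This is a minimal stand-in, not Pavilion's code: `fake_build`, `build`, and `run_if_not_cancelled` are hypothetical names, and the only real API involved is `threading.Event`.

```python
import threading


def fake_build(succeed):
    """Hypothetical stand-in for TestBuilder._build: True on success."""
    return succeed


def build(cancel_event, succeed):
    """Mirror the quoted logic: on failure, set the cancel event and return False."""
    if not fake_build(succeed):
        if cancel_event is not None:
            cancel_event.set()
        return False
    return True


def run_if_not_cancelled(cancel_event):
    """Hypothetical caller: refuse to start the run once a cancel is flagged."""
    if cancel_event.is_set():
        return "BUILD_FAILED"
    return "RUNNING"


cancel_event = threading.Event()
build_ok = build(cancel_event, succeed=False)
state = run_if_not_cancelled(cancel_event)
print(build_ok, state)
```

Under these assumptions the failed build sets the event and the run never starts. That is the behavior expected from current master, in contrast to the SCHEDULED and RUNNING entries seen in the original status file.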

Nov 14 '23 22:11 dmageeLANL
