
v1.12.0 (git hash: 755a8e0) silent failure with concurrent processes and cbMipUserSolution handling.

Open Thell opened this issue 1 month ago • 7 comments

While testing an upgrade from v1.11.0 to v1.12.0, the whole process would just stop, with no errors or anything else to indicate the problem, and it only happens on one particular test instance. While writing up this minimal example I also noticed that a .lp file of the problem does not reproduce the silent failure of the live run, but a .mps file does.

Note: using 2 concurrent processes does not trigger the behavior so I left it at 7.

# A minimal reproducible example of a silent failure in HiGHS v1.12.0 (git hash: 755a8e0).
# Windows 11 powershell $LASTEXITCODE: -1073741819

# Everything stops at around 1 minute to 1.5 minutes.

mre_silent_fail.mps.txt mre_silent_fail.py mre_silent_fail.lp.txt

(MRE script doesn't account for the .txt appended for github upload)

Thell avatar Oct 29 '25 17:10 Thell

An MPS file preserves the column (and row) ordering of a model, but a .LP file doesn't: its column ordering is defined by the order in which variables are encountered when reading the file.

Different column ordering means different solver behaviour.

jajhall avatar Oct 29 '25 18:10 jajhall

As a quick follow-up to the MRE: I have noticed rare, intermittent errors when working with the solutions. For instance:

```
C:\Users\thell\Workspaces\milp-flow\src\milp_flow\optimize_par.py", line 187, in incumbent_manager
    incumbent.solution[:] = e.data_out.mip_solution
                            ~~~~~~~~~~~~~~~~~~^^^
ValueError: could not broadcast input array from shape (0,) into shape (12556,)
```
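For reference, this is the error NumPy raises when the callback hands back an empty solution array; it can be reproduced with plain NumPy, independent of HiGHS (the sizes here just mirror the traceback above):

```python
import numpy as np

incumbent = np.zeros(12556)  # stand-in for the pre-allocated incumbent.solution buffer
mip_solution = np.empty(0)   # as if data_out.mip_solution came back with size 0

try:
    incumbent[:] = mip_solution
except ValueError as err:
    # prints: could not broadcast input array from shape (0,) into shape (12556,)
    print(err)
```

So the traceback is consistent with `data_out.mip_solution` being empty (or already reclaimed) by the time the incumbent manager reads it.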

Sometimes it happens right at the beginning of processing, but other times it can happen several hundred seconds into a problem. This did not happen with v1.11.0, and my guess is that it is perhaps something to do with the modifications for Highs::setSolution().

It makes me wonder what lifetime guarantees there are for the data_out.mip_solution within callback handlers.

I've modified my code to copy every improved solution into a pre-allocated buffer (which gets handed to an 'incumbent_manager' running in its own thread) in the hope of eliminating the issue. In v1.11.0 I was able to pass 'e' to the queue the incumbent manager polls every 0.025 s, and it was always able to process the improved solutions and read data_out.mip_solution without issue.
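The hazard of holding a reference to memory the solver may reuse, versus taking a defensive copy inside the callback, can be sketched in plain NumPy (this is an illustration of the general pattern, not HiGHS's actual buffer management):

```python
import numpy as np

# solver_buffer stands in for the memory backing data_out.mip_solution,
# which the solver might overwrite or reclaim after the callback returns.
solver_buffer = np.arange(5, dtype=float)

view = solver_buffer             # what keeping 'e' on a queue effectively holds
snapshot = solver_buffer.copy()  # defensive copy taken while still in the callback

solver_buffer[:] = 0.0           # solver reuses the buffer later

print(view)      # reflects the overwrite: [0. 0. 0. 0. 0.]
print(snapshot)  # unchanged: [0. 1. 2. 3. 4.]
```

If the lifetime of `data_out.mip_solution` is only guaranteed for the duration of the callback, the snapshot approach is the safe one.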

My first inclination was to say that the changes altered the ordering, so the mip_solution buffer is now being cleared or overwritten sooner, but the original post in this issue makes me question that first thought.

Thell avatar Oct 29 '25 19:10 Thell

I've reproduced what is, presumably, your error. When I run mre_silent_fail.py, HiGHS segfaults, with the last logging output at 71.4 s.

Changes were made to data_out.mip_solution by @mathgeekcoder in #2278

jajhall avatar Oct 30 '25 14:10 jajhall

I confirmed that the segfault isn't happening during the reading or writing of the solution data, by wrapping the related code with output markers and a try block around the incumbent write. Hopefully this helps isolate the issue by ruling out the Python code... 🤷

```python
def cbMIPImprovedSolutionHandler(e):
    solution = e.data_out.mip_solution
    if solution is None or solution.size == 0:
        return
    print("cbMIPImprovedSolutionHandler reading... ", end="")
    # Only a specific clone can update its solution buffer
    np.copyto(solution_buffers[int(e.user_data)], solution, casting="no")
    incumbent_queue.put_nowait((float(e.data_out.objective_function_value), int(e.user_data)))
    print("ok")


def cbMIPUserSolutionHandler(e):
    clone_id = int(e.user_data)
    if not incumbent.provided[clone_id] and is_better(
        incumbent.value, e.data_out.objective_function_value
    ):
        if incumbent.lock.acquire(blocking=False):
            print("cbMIPUserSolutionHandler writing incumbent solution... ", end="")
            try:
                if len(incumbent.solution) != e.data_in.user_solution.shape[0]:
                    logger.warning(
                        f"Size mismatch: incumbent {len(incumbent.solution)} "
                        f"vs expected {e.data_in.user_solution.shape[0]}"
                    )
                    return  # the finally block releases the lock
                np.copyto(e.data_in.user_solution, incumbent.solution, casting="no")
                e.data_in.user_has_solution = True
                incumbent.provided[clone_id] = True
            finally:
                incumbent.lock.release()
            print("ok")
```

and the segfault still happens but all the marker lines end with 'ok'.

Thell avatar Oct 30 '25 20:10 Thell

FYI: I've started looking into this using a debug build of highspy. I can also reproduce in release, but it's hard to diagnose the issue.

After ~30 mins (debug takes longer) I hit a "subscript out of bounds" error, which might be the cause of the seg fault in release. BUT, it's not where you think it might be. It's in HPresolve::link but in heavily nested submips, i.e., RINS -> RENS -> RINS -> presolve.

I'm still diagnosing but might not be easy.

mathgeekcoder avatar Oct 31 '25 00:10 mathgeekcoder

Attempting to solve mre_silent_fail.mps with a single process and a random seed of 4 also triggers the fault.

```python
from highspy import Highs

NUM_CONCURRENT_PROCESSES = 7
model_path = "mre_silent_fail.mps"

# This fails on i == 4
for i in range(NUM_CONCURRENT_PROCESSES):
    print("Testing with random seed", i)
    model = Highs()
    model.readModel(model_path)  # the .mps reproduces the fault
    model.setOptionValue("log_to_console", True)
    model.setOptionValue("mip_feasibility_tolerance", 1e-4)
    model.setOptionValue("mip_heuristic_run_root_reduced_cost", True)
    model.setOptionValue("mip_min_logging_interval", 30)
    model.setOptionValue("mip_rel_gap", 1e-4)
    model.setOptionValue("primal_feasibility_tolerance", 1)
    model.setOptionValue("random_seed", i)
    model.setOptionValue("threads", 1)
    model.setOptionValue("time_limit", 180)
    model.solve()
```
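Since a segfault in the loop above kills the whole script, a harness can run each seed in its own subprocess and read the child's exit code instead. Below is a minimal self-contained sketch of that pattern: the child command here is a stand-in that deliberately aborts on seed 4 (in a real harness it would be a script running the solve-loop body above for a single seed, a name and layout that are assumptions):

```python
import subprocess
import sys

# Stand-in child: seed 4 simulates the crashing run via os.abort();
# the other seeds exit cleanly.
CHILD = "import sys, os; os.abort() if int(sys.argv[1]) == 4 else sys.exit(0)"


def run_seed(seed: int) -> int:
    """Run one seed in its own process and return its exit code."""
    result = subprocess.run([sys.executable, "-c", CHILD, str(seed)])
    return result.returncode


crashed = [seed for seed in range(7) if run_seed(seed) != 0]
# On Windows a segfault surfaces as exit code -1073741819 (0xC0000005);
# on POSIX as a negative signal number (e.g. -11 for SIGSEGV).
print("crashed seeds:", crashed)  # → crashed seeds: [4]
```

This keeps the $LASTEXITCODE-style diagnosis from the original report while letting the remaining seeds run to completion.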

Thell avatar Oct 31 '25 13:10 Thell

Should I change the title of this issue to something more accurate now, since it doesn't actually involve concurrent processes or user solution handling? Something like "v1.12.0 (git hash: 755a8e0) segfault when using a specific random_seed on a specific problem"?

Thell avatar Oct 31 '25 19:10 Thell