v1.12.0 (git hash: 755a8e0) silent failure with concurrent processes and cbMipUserSolution handling.
While testing an upgrade from v1.11.0 to v1.12.0, the whole process would just stop: no errors or anything else to indicate the problem, and it only happens on one particular test instance. While writing up this minimal example I also noticed that a .lp file of the problem does not reproduce the silent failure of the live run, but an .mps of the problem does.
Note: using 2 concurrent processes does not trigger the behavior so I left it at 7.
# A minimal reproducible example of a silent failure in HiGHS v1.12.0 (git hash: 755a8e0).
# Windows 11 PowerShell $LASTEXITCODE: -1073741819 (0xC0000005, access violation)
# Everything stops after around 1 to 1.5 minutes.
mre_silent_fail.mps.txt mre_silent_fail.py mre_silent_fail.lp.txt
(The MRE script doesn't account for the .txt appended for the GitHub upload.)
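For readers who don't open the attachments, the script is roughly shaped like the sketch below (an assumption/simplification of mre_silent_fail.py, not a copy of it: seven worker processes, each building its own Highs instance on the same model file):

```python
# Rough sketch of the attached mre_silent_fail.py (an assumption; only the
# overall shape is shown here, the real script is attached above).
import multiprocessing as mp

import highspy

NUM_CONCURRENT_PROCESSES = 7
MODEL_PATH = "mre_silent_fail.mps"  # the .lp variant did not reproduce the failure


def solve_clone(clone_id: int) -> None:
    # Each worker process builds its own Highs instance on the same model.
    model = highspy.Highs()
    model.readModel(MODEL_PATH)
    model.setOptionValue("random_seed", clone_id)
    model.setOptionValue("threads", 1)
    model.setOptionValue("time_limit", 180)
    model.solve()


if __name__ == "__main__":
    procs = [
        mp.Process(target=solve_clone, args=(i,))
        for i in range(NUM_CONCURRENT_PROCESSES)
    ]
    for p in procs:
        p.start()
    for i, p in enumerate(procs):
        p.join()
        # A negative exit code means the worker died abnormally (access violation).
        print(f"worker {i} exit code: {p.exitcode}")
```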
An MPS file preserves the column (and row) ordering of a model, but an LP file does not: there the column ordering is defined by the order in which columns are encountered when the file is read. Different column ordering means different solver behaviour.
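For anyone who wants to verify the ordering difference from Python, a quick sketch (assuming highspy's getLp() exposes the column names as col_names_; that attribute name is an assumption here, not taken from this issue):

```python
# Sketch: compare column ordering of the two attached files.
# Assumption: the model's column names are available via getLp().col_names_.
import highspy


def column_order(path: str) -> list[str]:
    h = highspy.Highs()
    h.setOptionValue("output_flag", False)
    h.readModel(path)
    return list(h.getLp().col_names_)


mps_cols = column_order("mre_silent_fail.mps")
lp_cols = column_order("mre_silent_fail.lp")
print("same name set:", set(mps_cols) == set(lp_cols))
print("same ordering:", mps_cols == lp_cols)
```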
As a quick follow-up to the MRE, I have noticed that I am experiencing rare, intermittent errors when working with the solutions. For instance:
  File "C:\Users\thell\Workspaces\milp-flow\src\milp_flow\optimize_par.py", line 187, in incumbent_manager
    incumbent.solution[:] = e.data_out.mip_solution
    ~~~~~~~~~~~~~~~~~~^^^
ValueError: could not broadcast input array from shape (0,) into shape (12556,)
Sometimes it happens right at the beginning of processing, but other times it can happen several hundred seconds into a problem. This did not happen with v1.11.0, and my guess is that it is perhaps something to do with the modifications for Highs::setSolution().
It makes me wonder what lifetime guarantees there are for the data_out.mip_solution within callback handlers.
I've modified my code to copy every improved solution into a pre-allocated buffer (which is handed to an 'incumbent_manager' running in its own thread) to, hopefully, eliminate the issue. In v1.11.0 I was able to pass 'e' itself to the incumbent manager's input queue, which it checks every 0.025 s, and it was always able to process the improved solutions and read data_out.mip_solution without issue; the sketch below contrasts the two patterns.
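Concretely, the difference comes down to whether anything backed by 'e' outlives the callback. A minimal sketch of the two patterns (the names here are illustrative, not my actual code):

```python
import numpy as np

# v1.11.0 pattern: hand the event itself to another thread and read
# e.data_out.mip_solution later (worked then, risky if the buffer is reused).
def improved_handler_v111(e, incumbent_queue):
    incumbent_queue.put_nowait(e)  # mip_solution read ~0.025 s later, in another thread


# Current pattern: copy inside the callback so nothing backed by `e` escapes it.
def improved_handler_copy(e, incumbent_queue):
    local = np.array(e.data_out.mip_solution, copy=True)
    incumbent_queue.put_nowait((float(e.data_out.objective_function_value), local))
```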
My first inclination was to say that the changes altered the ordering, so the mip_solution buffer is getting cleared or reused faster now, but the original post in this issue does make me question that first thought.
I've reproduced what is, presumably, your error. When I run mre_silent_fail.py HiGHS segfaults, with the last logging being at 71.4s
Changes were made to data_out.mip_solution by @mathgeekcoder in #2278
I confirmed that the segfault isn't happening during the reading or writing of the solution data by wrapping the related code with output markers and a try block around the incumbent write. Hopefully that helps narrow things down by ruling out the Python code... 🤷
import numpy as np

# solution_buffers, incumbent_queue, incumbent, is_better and logger are
# defined elsewhere in optimize_par.py; only the handlers are shown here.

def cbMIPImprovedSolutionHandler(e):
    solution = e.data_out.mip_solution
    if solution is None or solution.size == 0:
        return
    print("cbMIPImprovedSolutionHandler reading... ", end="")
    # Only a specific clone can update its solution buffer
    np.copyto(solution_buffers[int(e.user_data)], solution, casting="no")
    incumbent_queue.put_nowait((float(e.data_out.objective_function_value), int(e.user_data)))
    print("ok")


def cbMIPUserSolutionHandler(e):
    clone_id = int(e.user_data)
    if not incumbent.provided[clone_id] and is_better(
        incumbent.value, e.data_out.objective_function_value
    ):
        if incumbent.lock.acquire(blocking=False):
            print("cbMIPUserSolutionHandler writing incumbent solution... ", end="")
            try:
                if len(incumbent.solution) != e.data_in.user_solution.shape[0]:
                    logger.warning(
                        f"Size mismatch: incumbent {len(incumbent.solution)} vs expected {e.data_in.user_solution.shape[0]}"
                    )
                    # the finally block below releases the lock
                    return
                np.copyto(e.data_in.user_solution, incumbent.solution, casting="no")
                e.data_in.user_has_solution = True
                incumbent.provided[clone_id] = True
            finally:
                incumbent.lock.release()
            print("ok")
The segfault still happens, but all of the marker lines end with 'ok'.
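For completeness, the handlers above are attached to each clone roughly as in the sketch below (an assumption based on highspy's Pythonic callback interface; the .subscribe() signature is from memory and may not match what optimize_par.py actually does):

```python
# Sketch (assumption): attaching the handlers to one Highs clone via the
# Pythonic callback interface; clone_id and the handlers come from above.
from highspy import Highs

clone_id = 0
model = Highs()
model.readModel("mre_silent_fail.mps")

# user_data carries the clone id, which is why the handlers read e.user_data
model.cbMipImprovedSolution.subscribe(cbMIPImprovedSolutionHandler, clone_id)
model.cbMipUserSolution.subscribe(cbMIPUserSolutionHandler, clone_id)

model.solve()
```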
FYI: I've started looking into this using a debug build of highspy. I can also reproduce it in a release build, but there it's hard to diagnose the issue.
After ~30 mins (the debug build takes longer) I hit a "subscript out of bounds" error, which might be the cause of the segfault in release. BUT, it's not where you might think: it's in HPresolve::link, inside heavily nested sub-MIPs, i.e., RINS -> RENS -> RINS -> presolve.
I'm still diagnosing, but it might not be easy.
Attempting to solve mre_silent_fail.mps in a single process with a random_seed of 4 also causes the fault.
from highspy import Highs

NUM_CONCURRENT_PROCESSES = 7
model_path = "mre_silent_fail.mps"  # the .mps reproduces the fault

# This fails on i == 4
for i in range(NUM_CONCURRENT_PROCESSES):
    print("Testing with random seed", i)
    model = Highs()
    model.readModel(model_path)  # or the .lp if desired
    model.setOptionValue("log_to_console", True)
    model.setOptionValue("mip_feasibility_tolerance", 1e-4)
    model.setOptionValue("mip_heuristic_run_root_reduced_cost", True)
    model.setOptionValue("mip_min_logging_interval", 30)
    model.setOptionValue("mip_rel_gap", 1e-4)
    model.setOptionValue("primal_feasibility_tolerance", 1)
    model.setOptionValue("random_seed", i)
    model.setOptionValue("threads", 1)
    model.setOptionValue("time_limit", 180)
    model.solve()
Should I change the title of this issue to something more accurate, since it doesn't actually involve concurrent processes or user solution handling? Something like "v1.12.0 (git hash: 755a8e0) segfault when using a specific random_seed on a specific problem"?