Longer runtimes for ILP models created with `addVariable` versus `addCol`
For integer linear program models created via highspy v1.7.1, models created with `addCol` seem to solve a bit faster than models created with `addVariable`. The timing data below show that this occurs for models with > 5000 variables, and for several different randomly generated models.

I am surprised by this behavior. I would have expected both the `addCol` and `addVariable` methods to create the same underlying model, and the run times to be the same (up to random variation).

This is not a barrier to my use of HiGHS -- I can simply use `addCol`, as PuLP does. However, I hope that noting it might (1) help other users speed up their models or (2) help the maintainers identify a performance improvement.
The script I used to generate the timing data is below:
"""This script creates an example integer linear program using highspy's `addCol` and `addVariable` methods.
It measures the time taken for the `run` method for models created using the `addCol` versus the
`addVariable` methods for several problem sized and writes the results to a CSV file.
At highspy version 1.7.1, these measurements surprisingly show that models created with `addCol`
solve about 1.5x faster.
"""
import random
import time
import highspy
def create_constraints(h: highspy.Highs, n1: int, n2: int, seed: int):
"""Create some random constraints for the example problem.
Passing the same `seed` will create the same constraints.
This shared function is used below to create the same constraints for both
the `addCol` and `addVariable` models.
"""
random.seed(seed)
for j in range(n2):
ind = [n2 * i + j for i in range(n1)]
val = [random.uniform(0.5, 1.0) for _ in range(n1)]
a_target = random.uniform(60.0, 90.0)
h.addRow(a_target - 0.05, a_target + 0.05, n1, ind, val)
ind = [n2 * i + j for i in range(n1)]
val = [random.uniform(0.5, 1.0) for _ in range(n1)]
b_target = random.uniform(60.0, 90.0)
h.addRow(b_target - 0.05, b_target + 0.05, n1, ind, val)
for i in range(n1):
h.addRow(
-highspy.kHighsInf,
random.uniform(10.0, 100.0),
n2,
[n2 * i + j for j in range(n2)],
[random.uniform(1.0, 10.0) for _ in range(n2)],
)
def create_ilp_addcol(n1: int, n2: int, seed: int) -> highspy.Highs:
"""Create an integer linear program using highspy's `addCol` method."""
h = highspy.Highs()
num_var = n1 * n2
random.seed(seed)
cost_coefs = [random.uniform(0.5, 1.0) for _ in range(num_var)]
### This is the part that is different between the two models ###
for k in range(num_var):
h.addCol(cost_coefs[k], 0.0, highspy.kHighsInf, 0, [], [])
h.changeColIntegrality(k, highspy.HighsVarType.kInteger)
### End of the different part ###
create_constraints(h, n1, n2, seed)
return h
def create_ilp_addvariable(n1: int, n2: int, seed: int) -> highspy.Highs:
"""Create an integer linear program using highspy's `addVariable` method."""
h = highspy.Highs()
num_var = n1 * n2
random.seed(seed)
cost_coefs = [random.uniform(0.5, 1.0) for _ in range(num_var)]
### This is the part that is different between the two models ###
for k in range(num_var):
h.addVariable(lb=0, name=f"z{k}", type=highspy.HighsVarType.kInteger)
h.changeColsCost(num_var, range(num_var), cost_coefs)
### End of the different part ###
create_constraints(h, n1, n2, seed)
return h
def main():
timing_results_file = "timing_results.csv"
with open(timing_results_file, "w") as f:
f.write("num_var,solve_time_addcol,solve_time_addvariable\n")
for n1, n2 in [(250, 1), (500, 2), (1000, 5), (2000, 5), (2000, 7), (2000, 10)]:
for seed in (986, 43985, 28):
num_var = n1 * n2
prob = create_ilp_addcol(n1, n2, seed)
prob.setOptionValue("time_limit", 100.0)
prob.setOptionValue("mip_rel_gap", 0.01)
t1 = time.perf_counter()
prob.run()
solve_time_addcol = time.perf_counter() - t1
prob = create_ilp_addvariable(n1, n2, seed)
prob.setOptionValue("time_limit", 100.0)
prob.setOptionValue("mip_rel_gap", 0.01)
t1 = time.perf_counter()
prob.run()
solve_time_addvariable = time.perf_counter() - t1
with open(timing_results_file, "a") as f:
f.write(f"{num_var},{solve_time_addcol},{solve_time_addvariable}\n")
if __name__ == "__main__":
main()
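To make the resulting CSV easier to eyeball, the per-row slowdown can be computed directly; a small stdlib-only sketch (the `slowdown_ratios` helper is mine, not part of the original script), assuming the column names written by the header line above:

```python
import csv


def slowdown_ratios(path):
    """Return solve_time_addvariable / solve_time_addcol for each row of the
    timing CSV produced by the script above."""
    ratios = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            ratios.append(
                float(row["solve_time_addvariable"]) / float(row["solve_time_addcol"])
            )
    return ratios
```

A ratio consistently above 1.0 for the larger models is what the "about 1.5x faster" claim above refers to.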
> I am surprised by this behavior. I would have expected that both the `addCol` and `addVariable` methods would create the same underlying model, and that the run times would be the same (up to random variations).
In what way do you believe that the underlying model they create is different?
`addVariable` (see https://github.com/ERGO-Code/HiGHS/blob/0095d61a2e37df88cc8616b374350c5be260e862/src/highspy/highs.py#L214) does a few scalar operations in Python before calling `Highs::addCol` in C++, whereas `addCol` in Python (as you've used it) is bound directly to `Highs::addCol`. Since both then perform the same few operations in `Highs::addCol`, maybe these extra Python operations make the difference.

Coincidentally, today I discussed `addVariable` with the author of https://github.com/ERGO-Code/HiGHS/blob/0095d61a2e37df88cc8616b374350c5be260e862/src/highspy/highs.py, and maybe this difference in behaviour will be eliminated as a consequence.
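To give a feel for what "a few scalar operations in Python" can cost per call, here is a stdlib-only sketch of how one might measure the overhead of a keyword-argument wrapper against a direct-style call. The `fake_add_col` and `fake_add_variable` functions are hypothetical stand-ins, not the real highspy bindings, so this illustrates only the measurement approach, not HiGHS itself:

```python
import time


def fake_add_col(cost, lb, ub, num_nz, indices, values):
    """Stand-in for a method bound directly to C++; no Python-side work."""
    return 0


def fake_add_variable(lb=0.0, ub=float("inf"), obj=0.0, type=None, name=None):
    """Stand-in for a Python wrapper that does a little argument handling
    before delegating to the direct call."""
    return fake_add_col(obj, lb, ub, 0, [], [])


def time_calls(fn, n):
    """Time `n` repeated calls of the zero-argument callable `fn`."""
    t0 = time.perf_counter()
    for _ in range(n):
        fn()
    return time.perf_counter() - t0


n = 200_000
t_direct = time_calls(lambda: fake_add_col(1.0, 0.0, float("inf"), 0, [], []), n)
t_wrapped = time_calls(lambda: fake_add_variable(obj=1.0, name="z"), n)
print(f"direct: {t_direct:.3f}s  wrapped: {t_wrapped:.3f}s")
```

Note that any overhead measured this way affects only model *building*; as discussed below, the surprising part of this issue is that the `run` times differ.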
@jajhall
> In what way do you believe that the underlying model they create is different?
The difference I observed is that the `run` method takes longer for models created using `addVariable`. Note that the timing data in my original comment are for calls to `run` on the resulting models, and do not include the time to create the models with `addCol` and `addVariable`.

The `run` calls produce log outputs that are identical except for the timing information (examples below). If this happened just once, I'd attribute it to interruptions of the process, differences in CPU temperature, or something similar. But the `run` call is repeatably slower when the model was built with `addVariable`.

My guess is that building the model with `addVariable` somehow makes each LP iteration slower, because the logs show the same number of LP iterations for both models. I don't know enough about the internals of HiGHS to guess how that could happen.
Example logs
Created with `addCol`:

```
Running HiGHS 1.7.1 (git hash: 0c240d8): Copyright (c) 2024 HiGHS under MIT licence terms
Coefficient ranges:
  Matrix [5e-01, 1e+01]
  Cost   [5e-01, 1e+00]
  Bound  [0e+00, 0e+00]
  RHS    [1e+01, 1e+02]
Presolving model
2020 rows, 20000 cols, 60000 nonzeros  0s
2020 rows, 20000 cols, 60000 nonzeros  0s

Solving MIP model with:
   2020 rows
   20000 cols (644 binary, 19356 integer, 0 implied int., 0 continuous)
   60000 nonzeros

        Nodes      |    B&B Tree     |            Objective Bounds              |  Dynamic Constraints |       Work
     Proc. InQueue |  Leaves   Expl. | BestBound       BestSol              Gap |   Cuts   InLp Confl. | LpIters     Time

         0       0         0   0.00%   0               inf                  inf        0      0      0         0     0.1s
         0       0         0   0.00%   432.0800779     inf                  inf        0      0      2        38     0.2s
 L       0       0         0   0.00%   432.2513314     433.3646117        0.26%     1512     86    146       331     6.4s

Solving report
  Status            Optimal
  Primal bound      433.364611672
  Dual bound        432.251331421
  Gap               0.257% (tolerance: 1%)
  Solution status   feasible
                    433.364611672 (objective)
                    0 (bound viol.)
                    4.27213819876e-13 (int. viol.)
                    0 (row viol.)
  Timing            6.42 (total)
                    0.10 (presolve)
                    0.00 (postsolve)
  Nodes             1
  LP iterations     540 (total)
                    0 (strong br.)
                    293 (separation)
                    209 (heuristics)
```
Created with `addVariable`:

```
Running HiGHS 1.7.1 (git hash: 0c240d8): Copyright (c) 2024 HiGHS under MIT licence terms
Coefficient ranges:
  Matrix [5e-01, 1e+01]
  Cost   [5e-01, 1e+00]
  Bound  [0e+00, 0e+00]
  RHS    [1e+01, 1e+02]
Presolving model
2020 rows, 20000 cols, 60000 nonzeros  0s
2020 rows, 20000 cols, 60000 nonzeros  0s

Solving MIP model with:
   2020 rows
   20000 cols (644 binary, 19356 integer, 0 implied int., 0 continuous)
   60000 nonzeros

        Nodes      |    B&B Tree     |            Objective Bounds              |  Dynamic Constraints |       Work
     Proc. InQueue |  Leaves   Expl. | BestBound       BestSol              Gap |   Cuts   InLp Confl. | LpIters     Time

         0       0         0   0.00%   0               inf                  inf        0      0      0         0     0.1s
         0       0         0   0.00%   432.0800779     inf                  inf        0      0      2        38     0.2s
 L       0       0         0   0.00%   432.2513314     433.3646117        0.26%     1512     86    146       331    10.4s

Solving report
  Status            Optimal
  Primal bound      433.364611672
  Dual bound        432.251331421
  Gap               0.257% (tolerance: 1%)
  Solution status   feasible
                    433.364611672 (objective)
                    0 (bound viol.)
                    4.27213819876e-13 (int. viol.)
                    0 (row viol.)
  Timing            10.43 (total)
                    0.10 (presolve)
                    0.00 (postsolve)
  Nodes             1
  LP iterations     540 (total)
                    0 (strong br.)
                    293 (separation)
                    209 (heuristics)
```
I've reproduced the timing difference for `run()` that you observe. However, when writing out the respective models to verify that they are identical, the meaningful difference in the two running times disappeared.

That differences in model building can cause significantly different run times for the same model is intriguing, but not something that I'm in a position to investigate at this time.
Apologies, I was travelling for the last two weeks. I can replicate this; however, it's not `addVariable` itself that's causing the performance hit. It's actually the variable naming, i.e., `h.addVariable(lb=0, name=f"z{k}", type=highspy.HighsVarType.kInteger)`.

If you remove the `name` parameter, or if you add `h.passColName(k, f"z{k}")` in the `create_ilp_addcol` function, then the performance is similar.

Note 1: I've not yet investigated why adding variable names impacts performance.

Note 2: We are aware of some 'model building' performance issues with `addVariable`.
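Based on that finding, a user-side workaround is to build the variables without names; a minimal sketch of the unnamed variant of the `addVariable` loop from the script above (the `build_unnamed_int_vars` helper name is mine, and `h` is assumed to be a `highspy.Highs` instance):

```python
def build_unnamed_int_vars(h, num_var, cost_coefs, var_type):
    """Add `num_var` unnamed integer variables with the given cost coefficients.

    Per the comment above, omitting the `name` argument of `addVariable`
    avoids the solve-time slowdown. `var_type` would be
    highspy.HighsVarType.kInteger in the original script.
    """
    for _ in range(num_var):
        h.addVariable(lb=0, type=var_type)  # no name= argument
    h.changeColsCost(num_var, range(num_var), cost_coefs)
```

Alternatively, if the names are actually needed, keeping them and accepting the (as yet unexplained) cost is the other option.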