
Interrupt and resume

Open fsmosca opened this issue 4 years ago • 8 comments

Is there a way to resume the optimization after it is interrupted? Here, resuming means that the optimizer takes the history of past trials into account.

I also thought about saving the tried parameters and their objective values to a file, then loading that file when resuming the optimization. But does NOMAD support pre-loading parameters and objective values?

fsmosca avatar Apr 29 '21 09:04 fsmosca

Hi @fsmosca, the simplest way to address this is to use the CACHE_FILE parameter. The evaluations will be saved in a file. When NOMAD is restarted, the file is read, and the solving uses the best points found in it. This method will ensure that a point is not evaluated again. However, the MADS algorithm will restart using the original mesh size provided by the parameters, the number of evaluations counted is reset, etc. If you need NOMAD to restart from the exact state where it was when it was interrupted, you might want to look at advanced parameters HOT_RESTART_READ_FILES and HOT_RESTART_WRITE_FILES.

montplaisir avatar Apr 29 '21 16:04 montplaisir
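
The two options described above differ only in the parameter list passed to the solver. The following is a minimal sketch, not an official recipe: `build_params` is a hypothetical helper, and the parameter strings simply mirror the ones used in the example script later in this thread.

```python
def build_params(cache_file, hot_restart=False, max_bb_eval=100):
    """Return a NOMAD parameter list (hypothetical helper, not part of PyNomad).

    With hot_restart=False only the cache is reused: points are not
    re-evaluated, but the mesh size and evaluation counters start over.
    With hot_restart=True the hot restart parameters are added as well.
    """
    params = [
        "BB_OUTPUT_TYPE OBJ",
        "BB_INPUT_TYPE * R",
        f"MAX_BB_EVAL {max_bb_eval}",
        f"CACHE_FILE {cache_file}",
    ]
    if hot_restart:
        params.append("HOT_RESTART_READ_FILES True")
        params.append("HOT_RESTART_WRITE_FILES True")
    return params


# Show both variants side by side.
for flag in (False, True):
    print(f"hot_restart={flag}")
    for line in build_params("cache.txt", hot_restart=flag):
        print("   ", line)
```

The resulting list would be passed as the last argument of `PyNomad.optimize(objective_f, init_opt_param, lb, ub, params)`, as in the script below.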

Thanks, I will try your suggestions.

fsmosca avatar Apr 29 '21 16:04 fsmosca

All right, here is an example of interrupt and resume using HOT_RESTART. It is in Python. Please check whether it is right.

Interrupt/Resume with HOT_RESTART

#!/usr/bin/python


"""
interrupt_resume.py

When an optimization is interrupted, you can resume it by setting the
parameters HOT_RESTART_READ_FILES and HOT_RESTART_WRITE_FILES to True.
Also define a CACHE_FILE where past evaluations will be saved.

Note:
NOMAD generates hotrestart.txt. Delete that file if your cache file has not
been created yet, to avoid a segmentation fault.
"""


import PyNomad

 
def objective_f(opt_param):
    """
    Booth function:
    f = (x + 2*y - 7)**2 + (2*x + y - 5)**2
    f is 0 at x=1, y=3
    Ref: https://en.wikipedia.org/wiki/Test_functions_for_optimization

    opt_param: The point to evaluate; coordinates are read with get_coord().
    """
    x = opt_param.get_coord(0)
    y = opt_param.get_coord(1)

    f = (x + 2*y - 7)**2 + (2*x + y - 5)**2

    opt_param.setBBO(str(f).encode("UTF-8"))

    return 1 # 1: success 0: failed evaluation


if __name__ == "__main__":
    # params options
    bb_output_type = 'OBJ'
    max_bb_eval = 100
    bb_input_type = '* R'
    max_eval = 5000
    cache_fn = 'hot_restart_cache.txt'
    restart = True

    params = [
        f'BB_OUTPUT_TYPE {bb_output_type}',
        f'MAX_BB_EVAL {max_bb_eval}',
        f'BB_INPUT_TYPE {bb_input_type}',
        f'MAX_EVAL {max_eval}',
        f'CACHE_FILE {cache_fn}',
        f'HOT_RESTART_READ_FILES {restart}',
        f'HOT_RESTART_WRITE_FILES {restart}'
    ]
    
    # Define param init and limits.
    init_opt_param = [0., 0.]
    lb = [-10., -10.]
    ub = [10., 10.]
    
    # Start the optimization.
    best_param, best_value, _, num_evals, num_iters, _ = PyNomad.optimize(objective_f, init_opt_param, lb, ub, params)
    
    print()
    print(f'best param : {best_param}')
    print(f'best value : {best_value}')
    print(f'num evals  : {num_evals}')
    print(f'num_iters  : {num_iters}')

Interrupted after 16 evaluations

python interrupt_resume.py

BBE OBJ
1  74        *
3  74       
3  90       
5 290       
5   2        *
6 650       
7   5       
8  41       
9   5       
10  20       
12  80       
12 180       
14 104       
14 164       
15   0        *
16  18       
^C
NOMAD caught User interruption.
Please wait...
A termination criterion is reached: Ctrl-C (Base)
Save information for hot restart.
Write hot restart file.

Best feasible solution:     #674 ( 1 3 )        Evaluation OK    f =   0                         h =   0                     

Best infeasible solution:   Undefined.

Blackbox evaluations:        16
Total model evaluations:     732
Cache hits:                  2
Total number of evaluations: 18

best param : [1.0, 3.0]
best value : 0.0
num evals  : 16
num_iters  : 0

Contents of cache file hot_restart_cache.txt

CACHE_HITS 2
BB_OUTPUT_TYPE OBJ
(  -3                        1                      ) EVAL_OK ( 164.0 )
(  -2                       -2                      ) EVAL_OK ( 290.0 )
(  -2                        2                      ) EVAL_OK ( 74.0 )
(  -2                        6                      ) EVAL_OK ( 18.0 )
(   0                        0                      ) EVAL_OK ( 74.0 )
(   1                        1                      ) EVAL_OK ( 20.0 )
(   1                        2                      ) EVAL_OK ( 5.0 )
(   1                        3                      ) EVAL_OK ( 0.0 )
(   1                        7                      ) EVAL_OK ( 80.0 )
(   2                       -2                      ) EVAL_OK ( 90.0 )
(   2                        2                      ) EVAL_OK ( 2.0 )
(   2                        3                      ) EVAL_OK ( 5.0 )
(   3                       -3                      ) EVAL_OK ( 104.0 )
(   3                        4                      ) EVAL_OK ( 41.0 )
(   7                        3                      ) EVAL_OK ( 180.0 )
(   8                        8                      ) EVAL_OK ( 650.0 )

Resume

python interrupt_resume.py

Read hot restart file /home/username/mynomad/nomad/interfaces/PyNomad/./hotrestart.txt
BBE OBJ
18 290       
18 290       
20  50       
20  50       
22   9       
22   9       
23  41       
25  18       
25  18       
26   2       
27   1.625   
28   1.625   
29   0.6473  
30   0.8573  
31   0.2421  
32   0.617   
33   0.2061  
34   0.225   
35   0.1025  
36   0.0986  
37   0.2061  
38   0.0305  
39   0.0845  
40   0.0234  
41   0.0317  
42   0.0117  
43   0.0117  
44   0.0045  
45   0.009   
46   0.0017  
47   0.0026  
48   0.0005  
49   0.0018  
50   0.0002  
51   0.0009  
53   0.4905  
53   0.4905  
55   4.4105  
55   4.4105  
57   0.08    

BBE OBJ
57   0.08    
59   0.464   
59   0.464   
61   0.02    
61   0.02    
63   0.116   
63   0.116   
64   0.000113
65   0.000313
66   0.000056
67   0.000115
68   0.000036
69   0.000038
70   0.000014
71   0.000024
72   0.000007
73   0.000013
74   0.000005
75   0.000005
76   0.000002
77   0.000003
78   0.000001
80   0.025625
79   0.025625
82   0.005625
82   0.005625
83   0.000002
84   0.000001
86   0.00072 
86   0.00072 
88   0.005776
88   0.005776
89   0.000001
90   0.0     
91   0.0     
92   0.0     
93   0.0     
94   0.0     
96   0.000446
96   0.000446

BBE OBJ
98   0.000558
98   0.000558
99   0.0     
100   0.0     
A termination criterion is reached: Max number of blackbox evaluations (Eval Global) No more points to evaluate 100
Save information for hot restart.
Write hot restart file.

Best feasible solution:     #7 ( 1 3 )  Evaluation OK    f =   0                         h =   0                     

Best infeasible solution:   Undefined.

Blackbox evaluations:        100
Total model evaluations:     1892
Cache hits:                  22
Total number of evaluations: 122

best param : [1.0, 3.0]
best value : 0.0
num evals  : 100
num_iters  : 0

Final contents of cache file.

CACHE_HITS 22
BB_OUTPUT_TYPE OBJ
(  -4                        0                      ) EVAL_OK ( 290.0 )
(  -3                        1                      ) EVAL_OK ( 164.0 )
(  -2                       -2                      ) EVAL_OK ( 290.0 )
(  -2                        2                      ) EVAL_OK ( 74.0 )
(  -2                        6                      ) EVAL_OK ( 18.0 )
(  -2                        8                      ) EVAL_OK ( 50.0 )
(  -1                        2                      ) EVAL_OK ( 41.0 )
(   0                        0                      ) EVAL_OK ( 74.0 )
(   0                        2                      ) EVAL_OK ( 18.0 )
(   0                        4                      ) EVAL_OK ( 2.0 )
(   0                        5                      ) EVAL_OK ( 9.0 )
(   0.25                     3.25                   ) EVAL_OK ( 1.625 )
(   0.43999999999999994671   3.6899999999999995026  ) EVAL_OK ( 0.8572999999999981 )
(   0.5                      3.4900000000000002132  ) EVAL_OK ( 0.49050000000000016 )
(   0.51000000000000000888   2.5                    ) EVAL_OK ( 4.410500000000001 )
(   0.64000000000000001332   3.329999999999999627   ) EVAL_OK ( 0.24209999999999982 )
(   0.69000000000000005773   2.9399999999999999467  ) EVAL_OK ( 0.6472999999999993 )
(   0.80000000000000004441   3.1200000000000001066  ) EVAL_OK ( 0.07999999999999964 )
(   0.87000000000000010658   3.1199999999999996625  ) EVAL_OK ( 0.031700000000000034 )
(   0.88000000000000000444   2.7999999999999998224  ) EVAL_OK ( 0.46400000000000086 )
(   0.89000000000000012434   3.2000000000000001776  ) EVAL_OK ( 0.08450000000000052 )
(   0.89000000000000001332   3.2800000000000002487  ) EVAL_OK ( 0.2061000000000002 )
(   0.9000000000000000222    3.0600000000000000533  ) EVAL_OK ( 0.019999999999999928 )
(   0.93999999999999994671   2.8999999999999999112  ) EVAL_OK ( 0.11600000000000016 )
(   0.94999999999999995559   3.0249999999999999112  ) EVAL_OK ( 0.005625000000000027 )
(   0.95000000000000006661   3.3900000000000001243  ) EVAL_OK ( 0.6170000000000007 )
(   0.96999999999999997335   3.010000000000000675   ) EVAL_OK ( 0.0025999999999998715 )
(   0.96999999999999997335   3.0299999999999998046  ) EVAL_OK ( 0.0017999999999999765 )
(   0.97000000000000008438   3.0899999999999998579  ) EVAL_OK ( 0.023399999999999855 )
(   0.9749999999999999778    2.9500000000000001776  ) EVAL_OK ( 0.02562499999999993 )
(   0.97999999999999998224   3.0158000000000000362  ) EVAL_OK ( 0.0007202000000000124 )
(   0.98419999999999996376   2.9799999999999999822  ) EVAL_OK ( 0.0057762000000000134 )
(   0.98750000000000004441   3.0075000000000002842  ) EVAL_OK ( 0.00031250000000000445 )
(   0.98999999999999999112   2.999299999999999855   ) EVAL_OK ( 0.0005584499999999901 )
(   0.98999999999999999112   3                      ) EVAL_OK ( 0.0004999999999999787 )
(   0.99550000000000005151   3.0030000000000005578  ) EVAL_OK ( 3.8249999999994906e-05 )
(   0.99749999999999994227   2.9975000000000000533  ) EVAL_OK ( 0.00011250000000000852 )
(   0.99770000000000003126   3.0009000000000001229  ) EVAL_OK ( 1.3940000000001637e-05 )
(   0.99780000000000002025   3.0034999999999993925  ) EVAL_OK ( 2.384999999998742e-05 )
(   0.99830000000000007621   3.0020000000000002238  ) EVAL_OK ( 7.250000000002267e-06 )
(   0.99929999999999996607   3.0099999999999997868  ) EVAL_OK ( 0.0004464499999999693 )
(   0.99939999999999962199   3.0002999999999997449  ) EVAL_OK ( 8.100000000026193e-07 )
(   0.99960000000000004405   3.0004999999999997229  ) EVAL_OK ( 4.4999999999950124e-07 )
(   0.99970000000000003304   3.000100000000000211   ) EVAL_OK ( 2.5999999999967627e-07 )
(   0.99970000000000003304   3.000300000000000189   ) EVAL_OK ( 1.799999999996939e-07 )
(   0.99990000000000001101   2.9974000000000002863  ) EVAL_OK ( 3.592999999998973e-05 )
(   0.99990000000000001101   2.999499999999999833   ) EVAL_OK ( 1.700000000000425e-06 )
(   0.99990000000000001101   3.000199999999999978   ) EVAL_OK ( 9.000000000011341e-08 )
(   0.99990000000000001101   3.0011000000000001009  ) EVAL_OK ( 5.2200000000012485e-06 )
(   1                        1                      ) EVAL_OK ( 20.0 )
(   1                        2                      ) EVAL_OK ( 5.0 )
(   1                        3                      ) EVAL_OK ( 0.0 )
(   1                        3.000199999999999978   ) EVAL_OK ( 1.999999999997783e-07 )
(   1                        7                      ) EVAL_OK ( 80.0 )
(   1.0000750000000000473    2.9999999999999995559  ) EVAL_OK ( 2.8124999999735678e-08 )
(   1.000099999999999989     2.9996000000000000441  ) EVAL_OK ( 5.300000000000165e-07 )
(   1.000199999999999978     2.9998999999999993449  ) EVAL_OK ( 8.999999999958049e-08 )
(   1.0005999999999999339    3.001100000000000545   ) EVAL_OK ( 1.3130000000003103e-05 )
(   1.0006999999999999229    2.9900000000000002132  ) EVAL_OK ( 0.0004464499999999693 )
(   1.000700000000000145     2.9993000000000002991  ) EVAL_OK ( 9.799999999991624e-07 )
(   1.0008000000000001339    2.999800000000000022   ) EVAL_OK ( 2.1200000000007763e-06 )
(   1.0008999999999999009    2.9987000000000003652  ) EVAL_OK ( 3.1399999999992195e-06 )
(   1.0015999999999998238    2.9990999999999998771  ) EVAL_OK ( 5.330000000000025e-06 )
(   1.0043999999999999595    2.9944000000000001727  ) EVAL_OK ( 5.6479999999999635e-05 )
(   1.0068999999999999062    2.9969000000000005635  ) EVAL_OK ( 0.00011497999999999999 )
(   1.0099999999999997868    2.9499999999999997335  ) EVAL_OK ( 0.009000000000000202 )
(   1.0100000000000000089    2.9900000000000002132  ) EVAL_OK ( 0.00019999999999999147 )
(   1.0100000000000000089    3.000700000000000145   ) EVAL_OK ( 0.0005584499999999901 )
(   1.0158000000000000362    3.0200000000000000178  ) EVAL_OK ( 0.0057762000000000134 )
(   1.0199999999999997957    2.9900000000000002132  ) EVAL_OK ( 0.0008999999999999616 )
(   1.0200000000000000178    2.9700000000000001954  ) EVAL_OK ( 0.0016999999999999275 )
(   1.0200000000000000178    2.9841999999999999638  ) EVAL_OK ( 0.0007202000000000124 )
(   1.0249999999999999112    3.0499999999999998224  ) EVAL_OK ( 0.02562499999999993 )
(   1.0400000000000000355    3.0100000000000002309  ) EVAL_OK ( 0.011700000000000035 )
(   1.0500000000000000444    2.9599999999999999645  ) EVAL_OK ( 0.004500000000000075 )
(   1.0500000000000000444    2.9750000000000000888  ) EVAL_OK ( 0.005625000000000027 )
(   1.0600000000000000533    3.1000000000000000888  ) EVAL_OK ( 0.11600000000000016 )
(   1.0699999999999998401    2.9199999999999999289  ) EVAL_OK ( 0.01169999999999993 )
(   1.1000000000000000888    2.9399999999999999467  ) EVAL_OK ( 0.020000000000000143 )
(   1.1000000000000000888    2.9700000000000001954  ) EVAL_OK ( 0.030500000000000048 )
(   1.1000000000000000888    3.0500000000000002665  ) EVAL_OK ( 0.10250000000000042 )
(   1.1099999999999998757    2.7200000000000001954  ) EVAL_OK ( 0.2060999999999994 )
(   1.1200000000000001066    3.2000000000000001776  ) EVAL_OK ( 0.46400000000000086 )
(   1.1999999999999999556    2.8799999999999998934  ) EVAL_OK ( 0.07999999999999964 )
(   1.2099999999999999645    2.7700000000000004619  ) EVAL_OK ( 0.0985999999999997 )
(   1.25                     2.25                   ) EVAL_OK ( 1.625 )
(   1.25                     2.9500000000000006217  ) EVAL_OK ( 0.22500000000000134 )
(   1.4899999999999999911    3.5                    ) EVAL_OK ( 4.4105000000000025 )
(   1.5                      2.5099999999999997868  ) EVAL_OK ( 0.49050000000000016 )
(   2                       -2                      ) EVAL_OK ( 90.0 )
(   2                        1                      ) EVAL_OK ( 9.0 )
(   2                        2                      ) EVAL_OK ( 2.0 )
(   2                        3                      ) EVAL_OK ( 5.0 )
(   2                        4                      ) EVAL_OK ( 18.0 )
(   3                       -3                      ) EVAL_OK ( 104.0 )
(   3                        4                      ) EVAL_OK ( 41.0 )
(   4                       -2                      ) EVAL_OK ( 50.0 )
(   6                        6                      ) EVAL_OK ( 290.0 )
(   7                        3                      ) EVAL_OK ( 180.0 )
(   8                        8                      ) EVAL_OK ( 650.0 )

If this is right, I would like to contribute this code as an example of interrupt and resume for the Python interface so other users can learn from it. Alternatively, I suggest that the NOMAD team create such an example in this repository.

Questions

  • What is h?
  • Why is num_iters zero? It is returned by: best_param, best_value, _, num_evals, num_iters, _ = PyNomad.optimize(objective_f, init_opt_param, lb, ub, params)
  • In the code I set max_eval = 5000. Is Total model evaluations related to max_eval? If so, how should I estimate a good value for max_eval?

fsmosca avatar Apr 30 '21 02:04 fsmosca

Thank you for your suggested example, it seems to work fine! We will discuss if we add it to our examples. Would you provide us with your name, so we can add you to the contributors?

Here are answers to your questions:

  • What is h? - h is the value of the constraint violation function (for the progressive barrier). When h = 0, it means that the point is feasible, otherwise, it is infeasible.
  • Why is num_iters zero? - I would have to look into it. Probably the information is not properly updated.
  • MAX_EVAL limits the number of blackbox evaluations plus the number of cache hits (i.e. points that NOMAD wants to evaluate but that are already in the cache). The model evaluations are a separate counter: the number of evaluations done on models, for example when using QUAD_MODEL_SEARCH. The only parameter for this counter is QUAD_MODEL_MAX_EVAL, which limits model evaluations each time a model is used. It does not limit the total number of model evaluations.

montplaisir avatar Apr 30 '21 16:04 montplaisir
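
The counter described in the last answer can be checked against the two run summaries posted earlier in this thread. This is a small sketch of the arithmetic only. It assumes that the "Total number of evaluations" line in NOMAD's summary is the counter that MAX_EVAL applies to, and `total_evaluations` is a hypothetical helper.

```python
def total_evaluations(bb_evals, cache_hits):
    """Blackbox evaluations plus cache hits; model evaluations are not included."""
    return bb_evals + cache_hits


# (blackbox evaluations, cache hits) copied from the two summaries in this thread.
runs = {
    "interrupted run": (16, 2),
    "resumed run": (100, 22),
}

for name, (bb_evals, cache_hits) in runs.items():
    print(f"{name}: {total_evaluations(bb_evals, cache_hits)}")
```

Both sums agree with the "Total number of evaluations" values reported in the logs (18 and 122), while the model evaluation counts (732 and 1892) do not enter the total.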

My name is Ferdinand Mosca. Thanks for the definitions, I will do some research about it.

BTW, I came across this doc while searching for how the poll step works; it has nice illustrations. If you have a link to something like that, you could also add it to this site.

fsmosca avatar May 03 '21 02:05 fsmosca

Hi Ferdinand, thank you for the suggestion for the documentation. We currently do not have such high-level information online. It is in our plans. Viviane

montplaisir avatar May 06 '21 12:05 montplaisir

I have tried using NOMAD to optimize the search parameters of a computer chess engine. There are two engines, base_engine and test_engine. The base engine uses the best parameter values found so far, while the test engine uses the parameter values suggested by NOMAD. The objective is the result of a 100-game engine-vs-engine match. Because NOMAD minimizes, the result is converted to 1 - result before it is sent to NOMAD.
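
The pairs in the log below (for example actual 0.51, minimized 0.49) are consistent with sending 1 - result to the optimizer. Here is a minimal sketch of that conversion. `to_minimized` is a hypothetical name, and the assumption is that the match result is a score between 0 and 1 where higher is better for the test engine.

```python
def to_minimized(match_score):
    """Convert a match score in [0, 1] (higher is better) into a value to minimize."""
    if not 0.0 <= match_score <= 1.0:
        raise ValueError("match score must be between 0 and 1")
    return 1.0 - match_score


# Scores taken from the log in this comment.
for score in (0.51, 0.515, 0.5, 0.445):
    print(f"actual result: {score}, minimized result: {to_minimized(score)}")
```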

param to be optimized: ['LmrFactor', 'FutilityMargin', 'QsearchFutilityMargin']
base engine init param values: [50, 30, 50]
optimizer init param values: [50, 30, 50]

suggested param values: [50, 30, 50]
base engine param: [50, 30, 50]
actual result: 0.51, minimized result: 0.49
match done in 416.7s
BBE BLK_SIZE FRAME_CENTER OBJ
1 1 (   0   0   0 )     *
suggested param values: [50, 40, 70]
base engine param: [50, 30, 50]
actual result: 0.515, minimized result: 0.485
match done in 415.2s
2 1 (  50  30  50 )     *
suggested param values: [50, 70, 130]
base engine param: [50, 30, 50]
actual result: 0.5, minimized result: 0.5
match done in 422.6s
3 1 (  50  40  70 )    
suggested param values: [50, 40, 90]
base engine param: [50, 30, 50]
actual result: 0.445, minimized result: 0.5549999999999999
match done in 431.1s
4 1 (  50  40  70 )   1
suggested param values: [50, 40, 120]
base engine param: [50, 30, 50]
actual result: 0.47, minimized result: 0.53
match done in 424.4s
suggested param values: [50, 60, 70]
base engine param: [50, 30, 50]

...

I have set the cache file, but so far no cache file has been saved in the folder. It turns out the data are only written to the file if I interrupt the optimization, for example with Ctrl+C, or when the optimization finishes normally. If I am hit by a power interruption, can NOMAD save the optimization data to the file? I guess not, as the computer could no longer process the data. I would like to suggest that NOMAD save the optimization data after every evaluation is completed, or after I send the objective value. That way, if I am hit by a power interruption, the previous data are still there.

fsmosca avatar May 08 '21 06:05 fsmosca
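
Until NOMAD writes the cache periodically, one user-side stopgap is to record every evaluation from inside the blackbox function itself. The sketch below is an assumption-laden illustration, not a NOMAD feature: `append_evaluation` is a hypothetical helper, and the line layout (coordinates followed by the objective value) is chosen here for convenience.

```python
import os
import tempfile


def append_evaluation(log_path, point, value):
    """Append one evaluated point and its objective value, then force it to disk."""
    coords = " ".join(repr(float(c)) for c in point)
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{coords} {value!r}\n")
        log.flush()
        os.fsync(log.fileno())  # ask the OS to write the data out now


# Demo with a temporary file; inside a PyNomad objective function the call
# would go just before setBBO, e.g. append_evaluation("evals.txt", [x, y], f).
demo_path = os.path.join(tempfile.mkdtemp(), "evals.txt")
append_evaluation(demo_path, [0, 0], 74.0)
append_evaluation(demo_path, [1, 3], 0.0)
with open(demo_path, encoding="utf-8") as log:
    print(log.read(), end="")
```

Flushing and calling `os.fsync` after each line costs little next to a match that takes several minutes, and it keeps completed evaluations on disk even if the process dies without a clean shutdown.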

Hi Ferdinand,

As you noticed, the cache file is not periodically saved. We could implement that for the next version. We are also currently working on ways that NOMAD may suggest points and then stop, so it would not have to wait for the evaluations. When the evaluations are available, NOMAD could be updated and continue by suggesting next points.

For now, as a workaround, you can use the HISTORY_FILE. It lists the evaluated points with their evaluations, in a format similar to the cache file, but slightly different. If you can convert that history file to an updated cache file, you would lose other algorithmic information (for instance, the mesh size where the algorithm is at), but you would keep information from previously evaluated points. We are also developing a way to use HISTORY_FILE seamlessly; it will be available soon.

I hope this helps,

Viviane

montplaisir avatar May 10 '21 15:05 montplaisir
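
The conversion suggested in the workaround above could look roughly like the sketch below. Everything about the input format is an assumption: it supposes each history line holds the input coordinates followed by the blackbox outputs, separated by whitespace. The output imitates the cache file contents shown earlier in this thread, but NOMAD's reader may be stricter about formatting, so treat the result as a starting point. `history_to_cache_lines` is a hypothetical helper.

```python
def history_to_cache_lines(history_text, n_inputs, bb_output_type="OBJ"):
    """Turn assumed history lines ("x1 ... xn f") into cache-style lines."""
    lines = ["CACHE_HITS 0", f"BB_OUTPUT_TYPE {bb_output_type}"]
    seen = set()
    for raw in history_text.splitlines():
        fields = raw.split()
        if len(fields) <= n_inputs:
            continue  # skip blank or incomplete lines
        point = tuple(fields[:n_inputs])
        if point in seen:
            continue  # keep only the first evaluation of each point
        seen.add(point)
        outputs = fields[n_inputs:]
        lines.append(f"( {' '.join(point)} ) EVAL_OK ( {' '.join(outputs)} )")
    return lines


# Three points from the Booth example, with one duplicate and one blank line.
sample_history = "0 0 74.0\n2 2 2.0\n0 0 74.0\n\n1 3 0.0\n"
for line in history_to_cache_lines(sample_history, n_inputs=2):
    print(line)
```

As noted in the reply, a cache rebuilt this way keeps the evaluated points but not the algorithmic state, such as the current mesh size.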