
Restarting MPI Runs Fails at the End

Open kabeleh opened this issue 3 years ago • 8 comments

I am facing an issue whenever I try to rerun multiple chains using MPI.

Error Message

Traceback (most recent call last):
  File "/home/MontePython/montepython_public-3.4/montepython/MontePython.py", line 40, in <module>
    sys.exit(run())
  File "/home/MontePython/montepython_public-3.4/montepython/run.py", line 45, in run
    sampler.run(cosmo, data, command_line)
  File "/home/MontePython/montepython_public-3.4/montepython/sampler.py", line 46, in run
    mcmc.chain(cosmo, data, command_line)
  File "/home/MontePython/montepython_public-3.4/montepython/mcmc.py", line 873, in chain
        9.995863e+02    9.990611e+02    9.944149e-01    7.081161e+00    2.480040e-01    1.470196e+02    1.049636e+00    2.065614e+00    8.120400e-01    1.790309e+01    5.000000e-01

#  25 steps done, acceptance rate: 0.32
    os.remove(command_line.restart)
FileNotFoundError: [Errno 2] No such file or directory: '/home/MontePython/montepython_public-3.4/output/test20210311_17/2021-03-12_25__2.txt'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
6  1642 2.295168e+00    1.160566e-01    6.986556e+01    3.012042e+00    9.703636e-01    5.319983e-02    8.456681e+01    2.691049e-01    2.956859e-01    3.981788e+01    6.723229e-01    7.274396e+00    2.547544e+02    4.318886e+01 4.763726e+01     1.235131e+02    4.862358e+00    1.048760e+01    1.247747e+01    1.490125e+01    9.607071e+01    1.091036e-01    1.536763e-01    5.213959e-01    2.286143e-01    6.404129e-01    1.868812e+00    1.000118e+03    9.993404e+02  9.990027e-01    7.316299e+00    2.481412e-01    1.474556e+02    1.037364e+00    2.032887e+00    8.087512e-01    2.372198e+01    5.000000e-01
2  1672.69      2.271037e+00    1.197619e-01    7.002441e+01    3.047526e+00    9.782641e-01    4.638705e-02    5.366766e+01    2.769166e-02    2.574980e-01    4.357077e+01    6.094755e-02    5.708142e+00    2.340916e+02    4.861721e+01  3.908820e+01    1.273558e+02    2.052170e+00    1.359758e+01    1.139469e+01    1.806915e+01    9.108566e+01    1.308254e-01    1.183487e-01    4.107536e-01    1.893368e-01    6.555052e-01    1.707165e+00    9.999033e+02 9.987294e+02     1.004462e+00    6.744949e+00    2.481238e-01    1.466858e+02    1.039027e+00    2.106316e+00    8.443838e-01    1.590343e+01    5.000000e-01
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[19966,1],10]
  Exit code:    1
--------------------------------------------------------------------------

Steps to Reproduce the Error

First, I start in a fresh folder with just a single chain, because when I start directly with an MPI run I already get MPI errors: python montepython/MontePython.py run -p /home/MontePython/montepython_public-3.4/input/my_base2018TTTEEE.param -o output/test20210311_17 -N 25 --superupdate 20

This produces the file 2021-03-12_25__1.txt.

Second, I run multiple chains: mpirun -np 16 python montepython/MontePython.py run -p /home/MontePython/montepython_public-3.4/input/my_base2018TTTEEE.param -o output/test20210311_17 -N 25 --superupdate 20

Sometimes this also fails; then I just run the same command again (and again and again) until it works. Sometimes it works at the first attempt. Say it runs at the first try (which the example I'm posting here actually did) and produces the files 2021-03-12_25__2.txt to 2021-03-12_25__17.txt. Please note that it started at number 2.txt.

Then, I decide to have longer chains. Therefore, I run

mpirun -np 16 python montepython/MontePython.py run -r /home/MontePython/montepython_public-3.4/output/test20210311_17/2021-03-12_25__2.txt -o output/test20210311_17 -N 25 --superupdate 20, where I pass the chain file with the smallest index belonging to this chain set (2.txt). When I choose file number one (1.txt) it fails right away, but with file number two the run starts and runs for the usual duration. When the run comes to an end, most chains successfully report 25 steps done, acceptance rate: 0.16, except the very last one, where the error above occurs. The numbers in between the error message are probably part of the last MontePython output, which gets interrupted by the error message.

Additional Intel

I am running this on my workstation with a Ryzen 3950X on Linux kernel 5.10, with Planck 2018 data (plc_3.0) and a modified version of CLASS (based on CLASS 2.9.4), which runs fine on its own or without parallelization in MontePython, so I doubt that CLASS is causing the issue. MontePython is version 3.4, Python is version 3.9.2.

Is there a simple step I am missing to successfully restart my chains? Thank you very much in advance for your help. :)

kabeleh avatar Mar 12 '21 16:03 kabeleh

Hi @kabeleh ,

Can you check with a larger sample, say at least -N 10000 points? See if the problem still persists or not.

best, shan

shan1525 avatar Mar 16 '21 18:03 shan1525

I tried a few things and can now report the following:

  • Restarting long chains and aborting them manually (CTRL+C in the terminal) works fine.
  • Starting e.g. a -N 100000 chain and letting it finish works. Restarting the same chain with -N 50000 creates a new 150000 chain, but throws the error I stated in the initial post when the chains end.

Does it try to do some cleanup when a restarted chain ends, like deleting the original chain? It deletes only file number 1 of a chain (the others remain), but complains about not finding that same file number 1. But file number 1 was there when I restarted the chain; in fact, I dragged and dropped the number 1 file into the terminal to restart the chain!

kabeleh avatar Mar 24 '21 11:03 kabeleh

Hi @kabeleh ,

I think that when you do the first run without MPI and then run with MPI pointing at the same folder, it does not overwrite the older files but generates new ones starting at something_2.txt. Can you try, instead of naming the folder output/test20210311_17, using something like output/kabeleh after a single chain run, and then in the restart command passing -r output/kabeleh/something_1.txt? Try it and see whether it works or not.

best, shan.

shan1525 avatar Mar 24 '21 12:03 shan1525

I am very sorry, but I cannot fully follow what you are suggesting. I have to start with a single chain because when I start with an MPI run directly, it complains about not being able to create the folder. Would you suggest creating the folder manually before starting the run, and then starting directly with an MPI run into a fresh folder, to see if that helps?

kabeleh avatar Mar 24 '21 12:03 kabeleh

Sorry for not being clear. Yes, start with a fresh folder for the MPI run, and then check whether it generates the _1.txt file or not.

shan1525 avatar Mar 24 '21 12:03 shan1525

Hi Kabeleh, I'll go through the common workflow for launching and restarting runs and what happens at each step; maybe this will elucidate some things.

You can create the directory and log.param with a simple run call, e.g. python montepython/MontePython.py run -p input/your_param_file.param -o chains/your_chains_directory/ -f 0. If you don't pass -N, it will default to -N 10 and create a file named date_10__1.txt.

From here you can start your chains as normal, e.g. mpirun -n 8 python montepython/MontePython.py run -p input/your_param_file.param -o chains/your_chains_directory/ -c covmat/your_covmat.covmat -N 50000 --superupdate 20 (or whatever flags you want to use). This will create the files date_50000__1.txt through date_50000__8.txt.

When this is done running and you want to restart, you would then do mpirun -n 8 python montepython/MontePython.py run -p input/your_param_file.param -o chains/your_chains_directory/ -r chains/your_chains_directory/date_50000__1.txt -N 50000 --superupdate 20 (note that at this point it will load the covmat and jumping factor in the directory even if you pass one). This will create the files newdate_100000__1.txt through newdate_100000__8.txt.

The ideal user behavior is to delete the old files once the new ones have been copied over (i.e. the new file sizes are equal to or greater than the old ones). It is okay to let them stay for a while, but eventually they should be removed, especially across multiple restarts or before analyzing your chains. The automatic file removal is not very good currently, I apologize for that; indeed, it only attempts to remove the files at the end of a successfully terminated run (i.e. one where the number of steps was reached).
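As an illustration of the size criterion above, a minimal manual clean-up could look like the sketch below; the directory and run prefixes are placeholders following the naming scheme described in this thread, not something shipped with MontePython.

# clean_old_chains.py -- hedged sketch of the manual clean-up described above:
# delete an old chain file only once the restarted run has produced a new file
# with the same index that is at least as large (placeholder paths and names).
import glob
import os

CHAIN_DIR = "chains/your_chains_directory"   # placeholder path
OLD_RUN = "date_50000"                       # prefix of the run you restarted FROM
NEW_RUN = "newdate_100000"                   # prefix of the restarted run

for old_path in glob.glob(os.path.join(CHAIN_DIR, OLD_RUN + "__*.txt")):
    index = old_path.rsplit("__", 1)[-1]     # e.g. "1.txt"
    new_path = os.path.join(CHAIN_DIR, NEW_RUN + "__" + index)
    # Only remove the old chain if the new one exists and is at least as large,
    # i.e. the old points have been copied over.
    if os.path.exists(new_path) and os.path.getsize(new_path) >= os.path.getsize(old_path):
        print("removing", old_path)
        os.remove(old_path)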

I suspect most of your issues stem from the fact that you create an initial chain file with the same automatically generated name and index __1 in order to create the directory and log.param (I agree that on many systems it is necessary to create the directory and log.param first, and it is good practice). Instead, don't pass -N (or at least not with the same number), so you can be sure it creates the file with a different name and the actual run will start with file indexes __1 and up. As time permits, clean up old files manually to make sure they don't interfere with the run or the results (it's not crucial, but it should be done with some regularity, especially if you do short runs).

Note that you also have the ability to directly control the index with --chain-number <int> (passing the index), but this should not be necessary on most systems. In fact, for Metropolis-Hastings MontePython doesn't particularly benefit from MPI, and you could launch everything in parallel using the --chain-number approach, but from how you describe your problem I don't think this is necessary.
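For completeness, the --chain-number approach could be scripted along the lines of the sketch below; the paths, flags and chain count are placeholders taken from the workflow above, so treat it as an illustration rather than a recipe.

# launch_chains.py -- hedged sketch: launch several independent chains without MPI,
# giving each process an explicit index via --chain-number (placeholder paths/values).
import subprocess

n_chains = 8
procs = []
for i in range(1, n_chains + 1):
    cmd = ["python", "montepython/MontePython.py", "run",
           "-p", "input/your_param_file.param",
           "-o", "chains/your_chains_directory/",
           "-N", "50000", "--superupdate", "20",
           "--chain-number", str(i)]
    procs.append(subprocess.Popen(cmd))

for p in procs:   # wait for all chains to finish
    p.wait()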

Best, Thejs

brinckmann avatar Mar 24 '21 14:03 brinckmann

Also, this is clearly a bug in the default behavior of the part of the code that deletes the old chain files after a successfully completed run: it gets the indices wrong when the first chain you restarted from has an unexpected index. But it's one that should be easy to bypass.
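Until that is fixed, a minimal local workaround, judging purely from the traceback above (which shows a bare os.remove(command_line.restart) in montepython/mcmc.py around line 873 in v3.4), would be to make that removal tolerant of an already-missing file. This is a sketch against that one line, not an official patch; command_line is the object already available at that point in mcmc.py, and os is already imported in that module.

# In montepython/mcmc.py (around line 873 in v3.4), replace the bare
#     os.remove(command_line.restart)
# with a guarded version:
try:
    os.remove(command_line.restart)
except OSError:
    # The restart file may already have been removed (e.g. by another MPI
    # process) or may have an unexpected index; don't crash at the very end.
    pass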

Best, Thejs

brinckmann avatar Mar 24 '21 14:03 brinckmann

I see, the example I gave was bound to fail, since I used the same number of steps. I apologise for the confusion it caused... However, the issue remains when I create the folder and log.param with an initial run with -N 10, do a "real run" with mpirun -np 8 and -N 100000, and try to restart the chains later with mpirun -np 8 and -N 50000. But as far as I can tell, the chains it creates are fine; it's more that the terminal output ends with the error message and that it does not automatically delete the copied-over chains.

Thank you very much for writing down a detailed workflow! :)

kabeleh avatar Mar 26 '21 11:03 kabeleh