grasp MPI Runtime killed on signal 9 in rangular

Hi, I'm trying to calculate a multi-reference system that has nearly (sometimes over) one million states in one block. When I run the rangular_mpi program with MPI, it always reports the following.

Sorting 5039795 T coefficients ... 31
Sorting 599162331 V(k=1) coefficients ... 33
Sorting 91967402 V(k=0) coefficients ... 32
Sorting 599092522 V(k=1) coefficients ... 33

mpirun noticed that process rank 2 with PID 0 on node qy-PC exited on signal 9 (Killed).

Please help me solve this problem. Best wishes, Yenoch

Jan 13 '21 13:01 YenochQin

Hi Yenoch,

A Google search for "mpirun signal 9" leads me to guess that you are simply running out of memory (on node qy-PC, if you are using a multi-node cluster or supercomputer). Use top or htop to monitor your run, and check whether data is being written to swap. If it is, you have hit your computational wall, memory-wise. It could be something else, but with the limited information you've provided it is hard for me to guide you any further. As general advice, make sure you include correlation in careful steps, as effectively as possible, rather than adding a bunch of core correlation that does not necessarily improve the physical quantities you are targeting.
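If you would rather log memory use than watch top or htop by hand, here is a minimal Python sketch. It assumes a Linux machine where /proc/meminfo is available; the function names (parse_meminfo, memory_summary, watch) are made up for illustration and are not part of GRASP or MPI.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_summary(info):
    """Return the fraction of RAM in use and the amount of swap in use (kB)."""
    total = info["MemTotal"]
    # MemAvailable is missing on very old kernels; fall back to MemFree.
    available = info.get("MemAvailable", info.get("MemFree", 0))
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return {
        "ram_used_fraction": (total - available) / total,
        "swap_used_kb": swap_used,
    }


def watch(interval_s=5.0, samples=10, path="/proc/meminfo"):
    """Print a memory summary every interval_s seconds, samples times."""
    for _ in range(samples):
        with open(path) as handle:
            summary = memory_summary(parse_meminfo(handle.read()))
        print(
            f"RAM used: {summary['ram_used_fraction']:.1%}, "
            f"swap used: {summary['swap_used_kb']} kB"
        )
        time.sleep(interval_s)
```

Calling watch() in a second terminal while rangular_mpi is running gives a rough picture: if the RAM fraction climbs towards 100% and swap use starts to grow just before the job is killed, memory is the likely culprit.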

Cheers, Jon

Jan 13 '21 15:01 jongrumer

Hi Jon,

Thank you very much for your reply. I will try to enlarge my desktop's swap drive. But the same thing happens on our group's HPC system, which has 98 GB of memory and 16 GB of swap. Is that enough for the calculation?

Have a nice day, Yenoch

Jan 14 '21 11:01 YenochQin

No, don't enlarge the swap; that won't do anything, and the calculation will crash anyway. Swapping should generally be avoided at all times. The amount of RAM required is not straightforward to predict and depends on the system you're calculating, how dense the interaction matrix is, and so on. As I said, you need to monitor your calculations with e.g. top or htop. A calculation can easily grow to require hundreds of GB of RAM, so 98 GB might be too little in your case. I recommend that you study e.g. the MCHF book (Fischer, Brage and Jönsson) on how to design an efficient correlation model, or some other reference. Also consider discussing HPC with someone knowledgeable, like a senior researcher or, maybe even better, a suitable sysadmin.

Cheers! Jon

Jan 14 '21 12:01 jongrumer

Oh, and don't forget to also monitor the convergence of the energy levels (and whatever physical properties you are after) after each layer of MCDHF; this is crucial. But you know this, I'm sure :)
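A rough way to keep track of this is sketched below in Python. It assumes you have already collected the energies of the targeted levels after each layer into a list of lists (one inner list per layer, same level ordering throughout); the function names and the tolerance are hypothetical and not part of GRASP.

```python
def layer_changes(energies_by_layer):
    """Largest absolute change of any level between successive layers."""
    changes = []
    for previous, current in zip(energies_by_layer, energies_by_layer[1:]):
        changes.append(max(abs(c - p) for p, c in zip(previous, current)))
    return changes


def is_converged(energies_by_layer, tolerance):
    """True if the most recent layer moved every level by less than tolerance."""
    changes = layer_changes(energies_by_layer)
    return bool(changes) and changes[-1] < tolerance
```

The tolerance is yours to choose and should reflect the accuracy you actually need for the quantity you are after. If the changes stop shrinking from layer to layer, adding more of the same correlation is probably not buying you anything.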

Jon

Jan 14 '21 12:01 jongrumer

OK, thank you for your advice. ; )

Jan 14 '21 13:01 YenochQin
