Large case version

Open jongrumer opened this issue 7 years ago • 12 comments

In order to solve for (optimise on) all eigenstates belonging to a group of configurations having more than "a few" f-electrons (or f-holes), e.g. Dy I, II or Gd I, II, we have to modify many of the main routines (rmcdhf, jj2lsj, rlevels, rci, rbiotransform, rtransition etc.).

Typically the codes (in particular the eigenvalue solvers) are not set up to deal with more than 999 eigenstates per block, a limit which is set by locally hardcoded parameters in the codes. This, together with various formatting issues (e.g. numbers becoming too long, missing high J-value labels etc.), causes the codes to crash for many large cases.

The main motivation is to be able to perform calculations on configurations of the Lanthanides (open 4f-systems).

Gediminas @gaigalas has extended jj2lsj and the codes which depend on its output (rlevels + transition codes). I have managed to successfully optimise on blocks of about 5000 eigenstates using a modified version of rmcdhf (mainly CNUM/CLEVELS and N1000).

In addition, and from a more long-term perspective, Per @tspejo suggested to dig up the FEAST diagonalisation method (to replace Davidson/Lapack) which was implemented in an earlier version of GRASP by Per Andersson. We should try to get this version running. For now I keep it in my personal repo: https://github.com/jongrumer/grasp_dev_feast

Update: Andreas Stathopoulos suggests, via @cffischer, that we switch to his more robust PRIMME routine.

Update: I've started adding Gediminas' and my own smaller mods to build a well-documented large-case version on a fork of the grasp2018 repo under my own user: https://github.com/jongrumer/grasp2018/tree/jg/large. I'll keep it like that for now. Feel free to join in if you want to!

jongrumer avatar Nov 12 '18 15:11 jongrumer

Dear Jon,

What does “… a bit with help from Gediminas” mean? Are you improving these programs further?

Gediminas

gaigalas avatar Nov 12 '18 15:11 gaigalas

Hi Gediminas! Nice to hear from you!

I have worked more on RMCDHF, RCI and RTRANSITION (the latter two give weird results for cases with +1999 eigenstates/block), but not on JJ2LSJ or on the modifications you did to RLEVELS and the transition codes in Malmö; that part seems to work fine :) I haven't had time to look at this during the past month, but will start working on it again. I will keep you in the loop! I will add all the modifications you've done to the development branch I have created for the large-case version of GRASP2018 (https://github.com/compas/grasp2018/tree/dev_large_jon) and clearly give you credit for the additions you did. I will do this in the coming days.

@gaigalas You can chat here https://github.com/compas/grasp2018/issues/2 instead of sending emails, if you want to. A bit easier.

Cheers, Jon

jongrumer avatar Nov 12 '18 16:11 jongrumer

Hello Jon,

Thank you very much for the information. :-) I will look at this version. It can be interesting for me as well.

Best wishes, Gediminas

gaigalas avatar Nov 12 '18 16:11 gaigalas

@gaigalas Yes! I will let you know when I have something that is a bit more stable :)

jongrumer avatar Nov 12 '18 16:11 jongrumer

Hello,

But my modifications of jj2lsj, rlevels, and rtransition work fine? Please tell me if you have some problems with them.

Gediminas

gaigalas avatar Nov 12 '18 16:11 gaigalas

@gaigalas Yes, that part seems fine! It's more on the diagonalization side of things where I have most problems currently. As well as in the calculation of the matrix elements in rtransition.

By the way Gediminas: you should try to write directly in the chat on github instead of sending emails, it's a bit easier and then the rest of the group don't have to read our conversation - they might be bored hehe ;) Just click on this link: https://github.com/compas/grasp2018/issues/2 (you might have to log in!)

jongrumer avatar Nov 13 '18 13:11 jongrumer

Our GRASP code has never been properly parameterized. It has always been easier to "pick a number" than think in terms of a parameter that can easily be modified. The other solution is to make dimensions adaptive -- this might mean analysing the problem to find out what size is needed. For example, the input data for diagonalization could be analyzed to see what size of memory is needed. There is no reason why a maximum of 1000 (or 999?) cannot be changed to 10000 but that has implications on the memory that will be requested even for small cases. It is always better to be adaptive because then small cases will not need lots of memory.
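To put rough numbers on the fixed-versus-adaptive trade-off described above, here is a minimal Python sketch. It is not GRASP code: the function names, the limit of 10000 states and the example sizes are hypothetical, and it assumes a dense workspace of double-precision reals, which may not match how the eigensolver actually lays out its arrays.

```python
BYTES_PER_REAL = 8           # assumption: double-precision reals
FIXED_MAX_STATES = 10_000    # hypothetical hard-coded limit, not a GRASP value


def workspace_bytes(n_states, n_csfs, bytes_per_real=BYTES_PER_REAL):
    """Bytes for a dense array holding n_states vectors of length n_csfs."""
    return n_states * n_csfs * bytes_per_real


def fixed_allocation(n_csfs):
    """Always reserve room for the hard-coded maximum number of states."""
    return workspace_bytes(FIXED_MAX_STATES, n_csfs)


def adaptive_allocation(requested_states, n_csfs):
    """Reserve only what the input actually asks for."""
    return workspace_bytes(requested_states, n_csfs)


if __name__ == "__main__":
    n_csfs = 50_000  # hypothetical small case
    print("fixed   :", fixed_allocation(n_csfs) / 1e9, "GB")
    print("adaptive:", adaptive_allocation(59, n_csfs) / 1e9, "GB")
```

With these illustrative numbers the fixed limit reserves gigabytes for a case that needs only tens of megabytes, which is the cost for small cases mentioned above.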

By the way, Andreas Stathopoulos suggests we switch to his more robust PRIMME routine. It is a C program, which is not code I ever want to read. We may test it in 2019.

cffischer avatar Nov 26 '18 16:11 cffischer

Hello Gediminas, Jon, and Charlotte (hello everyone) Kai and I are performing large-scale calculations on Th+ with Grasp2018. We encounter a serious problem at some stage of an n=8 layer optimisation run involving matrices of reasonable sizes. The error message is "Program received signal SIGBUS: Access to an undefined portion of a memory object" and appears in the first call to davidson with a total number of 7139475 CSFs (for all blocks). The variable nvecsize reaches the value of 292831473 and the error occurs for the diagonalization of the 4th block of 1286401 CSFs from which we want to get the first 59 eigenvectors. Note that we used the iccut=59 option to keep the sizes of the H matrix manageable. As far as we understand from the above messages related to this issue, our problematic case does not seem to be related to the limit of 999 eigenstates/block. So what is it? Could we use Jon's or Gediminas' modified version of rmcdhf to test it? Thanks in advance for your advice. Michel and Kai

mrgodef avatar Nov 30 '18 09:11 mrgodef

@mrgodef Maybe, I don't recognise that particular error message. I've mostly worked on getting the code to run for larger sets of eigenvalues (up to ~10k) rather than large expansions. But once I have time to carefully, in a documented step-by-step manner, implement the improvements by Gediminas and myself you should try. Don't have time currently though, there's a lectorship in Malmö that should be applied to and students that need a teacher :) But hopefully I'll get back to this before Christmas. Great to see us collaborating on github btw, yay!

jongrumer avatar Nov 30 '18 12:11 jongrumer

@mrgodef The error Michel describes could also be a system error. One of the negatives about the underlying assumptions in GRASP is that it leads to a few huge memory allocations, in the range of several GBytes, rather than many smaller allocations. This can be a challenge in a shared memory environment with many users. The user assumes this is contiguous memory but the system may have to assign a linked list of memory. So the first step is to see whether the message is related to our code and run it again, preferably with more processors so that the memory assignment is smaller.
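As a rough check on the "several GBytes" scale, the figures Michel reports can be converted into bytes. This is only arithmetic, sketched in Python: it assumes 8-byte double-precision reals and that nvecsize counts array elements, neither of which is confirmed in this thread, and it says nothing about what actually triggered the SIGBUS.

```python
BYTES_PER_REAL = 8  # assumption: double-precision reals


def gib(n_elements, bytes_per_element=BYTES_PER_REAL):
    """Size in GiB of an array with n_elements entries."""
    return n_elements * bytes_per_element / 2**30


# Values quoted in the report above.
nvecsize = 292_831_473    # reported value of nvecsize
block_csfs = 1_286_401    # CSFs in the 4th block
n_eigenvectors = 59       # eigenvectors requested for that block

if __name__ == "__main__":
    # Assumption: nvecsize is a count of reals, not bytes.
    print(f"nvecsize as reals      : {gib(nvecsize):.2f} GiB")
    # Dense storage of the requested eigenvectors for the 4th block alone.
    print(f"59 vectors of the block: {gib(block_csfs * n_eigenvectors):.2f} GiB")
```

Under these assumptions a single array of nvecsize reals is a bit over 2 GiB, so one allocation of that size is already in the range described above.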

cffischer avatar Nov 30 '18 16:11 cffischer

I searched the internet for the error message Michel and Kai report. Here is a helpful site: https://stackoverflow.com/questions/212466/what-is-a-bus-error. The comment I found interesting was:


"I'd like to add a simple explanation for both: Segmentation fault means that you are trying to access memory that you are not allowed to (e. g. it's not part of your program). However, on a bus error it usually means that you are trying to access memory that does not exist (e. g. you try to access an address at 12G but you only have 8G memory) or if you exceed the limit of usable memory. – xdevs23 Oct 3 '17 at 13:45"


So it appears that there was not enough memory available which should have resulted in an "alloc" error.

One of the things I dislike about GRASP is the way CSFs are stored. In every process, the CSF list is stored, requiring a number of arrays of size NCFxNNNW (where NNNW is the largest number of orbitals). In parameter_def_M.f90 this is defined to be 127. How many do you actually have? Since NCF is so large, you can save a lot of memory on every processor by compiling a version for which NNNW is smaller. The size of the matrix can be controlled by increasing the number of processors, and often this helps. It is possible that you want to use the disk version, where the matrix is stored on disk and more memory is then available for Dvdson. RCI will automatically switch to the disk version if the matrix is too large. Maybe you want to change this parameter so that it switches to the disk version sooner.

The above assumes all CSFs are read before the processing begins. Guess that needs to be checked.
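A back-of-the-envelope Python sketch of the NCFxNNNW estimate follows. The bytes-per-entry value and the reduced NNNW of 60 are hypothetical choices for illustration, and the sketch assumes the full CSF list for all blocks is held at once, which (as noted above) still needs to be checked.

```python
def csf_array_bytes(ncf, nnnw, bytes_per_entry=1):
    """Bytes for one NCF x NNNW array; bytes_per_entry is an assumption."""
    return ncf * nnnw * bytes_per_entry


ncf_total = 7_139_475  # total number of CSFs reported for all blocks

if __name__ == "__main__":
    # 127 is the value quoted from parameter_def_M.f90; 60 is a hypothetical
    # smaller value for a case with fewer orbitals.
    for nnnw in (127, 60):
        size = csf_array_bytes(ncf_total, nnnw)
        print(f"NNNW = {nnnw:3d}: {size / 1e9:.2f} GB per array, per process")
```

Because each process holds its own copy of every such array, the saving from a smaller NNNW is multiplied by both the number of arrays and the number of processes on a node.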

cffischer avatar Nov 30 '18 20:11 cffischer

Hello Michel,

Thank you very much for the information. At the moment I am in Beijing, so I can't spend much time on this at present. But I can try to run it on my computer if you send me your case. Maybe we will have some ideas after this run.

Best wishes, Gediminas @China

gaigalas avatar Dec 01 '18 03:12 gaigalas