
MCMC step stuck using the ESPEI 0.8.3 version

Open HaihuiZhang opened this issue 4 years ago • 10 comments

Dear Brandon,

There were some problems when I was using version 0.8.3. The MCMC step has been taking a very long time and finally got stuck at the beginning; the log is also empty and only shows some warnings. Would you help me with this problem? condalist.txt

/lustre/home/acct-msezbb/msezbb/.conda/envs/espei2021/lib/python3.9/site-packages/ipopt/__init__.py:13: FutureWarning: The module has been renamed to 'cyipopt' from 'ipopt'. Please import using 'import cyipopt' and remove all uses of 'import ipopt' in your code as this will be deprecated in a future release.
  warnings.warn(msg, FutureWarning)
/lustre/home/acct-msezbb/msezbb/.conda/envs/espei2021/lib/python3.9/site-packages/cyipopt/utils.py:43: FutureWarning: The function named 'setLoggingLevel' will soon be deprecated in CyIpopt. Please replace all uses and use 'set_logging_level' going forward.
  warnings.warn(msg, FutureWarning)

HaihuiZhang avatar May 10 '21 02:05 HaihuiZhang

Those two warnings are safe to ignore and will go away when pycalphad 0.8.5 is released.

In 0.8 and later, the initial MCMC startup time will likely be a little longer, but overall each iteration should be the same or slightly faster. Can you provide some comparisons of the time to call the likelihood function that is printed out with the verbosity set to 2, in both 0.8.3 and the latest 0.7.x release that you had working?
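For reference, verbosity is set in the output section of the ESPEI input YAML file. A minimal sketch of that section, with a placeholder database name:

output:
  output_db: my_database.tdb   # placeholder name
  verbosity: 2                 # 2 enables TRACE output, including the likelihood timing lines

At verbosity 2 the log should show how long each call to the likelihood function takes, which is the number I'd like to compare between 0.7.x and 0.8.3.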

bocklund avatar May 10 '21 03:05 bocklund

It was stuck for almost three days with no log output, but previous versions converged in just three hours.

HaihuiZhang avatar May 10 '21 03:05 HaihuiZhang

It takes about 3 hours to finish running with 0.7.9+3.gd4625e7=dev_0. I have set the verbosity to 2, but the log shows nothing after almost 3 days of running:

INFO:espei.espei_script - espei version 0.8.2
INFO:espei.espei_script - If you use ESPEI for work presented in a publication, we ask that you cite the following paper: B. Bocklund, R. Otis, A. Egorov, A. Obaied, I. Roslyakova, Z.-K. Liu, ESPEI for efficient thermodynamic database development, modification, and uncertainty quantification: application to Cu-Mg, MRS Commun. (2019) 1-10. doi:10.1557/mrc.2019.59.
TRACE:espei.espei_script - Loading and checking datasets.
TRACE:espei.espei_script - Finished checking datasets

HaihuiZhang avatar May 10 '21 06:05 HaihuiZhang

TRACE:espei.espei_script - Loading and checking datasets.
TRACE:espei.espei_script - Finished checking datasets

After these steps the dask server usually starts. Maybe your dask server is not starting correctly. Can you make progress with setting scheduler: null?
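If it helps, the scheduler option lives under the mcmc section of the ESPEI input YAML file. A minimal sketch, with placeholder values for the other keys:

mcmc:
  iterations: 1000            # placeholder
  input_db: my_database.tdb   # placeholder
  scheduler: null             # run in serial, without starting a dask cluster

With scheduler: null, ESPEI does not start dask at all, so it's a quick way to check whether dask startup is what is hanging.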

bocklund avatar May 11 '21 01:05 bocklund

I use the school's high-performance computing center to run, so I don't know how to set this up. Could you teach me how to set it up? All previous versions could run on this platform.

HaihuiZhang avatar May 11 '21 02:05 HaihuiZhang

Can you check whether turning off the scheduler works? I want to make sure everything else is working correctly first. https://espei.org/en/latest/writing_input.html#scheduler

bocklund avatar May 11 '21 02:05 bocklund

According to the solution you provided, MCMC has started to run normally. Thank you for your help. May I ask what causes this problem? log2.txt

HaihuiZhang avatar May 11 '21 02:05 HaihuiZhang

According to the solution you provided, MCMC has started to run normally.

Great, so it looks like starting dask for parallelization was indeed the issue.

May I ask what causes this problem?

I'm not sure yet, but I think we can figure it out 🙂. ESPEI is intended to work on HPCs and works well when using one compute node without any special configuration.

  1. Are you trying to use scheduler: dask on your cluster or a scheduler file with MPI? (Both forms are sketched below.)
  2. Have you tried again with dask as the scheduler to verify that it's still not working?
  3. Are you trying to run on one node or multiple nodes? Any other relevant details from your HPC setup or batch submission file (if relevant) would be helpful.
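For reference, both forms in the first question are set with the same scheduler key in the mcmc section of the input file. A sketch, treating the file name as a placeholder and assuming a dask scheduler file such as one written by dask-mpi, per the scheduler documentation linked above:

mcmc:
  scheduler: dask                  # default: start a dask LocalCluster on the current node
  # scheduler: my-scheduler.json   # alternative: connect to an existing dask scheduler file (e.g. from dask-mpi)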

bocklund avatar May 11 '21 12:05 bocklund

  1. Since my computer keeps reporting errors after installing Conda, I have been using the school's HPC with scheduler: dask. The MCMC stuck problem also occurred on the school cluster.
  2. Today I tested scheduler: dask again for 12 hours, but it is still stuck with no calculation and no log output.
  3. I used 40 cores on one node when running on the HPC. This is the distributed.yaml file on the cluster:

distributed.txt

HaihuiZhang avatar May 11 '21 13:05 HaihuiZhang

ESPEI basically starts a dask cluster this way:

import multiprocessing
from dask.distributed import LocalCluster, Client

# One single-threaded worker process per CPU core, with dask's memory limit disabled
cores = multiprocessing.cpu_count()
cluster = LocalCluster(n_workers=cores, threads_per_worker=1, processes=True, memory_limit=0)
client = Client(cluster)
# If the cluster started correctly, this prints the scheduler address and the connected workers
print(client.scheduler_info())

Can you run a Python script containing this and see whether it successfully starts? The dask documentation may be helpful for you to review. This may require help from your HPC administrator.

bocklund avatar May 11 '21 14:05 bocklund