abacus-develop

elpa issue?

Open AroundPeking opened this issue 2 years ago • 10 comments

Describe the bug

I wonder what is causing the reported error. I have tried various versions of ABACUS, but all of them showed this error. When I moved to another server, no error occurred. image

Expected behavior

No response

To Reproduce

Here are my input and output files. ABACUS version: commit fe3e4b313db1858d62c2cd6b92102d887c8abe65 (2023.07.23). Files: task.002.tar.gz

Environment

No response

Additional Context

No response

AroundPeking avatar Aug 10 '23 08:08 AroundPeking

@AroundPeking Thanks for your report. To rule out the OpenMP code as a possible cause, could you please run the job again with OMP_NUM_THREADS=1? You just need to add this before mpirun in your job submission script. Please let us know the results.
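
A minimal sketch of the suggestion above, written in Python so that the environment variable and the launch command are explicit. The rank count of 4, the executable name `abacus`, and the helper name `build_launch` are assumptions for illustration, not values taken from this issue.

```python
import os
import shlex

# Shell equivalent of what this sketch builds (assumed rank count and binary name):
#   OMP_NUM_THREADS=1 mpirun -n 4 abacus


def build_launch(n_ranks, executable="abacus", omp_threads=1):
    """Return (env, argv) for an MPI launch with a fixed OpenMP thread count.

    Hypothetical helper: it only prepares the environment and the command,
    it does not start anything.
    """
    env = dict(os.environ)                      # copy, so the caller's environment is untouched
    env["OMP_NUM_THREADS"] = str(omp_threads)   # one OpenMP thread per MPI rank by default
    argv = ["mpirun", "-n", str(n_ranks), executable]
    return env, argv


env, argv = build_launch(n_ranks=4)
print("OMP_NUM_THREADS=" + env["OMP_NUM_THREADS"], shlex.join(argv))

# To actually launch the job (requires mpirun and the executable on PATH):
#   import subprocess
#   subprocess.run(argv, env=env, check=True)
```

If the error persists with a single OpenMP thread per rank, as reported in the next comment, the threaded code paths become a less likely cause.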

hongriTianqi avatar Aug 10 '23 23:08 hongriTianqi

The same error with OMP_NUM_THREADS=1. image

AroundPeking avatar Aug 11 '23 01:08 AroundPeking

@caic99 Would you please look into this issue?

hongriTianqi avatar Aug 14 '23 04:08 hongriTianqi

> @caic99 Would you please look into this issue?

@hongriTianqi Sorry, I have no idea about it.

caic99 avatar Aug 14 '23 07:08 caic99

@AroundPeking Has this issue been solved? If not, what is suggested for improving the performance of ABACUS later? Thank you.

hongriTianqi avatar Sep 08 '23 01:09 hongriTianqi

@AroundPeking

  1. What is your environment for building ABACUS?
  2. Have you tried the toolchain method?

QuantumMisaka avatar Sep 08 '23 01:09 QuantumMisaka

  • [x] Verify the issue is not a duplicate.
  • [x] Describe the bug.
  • [ ] Steps to reproduce.
  • [ ] Expected behavior.
  • [ ] Error message.
  • [ ] Environment details.
  • [ ] Additional context.
  • [ ] Assign a priority level (low, medium, high, urgent).
  • [ ] Assign the issue to a team member.
  • [ ] Label the issue with relevant tags.
  • [ ] Identify possible related issues.
  • [ ] Create a unit test or automated test to reproduce the bug (if applicable).
  • [ ] Fix the bug.
  • [ ] Test the fix.
  • [ ] Update documentation (if necessary).
  • [ ] Close the issue and inform the reporter (if applicable).

hongriTianqi avatar Sep 08 '23 04:09 hongriTianqi

> @AroundPeking Has this issue been solved? If not, what is suggested for improving the performance of ABACUS later? Thank you.

@hongriTianqi I have not solved it yet. It seems related to my build environment: one of my servers works normally with intel20u4, but another server now fails with oneapi2022.3, while the rest of the software is the same.

I also found it strange that it works correctly when I only change the STRU from the larger system of 400+ atoms in task.002.tar.gz to a smaller one of 50+ atoms in test.zip.

> @AroundPeking
>
>   1. What is your environment for building ABACUS?
>   2. Have you tried the toolchain method?

@QuantumMisaka

  1. Here is my environment: loaded gcc11.2, oneapi22.3, miniconda3, cmake3.26.3. My build: image image. I am not sure whether I have answered you clearly. You can ask me for more detailed files.

  2. No, I am not familiar with the toolchain method.

AroundPeking avatar Sep 08 '23 14:09 AroundPeking

@AroundPeking Thanks. There are other issues about a size limit, #2002 and #2044, that might be related to this problem.

hongriTianqi avatar Sep 09 '23 07:09 hongriTianqi

At which stage did your task fail? I tested it on my machine and the test task survived for more than 2 SCF cycles. The ABACUS build I tested was compiled with gcc-13.2.0, elpa-2023.05.001, openblas-0.3.23, openmpi-4.1.1 and fftw-3.3.10. It seems you are using elpa-2021.11.001 and the Intel compiler. Could you try a newer version of ELPA while keeping the rest of your environment unchanged?

yizeyi18 avatar Jan 31 '24 08:01 yizeyi18

We tried v3.7.5 and found that this error disappears.

WHUweiqingzhou avatar Sep 14 '24 05:09 WHUweiqingzhou