
Fifty integrated test cases failed on Intel env

Open hongriTianqi opened this issue 1 year ago • 5 comments

Describe Current Status and Possible Solution

The following 50 test cases failed on an Intel machine.

1: 107_PW_outWfcR
1: 107_PW_W90
1: 109_PW_CR_fix_abc
1: 111_PW_elec_add
1: 116_PW_scan_Si2
1: 116_PW_scan_Si2_nspin2
1: 127_PW_15_PK_AF
1: 150_PW_15_CR_VDW3
1: 184_PW_BNDKPAR_SDFT_ALL
1: 184_PW_BNDKPAR_SDFT_MALL
1: 201_NO_KP_DJ_CF_CS_GaAs
1: 201_NO_KP_DJ_Si
1: 208_NO_KP_CS_CR
1: 213_NO_mulliken
1: 215_NO_sol_H2
1: 216_NO_scan_Si2
1: 250_NO_KP_CR_VDW2
1: 250_NO_KP_CR_VDW3
1: 250_NO_KP_CR_VDW3ABC
1: 250_NO_KP_CR_VDW3BJ
1: 281_NO_KP_HSE
1: 282_NO_KP_HSE_complex
1: 283_NO_KP_HF
1: 284_NO_KP_PBE0
1: 285_NO_KP_RE_HSE
1: 286_NO_KP_CR_HSE
1: 283_NO_restart
1: 307_NO_GO_OH
1: 381_NO_GO_S1_HSE
1: 382_NO_GO_S2_HSE
1: 383_NO_GO_SO_HSE
1: 384_NO_GO_S1_HSE_loop0_PU
1: 385_NO_GO_RE_S1_HSE
1: 386_NO_GO_MD_S1_HSE
1: 601_NO_TDDFT_CO_occ
1: 801_PW_LT_sc
1: 802_PW_LT_fcc
1: 803_PW_LT_bcc
1: 804_PW_LT_hexagonal
1: 805_PW_LT_trigonal
1: 806_PW_LT_st
1: 807_PW_LT_bct
1: 808_PW_LT_so
1: 809_PW_LT_baco
1: 810_PW_LT_fco
1: 811_PW_LT_bco
1: 812_PW_LT_sm
1: 813_PW_LT_bacm
1: 814_PW_LT_triclinic
1: 824_NO_LT_fco

Additional Context

No response

hongriTianqi avatar Aug 18 '23 01:08 hongriTianqi

Details are in the attached text file, unit_test_intel.txt. Each case needs to be checked individually to identify its specific issue.

hongriTianqi avatar Aug 18 '23 01:08 hongriTianqi

@hongriTianqi I studied the first case, "107_PW_outWfcR", in detail and found that it is probably mainly a format / post-processing issue.

An integrated test invoked by Autotest.sh (1) calls abacus to run a job, (2) uses the bash script tools/catch_properties.sh to extract and calculate some values, and (3) compares those values with the ones in result.ref.

In "107_PW_outWfcR", abacus finishes the job normally; the problem is in step 2, where tools/catch_properties.sh reads OUT.Autotest/running_scf.log to get the total number of FFT grid points (line 256):

image

The variable "allgrid" is zero after line 256, which makes the subsequent calculation on line 259 yield inf. This is definitely not what we want, but the script is faithfully doing what the command asks. The following figure shows the place in running_scf.log from which the number is calculated:

image

As we can see, by using "=" and "," as delimiters, field 2 is "[ 24" instead of "24", thereby causing $2*$3*$4 to be 0.

I notice that the format in running_scf.log was recently changed in https://github.com/deepmodeling/abacus-develop/pull/2605. Prior to this change, the same information was displayed as

image

In this situation, $2*$3*$4 would yield the right number.
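The difference between the two formats can be reproduced with a small sketch. The two sample lines below are reconstructed from the description above, not copied from a real running_scf.log, so the wording, spacing, and the value 24 in all three directions are assumptions. The helper `grid_product` is a hypothetical name and is not the actual command at line 256 of tools/catch_properties.sh.

```shell
# Hypothetical log lines reconstructed from the description in this thread;
# the real running_scf.log text may differ in wording and spacing.
new_format='fft grid for wave functions = [ 24, 24, 24 ]'
old_format='fft grid for wave functions = 24, 24, 24'

# Split the line on "=" and "," and multiply fields 2, 3 and 4.
grid_product() {
  printf '%s\n' "$1" | awk -F'[=,]' '{ print $2 * $3 * $4 }'
}

# Field 2 is " [ 24"; awk converts a string with no leading number to 0.
grid_product "$new_format"   # prints 0

# Fields 2-4 are " 24", " 24", " 24", which convert cleanly.
grid_product "$old_format"   # prints 13824
```

awk does not report an error when a field fails to parse as a number; it silently treats it as 0, which is why the failure only shows up later as inf.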

This test case can be fixed either by changing the output format of "fft grid for wave functions" back to the previous one or by changing the awk command in the post-processing script. However, I'm not able to make the decision, because either choice could cause other tests to fail. The decision should be left to people who can assess the consequences of such a modification. @dyzheng @hongriTianqi
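For the second option, one possible shape of a format-tolerant extraction is sketched here. It is only an illustration under assumptions: the sample lines are reconstructed from the description in this comment, `grid_product_tolerant` is a hypothetical helper name, and the real command in tools/catch_properties.sh may need a different change. It does not address whether other tests depend on the same log line.

```shell
# Hypothetical log lines; the real running_scf.log text may differ.
bracketed='fft grid for wave functions = [ 24, 24, 24 ]'
plain='fft grid for wave functions = 24, 24, 24'

# Split on "=" and ",", then strip every non-digit character from
# fields 2-4 before multiplying, so brackets and spaces are ignored.
grid_product_tolerant() {
  printf '%s\n' "$1" | awk -F'[=,]' '{
    for (i = 2; i <= 4; i++) gsub(/[^0-9]/, "", $i)
    print $2 * $3 * $4
  }'
}

grid_product_tolerant "$bracketed"   # prints 13824
grid_product_tolerant "$plain"       # prints 13824
```

Stripping non-digits works here because grid counts are positive integers; it would not be safe for signed or fractional values.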

jinzx10 avatar Aug 26 '23 06:08 jinzx10

  • [x] Understand the problem or question described by the user.
  • [x] Check if the issue is a known problem or has been addressed in the documentation.
  • [x] Test the issue or problem on a similar system or environment, if possible.
  • [ ] Identify the root cause or provide clarification on the user's question.
  • [ ] Provide a step-by-step guide, including any necessary resources, to resolve the issue or answer the question.
  • [ ] If the issue is related to documentation, update the documentation to prevent future confusion (optional).
  • [ ] If the issue is related to code, consider implementing a fix or improvement (optional).
  • [ ] Review and incorporate any relevant feedback from users or developers.
  • [ ] Ensure the user's issue is resolved or their question is answered and close the ticket.

hongriTianqi avatar Sep 08 '23 04:09 hongriTianqi

Update on 2023/11/01: there are now 40 failed test cases on the Intel machine:

107_PW_outWfcR

[WARNING ] variance_wfc_r_0_0 cal=inf ref=0.31340000 deviation=0.31340000

107_PW_W90

Compare Error: line 4, column 4
1: diamond.amn: 0.103213607284
1: OUT.autotest/diamond.amn: -0.103213607298
1: [WARNING ] CompareAMN_pass cal=1.00000000 ref=0.00000000 deviation=-1.00000000

109_PW_CR_fix_abc

totalstressref cal=358.02119700 ref=358.01774800 deviation=-0.00344900

111_PW_elec_add

totalstressref cal=2329.02942100 ref=2329.02954500 deviation=0.00012400

116_PW_scan_Si2

[WARNING ] etotref cal=-204.07304372 ref=-204.11142252 deviation=-0.03837880

116_PW_scan_Si2_nspin2

[WARNING ] etotref cal=-204.07300485 ref=-204.11317477 deviation=-0.04016992

127_PW_15_PK_AF

[WARNING ] etotref cal=-6141.07713468 ref=-6141.07775057 deviation=-0.00061589

150_PW_15_CR_VDW3

[WARNING ] totalforceref cal=0.86635600 ref=0.85304200 deviation=-0.01331400

184_PW_BNDKPAR_SDFT_ALL

[WARNING ] totalforceref cal=197.98036000 ref=197.98018600 deviation=-0.00017400

184_PW_BNDKPAR_SDFT_MALL

[WARNING ] totalforceref cal=197.98010600 ref=197.97993000 deviation=-0.00017600

201_NO_KP_DJ_CF_CS_GaAs

[WARNING ] totalforceref cal=144.90600000 ref=144.90617600 deviation=0.00017600

201_NO_KP_DJ_Si

[WARNING ] etotref cal=-227.56204871 ref=-227.60068424 deviation=-0.03863553

208_NO_KP_CS_CR

[WARNING ] totalstressref cal=340.58392200 ref=340.58455200 deviation=0.00063000

213_NO_mulliken

[WARNING ] totalforceref cal=4.74220400 ref=4.74258800 deviation=0.00038400

215_NO_sol_H2

[WARNING ] etotref cal=-32.74207985 ref=-32.74077066 deviation=0.00130919

216_NO_scan_Si2

[WARNING ] etotref cal=-203.91086440 ref=-203.96022423 deviation=-0.04935983

250_NO_KP_CR_VDW2

[WARNING ] etotref cal=-4262.64283604 ref=-4262.70382264 deviation=-0.06098660

250_NO_KP_CR_VDW3

[WARNING ] etotref cal=-4262.55496142 ref=-4262.61594802 deviation=-0.06098660

250_NO_KP_CR_VDW3ABC

[WARNING ] etotref cal=-447.44343208 ref=-447.51198935 deviation=-0.06855727

250_NO_KP_CR_VDW3BJ

[WARNING ] etotref cal=-4262.73795480 ref=-4262.79894140 deviation=-0.06098660

281_NO_KP_HSE

[WARNING ] totalstressref cal=1076.56558700 ref=1076.56529600 deviation=-0.00029100

283_NO_restart

[WARNING ] totalforceref cal=0.46163100 ref=0.46119300 deviation=-0.00043800

307_NO_GO_OH

[WARNING ] etotref cal=-204.59803974 ref=-204.59557974 deviation=0.00246000

601_NO_TDDFT_CO

[WARNING ] totalstressref cal=26.91672400 ref=26.91655700 deviation=-0.00016700

601_NO_TDDFT_CO_occ

[WARNING ] etotref cal=-602.93267956 ref=-602.93251189 deviation=0.00016767

801_PW_LT_sc

[WARNING ] totalstressref cal=32.28647100 ref=32.28615700 deviation=-0.00031400

802_PW_LT_fcc

[WARNING ] totalstressref cal=298.14568400 ref=298.15201200 deviation=0.00632800

803_PW_LT_bcc

[WARNING ] totalstressref cal=84.02774600 ref=84.02884300 deviation=0.00109700

804_PW_LT_hexagonal

[WARNING ] totalforceref cal=6.62548000 ref=6.62535800 deviation=-0.00012200

805_PW_LT_trigonal

[WARNING ] totalstressref cal=49.77693000 ref=49.77765100 deviation=0.00072100

806_PW_LT_st

[WARNING ] totalstressref cal=16.93875500 ref=16.93934600 deviation=0.00059100

807_PW_LT_bct

[WARNING ] totalstressref cal=33.62608300 ref=33.62753100 deviation=0.00144800

808_PW_LT_so

[WARNING ] totalforceref cal=6.79467800 ref=6.79456600 deviation=-0.00011200

809_PW_LT_baco

[WARNING ] totalforceref cal=6.79467800 ref=6.79456600 deviation=-0.00011200

810_PW_LT_fco

[WARNING ] etotref cal=-30.33829951 ref=-30.44940902 deviation=-0.11110951

811_PW_LT_bco

[WARNING ] totalforceref cal=6.54382800 ref=6.54372400 deviation=-0.00010400

812_PW_LT_sm

[WARNING ] totalstressref cal=10.81017700 ref=10.80996800 deviation=-0.00020900

813_PW_LT_bacm

[WARNING ] totalstressref cal=21.19969600 ref=21.19886200 deviation=-0.00083400

814_PW_LT_triclinic

[WARNING ] totalstressref cal=11.47930500 ref=11.47893300 deviation=-0.00037200

824_NO_LT_fco

[WARNING ] etotref cal=-31.39417040 ref=-31.69859888 deviation=-0.30442848

hongriTianqi avatar Nov 01 '23 04:11 hongriTianqi

Aotogetwarn.txt I am sharing a bash script that collects the warnings automatically. Using it takes two steps. First, download the integrated test log file from GitHub; the log files of some recent PRs are available for download. Then run the bash script on the log to get the warnings.
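As a rough illustration of the second step, a warning extractor could look like the sketch below. This is not the attached script. It assumes that warning lines in the downloaded log have the same "[WARNING ] name cal=... ref=... deviation=..." shape as the lines quoted earlier in this thread, and `extract_warnings` is a hypothetical function name.

```shell
# Read a test log on stdin and print "name cal ref deviation" for every
# line that contains a [WARNING ...] entry with cal=/ref=/deviation= fields.
extract_warnings() {
  awk 'index($0, "[WARNING") > 0 {
    name = ""; cal = ""; ref = ""; dev = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^cal=/) { cal = substr($i, 5); name = $(i - 1) }
      else if ($i ~ /^ref=/) ref = substr($i, 5)
      else if ($i ~ /^deviation=/) dev = substr($i, 11)
    }
    if (cal != "") print name, cal, ref, dev
  }'
}

# Example with one warning line taken from this thread and one unrelated line.
printf '%s\n' \
  '1: [WARNING ] etotref cal=-204.07304372 ref=-204.11142252 deviation=-0.03837880' \
  '1: some unrelated output' | extract_warnings
```

With a downloaded log saved locally, the function would be fed the file on standard input, for example `extract_warnings < integrated_test.log`, where the file name is a placeholder.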

Zhuxuegang2022 avatar Nov 22 '23 07:11 Zhuxuegang2022

@pxlxingliang could you update the result?

WHUweiqingzhou avatar Aug 22 '24 07:08 WHUweiqingzhou

see #4985

WHUweiqingzhou avatar Aug 22 '24 10:08 WHUweiqingzhou