Reproducing CI

Open foxtran opened this issue 1 year ago • 7 comments

Currently (as of Feb 5, 2025), there are several flaky CI tests.

Let's look at one of them: https://github.com/grimme-lab/xtb/actions/runs/13165363016/job/36744070869?pr=1180.

After opening it, you will find something like this:

Image

which, according to the first lines, is related to the gfnff tests:

Image

So, our goal is to reproduce this error. Let's build the binary. The build instructions for the failed job are here: https://github.com/grimme-lab/xtb/blob/5f7a2e245de45f5d09db445a35ab929d34228be7/.github/workflows/fortran-build.yml#L40-L52

I'm using a hand-built gfortran-14 on RHEL 8 on the x86_64 architecture with MKL, while the CI image has Ubuntu 24.04, gfortran-12, and OpenBLAS. Anyway:

meson setup reproduce_CI --buildtype=debug --warnlevel=0 -Db_coverage=true -Dlapack=mkl
meson compile -C reproduce_CI

You will see a lot of compilation warnings, as usual, and at the end you should have a new build of xtb. Now it is time to run the tests:

meson test -C reproduce_CI --print-errorlogs --no-rebuild -t 120 --suite xtb

And then you can see:

Ok:                 32
Expected Fail:      1
Fail:               0
Unexpected Pass:    0
Skipped:            0
Timeout:            0

OK, it works, you may say. However, that is not the whole story. During testing, meson sets some environment variables to random values. For us, the most important one is MALLOC_PERTURB_. Now check which value it had in the failed job. You should find 255.
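If the log is long, a small filter can pull the value out. This is only a sketch: `perturb_values` is a hypothetical helper name, and it assumes the log was saved to a local file in which the variable appears literally as `MALLOC_PERTURB_=<number>`.

```shell
# Hypothetical helper: list the distinct MALLOC_PERTURB_ settings found in a log file.
# Assumes the variable appears in the log text as MALLOC_PERTURB_=<number>.
perturb_values() {
  grep -o 'MALLOC_PERTURB_=[0-9][0-9]*' "$1" | sort -u
}

# Example (assumed path; point it at wherever the log was saved):
# perturb_values reproduce_CI/meson-logs/testlog.txt
```

For a local run, meson writes its test log to `meson-logs/testlog.txt` inside the build directory. For the CI job, the log would have to be downloaded first.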

Now, let's rerun only the failed test with this variable set:

MALLOC_PERTURB_=255 reproduce_CI/test/unit/tester gfnff

Wait a little bit... And see:

Error termination. Backtrace:
#0  0xb626fe in __testdrive_MOD_escalate_error
	at ../subprojects/test-drive/src/testdrive.F90:1913
#1  0xb628f1 in __testdrive_MOD___final_testdrive_Error_type
	at ../subprojects/test-drive/src/testdrive.F90:1964
#2  0x4f8c0e in test_gfnff_pbc
	at ../test/unit/test_gfnff.f90:751
#3  0x40a394 in run_unittest
	at ../test/unit/main.f90:169
#4  0x40a394 in run_testsuite
	at ../test/unit/main.f90:149
#5  0x40b63e in tester
	at ../test/unit/main.f90:103
#6  0x4080f7 in main
	at ../test/unit/main.f90:20

Hooray! We reproduced CI!
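Some background on why the value matters: glibc documents that a nonzero MALLOC_PERTURB_ makes malloc fill newly allocated memory with the complement of that byte and freed memory with the byte itself. A test that reads uninitialized or freed memory can therefore pass with one value and fail with another, which is presumably what happens here. If the failing value is not known in advance, a sweep over a few values may help. The sketch below is only an illustration: `sweep_perturb` is a hypothetical helper and the list of values is arbitrary.

```shell
# Hypothetical helper: run the given command once per MALLOC_PERTURB_ value
# and report which values make it exit with a non-zero status.
sweep_perturb() {
  for value in 1 85 170 255; do
    if MALLOC_PERTURB_="$value" "$@" >/dev/null 2>&1; then
      echo "MALLOC_PERTURB_=$value: ok"
    else
      echo "MALLOC_PERTURB_=$value: FAILED"
    fi
  done
}

# Example, using the tester path from the command above:
# sweep_perturb reproduce_CI/test/unit/tester gfnff
```

The command output is discarded to keep the summary short. Rerun a failing value by hand, as above, to see the backtrace.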

foxtran avatar Feb 05 '25 20:02 foxtran

That's neat, thank you for taking time to explain this:)

Albkat avatar Feb 05 '25 20:02 Albkat

It is a tip, not a bug :(

foxtran avatar Feb 05 '25 21:02 foxtran

> It is a tip, not a bug :(

Haha, sorry, it's just for us internally to keep this issue on the to-do list before the next release. I’ve changed it to a task :)

Albkat avatar Feb 05 '25 21:02 Albkat

You can pin issues :)

foxtran avatar Feb 05 '25 21:02 foxtran

@grimme-lab/xtb, I think this should be our current priority so that stacked PRs can be merged before v6.7.2.

Since we are drastically changing our codebase in v7.0.0, it is important to have a stable version before making such changes.

Albkat avatar Feb 10 '25 21:02 Albkat

With https://github.com/tblite/tblite/pull/230 and #1204, CI should no longer fail from time to time :)

foxtran avatar Feb 28 '25 00:02 foxtran

Ah.. We still have a couple of bugs in the cpcm-x lib :(

foxtran avatar Mar 03 '25 18:03 foxtran