GFN-FF timing information, optimization for MD
I am running ONIOM molecular dynamics with GFN-FF as the low level and GFN2-xTB as the high level in ALPB solvent. The total system is ~5000 atoms and the SQM region is ~60 atoms, running on 24 CPU cores. The timings reported in the xtb logs for each energy+gradient evaluation are quite fast (0.003 s for GFN-FF, 0.4 s for GFN2-xTB), yet in practice each MD step takes ~5 s and, from my testing, seems to be dominated by the GFN-FF energy+gradient time. The GFN-FF publication suggests roughly 1 s per step for a similar system, and I am using more resources than that. I would be very happy with 0.5-1 s per step on these resources. Are there any further optimizations or tricks I can pursue to speed up my MD simulations, or at least to establish an accurate baseline here? Any help would be greatly appreciated!
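For context, here is a minimal sketch of the kind of job I am timing; the file names, inner-region index list, and the $md values are placeholders rather than my production settings:

```bash
# Sketch of the benchmark job (placeholder file names and MD settings).
cat > md.inp << 'EOF'
$md
   temp=300.0
   time=10.0
   step=2.0
   shake=1
   dump=100.0
$end
EOF

export OMP_NUM_THREADS=24 MKL_NUM_THREADS=24 OMP_STACKSIZE=20G

xtb my_system.pdb --oniom gfn2:gfnff my_inner_region_indices \
    --alpb water --md --input md.inp > md.out 2>&1
```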
I get a CPU efficiency of about 80% on 24 CPU cores for pure GFN-FF on an 800-atom system. A crude way to check it is to run your MD for a couple of minutes under the `time` command and get something like:
real 0m50.756s
user 16m47.485s
sys 0m1.725s
Then divide the user time by the real time, and divide the result by the number of CPUs. For me that is (16*60+47)/50/24 = 0.839. Not bad, but it could be better.
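If it helps, here is a minimal scripted version of the same check (the xtb command and file names are placeholders; substitute your own benchmark job):

```bash
ncores=24
export OMP_NUM_THREADS=$ncores MKL_NUM_THREADS=$ncores

# `time -p` prints real/user/sys in plain seconds on the stderr of the brace group.
{ time -p xtb my_system.pdb --gfnff --md > bench.out 2>&1 ; } 2> time.log

# CPU efficiency = user CPU time / (wall-clock time * number of cores)
awk -v n="$ncores" '/^real/ { r = $2 } /^user/ { u = $2 }
     END { printf "CPU efficiency: %.3f\n", u / (r * n) }' time.log
```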
Could you please do the same trick for your calc?
Thank you for your reply - this is a helpful start. My system reports high CPU utilization, similar to yours at ~80%, but the time to evaluate GFN-FF on the whole system (the bottleneck) barely improves with more cores. The runtime is essentially the same whether I use 4 cores or 24 cores (around 7 seconds). In other words, GFN-FF on the whole system appears to be using the cores but not actually benefiting from them.
Could you please share some details about the environment variables and xtb build instructions you use, which I can pass on to my computing center specialists? Here is an example job I am benchmarking with (xtb version 6.6.1), where ncores is 24 or 4:
```bash
ncores=24

export OMP_STACKSIZE=20G
export OMP_NUM_THREADS=$ncores
export MKL_NUM_THREADS=$ncores
export OMP_MAX_ACTIVE_LEVELS=1

xtb my_6000_atoms.pdb --oniom gfn2:gfnff my_inner_region_indices --alpb water --verbose
```
It looks like xtb spends almost all of its time on data sharing/context switching between threads, especially for larger numbers of threads.
For 4 cores, I have:
real 1m17.868s
user 4m10.939s
sys 0m2.928s
Again, with 80% CPU efficiency.
So, you can try to find an optimal number of threads. I always use 4 threads and instead submit more tasks to occupy the whole node.
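One way to do that scan is to rerun the same benchmark at several thread counts and compare wall times, e.g. (reusing your command with placeholder file names):

```bash
# Time the same benchmark job at several thread counts.
for n in 1 2 4 8 16 24; do
    export OMP_NUM_THREADS=$n MKL_NUM_THREADS=$n
    start=$SECONDS
    xtb my_6000_atoms.pdb --oniom gfn2:gfnff my_inner_region_indices \
        --alpb water > scan_${n}threads.out 2>&1
    echo "${n} threads: $((SECONDS - start)) s"
done
```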
OK, this is good to know. I guess I should use 4 or fewer cores for my workload. In case it helps for benchmarking/knowledge purposes, here are some detailed timings for the GFN-FF energy/gradient evaluation:
| Component | 1 Core | 4 Cores | 8 Cores | 24 Cores |
|---|---|---|---|---|
| E+G (total) | 10.915s | 7.570s | 8.641s | 7.323s |
| Distance/D3 list | 0.105s | 0.104s | 0.106s | 0.104s |
| Non-bonded repulsion | 0.146s | 0.047s | 0.026s | 0.010s |
| dCN | 0.396s | 0.384s | 0.446s | 0.392s |
| EEQ energy and q | 2.500s | 1.035s | 1.368s | 1.112s |
| D3 | 2.213s | 1.621s | 2.185s | 1.561s |
| EEQ gradient | 1.175s | 0.916s | 0.867s | 0.854s |
| Bonds | 0.355s | 0.261s | 0.327s | 0.278s |
| Bend and torsion | 0.007s | 0.002s | 0.001s | 0.002s |
| Bonded ATM | 0.008s | 0.002s | 0.001s | 0.001s |
| HB/XB (incl. list setup) | 1.405s | 1.000s | 1.180s | 0.895s |
| GBSA | 3.431s | 2.976s | 2.925s | 2.916s |
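For quick reference, here are the implied parallel speedups (1-core time divided by 24-core time) from the numbers above, computed with a small helper:

```bash
# 1-core / 24-core speedup per component; values copied from the table above.
awk -F';' '{ printf "%-28s %4.1fx\n", $1, $2 / $3 }' << 'EOF'
E+G (total);10.915;7.323
Non-bonded repulsion;0.146;0.010
EEQ energy and q;2.500;1.112
D3;2.213;1.561
EEQ gradient;1.175;0.854
HB/XB (incl. list setup);1.405;0.895
GBSA;3.431;2.916
EOF
```

Only the non-bonded repulsion scales well; GBSA, D3, and the EEQ terms barely speed up, which matches where the wall time goes.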
If I have time I can look into the implementation, but is there any particular reason why some terms in the Hamiltonian benefit from threading while others do not?
I did not update the D3 part in #1178, so it will still have poor parallelization. It looks like the EEQ energy and q term is not well parallelized either (see the 8-core result), and the same goes for HB/XB. I'd say that GBSA is not parallelized at all (and it takes far too much time in your case).
Good to know. I also notice that the SHAKE algorithm dominates the memory usage: shake=2 (all bonds) segfaults with 32 GB of memory, while shake=1 or 0 seem fine even with a modest 16 GB. Is this what you would expect? In your opinion, how challenging would it be to parallelize the bottleneck terms like D3, GBSA, and EEQ? I may try to do so myself.
For D3 and EEQ, we have library implementations with much better code quality, and I am fairly certain that their parallelization is also much better. In the long term, we plan to replace the separate implementations in xtb with these libraries, but this will obviously take some time.