GFN-FF timing information, optimization for MD
I am running ONIOM molecular dynamics with GFN-FF as the low level and GFN2-xTB as the high level in ALPB solvent. The total system is ~5000 atoms and the SQM region is ~60 atoms, running on 24 CPU cores. The timings reported in the xtb logs for each energy+gradient evaluation are quite fast (0.003 s for GFN-FF, 0.4 s for GFN2-xTB), yet in practice each MD step takes ~5 s and, from my testing, seems to be dominated by the GFN-FF energy+gradient time. The GFN-FF publication suggests roughly 1 s per step for a similar system, and I am using more resources than that. I would be very happy with 0.5-1 s per step on these resources. Are there any further optimizations or tricks I can pursue to speed up my MD simulations, or at least to establish an accurate baseline here? Any help would be greatly appreciated!
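For context, here is a minimal sketch of the kind of job I am timing; the file names, inner-region index list, and the $md values are placeholders rather than my production settings:

```bash
# Sketch of the benchmark job (placeholder file names and MD settings).
cat > md.inp << 'EOF'
$md
   temp=300.0
   time=10.0
   step=2.0
   shake=1
   dump=100.0
$end
EOF

export OMP_NUM_THREADS=24 MKL_NUM_THREADS=24 OMP_STACKSIZE=20G

xtb my_system.pdb --oniom gfn2:gfnff my_inner_region_indices \
    --alpb water --md --input md.inp > md.out 2>&1
```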
I get a CPU efficiency of about 80% on 24 CPU cores for pure GFN-FF on an 800-atom system. A crude way to check it is to run your MD for a couple of minutes under the `time` command and get something like:
real 0m50.756s
user 16m47.485s
sys 0m1.725s
Then divide the user time by the real time, and divide the result by the number of CPUs. For me that is (16*60+47)/50/24 = 0.839. Not bad, but it could be better.
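If it helps, here is a minimal scripted version of the same check (the xtb command and file names are placeholders; substitute your own benchmark job):

```bash
ncores=24
export OMP_NUM_THREADS=$ncores MKL_NUM_THREADS=$ncores

# `time -p` prints real/user/sys in plain seconds on the stderr of the brace group.
{ time -p xtb my_system.pdb --gfnff --md > bench.out 2>&1 ; } 2> time.log

# CPU efficiency = user CPU time / (wall-clock time * number of cores)
awk -v n="$ncores" '/^real/ { r = $2 } /^user/ { u = $2 }
     END { printf "CPU efficiency: %.3f\n", u / (r * n) }' time.log
```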
Could you please do the same trick for your calc?
Thank you for your reply - this is a helpful start. My system reports high CPU utilization, similar to yours at ~80%, but the time to evaluate GFN-FF on the whole system (the bottleneck) barely improves with more cores. The runtime is essentially the same whether I use 4 cores or 24 cores (around 7 seconds). In other words, GFN-FF on the whole system appears to be using the cores but not actually benefiting from them.
Could you please share some details about the environment variables and xtb build instructions you use, which I can pass on to my computing center specialists? Here is an example job I am benchmarking with (xtb version 6.6.1), where ncores is 24 or 4:
```bash
ncores=24

export OMP_STACKSIZE=20G
export OMP_NUM_THREADS=$ncores
export MKL_NUM_THREADS=$ncores
export OMP_MAX_ACTIVE_LEVELS=1

xtb my_6000_atoms.pdb --oniom gfn2:gfnff my_inner_region_indices --alpb water --verbose
```
It looks like xtb spends almost all of its time on data sharing/context switching between threads, especially for larger numbers of threads.
For 4 cores, I have:
real 1m17.868s
user 4m10.939s
sys 0m2.928s
Again, with 80% CPU efficiency.
So, you can try to find an optimal number of threads. I always use 4 threads and instead submit more tasks to occupy the whole node.
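One way to do that scan is to rerun the same benchmark at several thread counts and compare wall times, e.g. (reusing your command with placeholder file names):

```bash
# Time the same benchmark job at several thread counts.
for n in 1 2 4 8 16 24; do
    export OMP_NUM_THREADS=$n MKL_NUM_THREADS=$n
    start=$SECONDS
    xtb my_6000_atoms.pdb --oniom gfn2:gfnff my_inner_region_indices \
        --alpb water > scan_${n}threads.out 2>&1
    echo "${n} threads: $((SECONDS - start)) s"
done
```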
OK, this is good to know. I guess I should use 4 or fewer cores for my workload. In case it helps for benchmarking/knowledge purposes, here are some detailed timings for the GFN-FF energy/gradient evaluation:
| Component | 1 Core | 4 Cores | 8 Cores | 24 Cores |
|---|---|---|---|---|
| E+G (total) | 10.915s | 7.570s | 8.641s | 7.323s |
| Distance/D3 list | 0.105s | 0.104s | 0.106s | 0.104s |
| Non-bonded repulsion | 0.146s | 0.047s | 0.026s | 0.010s |
| dCN | 0.396s | 0.384s | 0.446s | 0.392s |
| EEQ energy and q | 2.500s | 1.035s | 1.368s | 1.112s |
| D3 | 2.213s | 1.621s | 2.185s | 1.561s |
| EEQ gradient | 1.175s | 0.916s | 0.867s | 0.854s |
| Bonds | 0.355s | 0.261s | 0.327s | 0.278s |
| Bend and torsion | 0.007s | 0.002s | 0.001s | 0.002s |
| Bonded ATM | 0.008s | 0.002s | 0.001s | 0.001s |
| HB/XB (incl. list setup) | 1.405s | 1.000s | 1.180s | 0.895s |
| GBSA | 3.431s | 2.976s | 2.925s | 2.916s |
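For quick reference, here are the implied parallel speedups (1-core time divided by 24-core time) from the numbers above, computed with a small helper:

```bash
# 1-core / 24-core speedup per component; values copied from the table above.
awk -F';' '{ printf "%-28s %4.1fx\n", $1, $2 / $3 }' << 'EOF'
E+G (total);10.915;7.323
Non-bonded repulsion;0.146;0.010
EEQ energy and q;2.500;1.112
D3;2.213;1.561
EEQ gradient;1.175;0.854
HB/XB (incl. list setup);1.405;0.895
GBSA;3.431;2.916
EOF
```

Only the non-bonded repulsion scales well; GBSA, D3, and the EEQ terms barely speed up, which matches where the wall time goes.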
If I have time I can look into the implementation, but is there any particular reason why some terms in the Hamiltonian benefit from threading while others do not?
I did not update the D3 part in #1178, so it will still have poor parallelization. It looks like the EEQ energy and q term is not well parallelized either (see the 8-core result), and the same goes for HB/XB. I'd say that GBSA is not parallelized at all (and it takes far too much time in your case).
Good to know. I also notice that the SHAKE algorithm dominates the memory usage: shake=2 (all bonds) segfaults with 32 GB of memory, while shake=1 or 0 seem fine even with a modest 16 GB. Is this what you would expect? In your opinion, how challenging would it be to parallelize the bottleneck terms like D3, GBSA, and EEQ? I may try to do so myself.
For D3 and EEQ, we have library implementations with much better code quality, and I am fairly certain that their parallelization is also much better. In the long term, we plan to replace the separate implementations in xtb with these libraries, but this will obviously take some time.