DFTK.jl
Support for NC pseudos in UPF format
Based on @unkcpz's PR.
To do:
- [ ] Non-linear core correction?
- [ ] Implement initial guesses
  - [ ] Charge density from atomic charge density (`PP_RHO`)
  - [ ] Wavefunction from pseudo-atomic wave functions (`PP_CHI`)
- [ ] Improve documentation
  - [ ] Example script
  - [ ] Add a page on pseudopotentials
    - [ ] Formulas
    - [ ] Limitations
    - [ ] References
  - [x] Docstrings
- [ ] Improve performance
  - [ ] Parallelize over projectors / G-vectors?
  - [ ] Calculate only for unique values of `|G|`?
  - [x] Precompute common integrand terms
  - [x] Create interpolators at parse time
- [x] Clean up unit handling
- [x] Settle on an integrator / integrator interface (Not needed)
- [x] Settle on local potential correction / correction interface (Only Coulomb)
- [x] Add / remove included UPF pseudopotentials (Kept a few for testing)
- [x] Register `PseudoPotentialIO`
- [x] Improve testing
  - [x] Figure out why `quadgk` references used for HGH don't agree well with UPFs (Needed correction term)
  - [ ] Refine HGH comparison tolerances
> If the `rab` are already the integration weights, why do you do a quadrature (`ctrap`) rather than just a dot product?
I wasn't sure if they account for the half-weight at zero. I did a bit of testing comparing the dot product and the `ctrap` with the half-weight, and they seem to be in similar agreement with HGH.
I've changed to using a dot product; it's probably a good step for better performance, assuming it's the right thing to do.
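For reference, here is a minimal sketch of the two variants being compared (the helper names are hypothetical; `rab` denotes the UPF integration weights, and the "ctrap"-like version simply half-weights the end points):

```julia
# Integrand `f` already evaluated on the radial mesh; `rab` are the UPF weights (dr/di).
integrate_dot(f, rab) = sum(f .* rab)   # plain dot product with the mesh weights

# Trapezoid-like variant that half-weights the first and last mesh points
function integrate_trap(f, rab)
    s = (f[1] * rab[1] + f[end] * rab[end]) / 2
    for i in 2:(length(f) - 1)
        s += f[i] * rab[i]
    end
    s
end
```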
It looks like ~78% of the time initializing the `PlaneWaveBasis` for small `Ecut` and few k-points with UPFs is taken up by `_besselj`, which is a call to Amos. Allocations could still be improved.
```julia
using DFTK, BenchmarkTools

function init_si(pseudo_type, n)
    a = 10.26  # Silicon lattice constant in Bohr
    lattice = a / 2 * [[0 1 1.];
                       [1 0 1.];
                       [1 1 0.]]
    Si = ElementPsp(:Si, psp=load_psp("$pseudo_type/pbe/si-q4.$pseudo_type"))
    atoms = [Si, Si]
    positions = [ones(3)/8, -ones(3)/8]
    model = model_PBE(lattice, atoms, positions)
    for _ = 1:n-1
        PlaneWaveBasis(model; Ecut=15, kgrid=[4, 4, 4])
    end
    PlaneWaveBasis(model; Ecut=15, kgrid=[4, 4, 4])
end

basis = @btime init_si("upf", 1);  # benchmark
# 6.753 s (26559047 allocations: 865.72 MiB)
```
Initializing the `PlaneWaveBasis` for, e.g., `Ecut=60`, `kgrid=[12, 12, 12]` takes quite a few minutes. Maybe we need to implement an approximate but optimized `sphericalbesselj` like what is done in QE?
I'm waiting on the profiling + benchmarking for this setup, but I'm guessing that reducing allocations is also going to be a top priority to get this down to a reasonable time.
If the profiling clearly points to the `besselj` function, working on allocations isn't going to do anything. Note https://discourse.julialang.org/t/ann-bessels-jl-bessel-hankel-and-airy-functions/87941, but that's only a factor of 2. Note though that these packages are for general (non-integer) order, and the integer-order Bessel functions are considerably simpler to compute: https://en.wikipedia.org/wiki/Bessel_function#Spherical_Bessel_functions:_jn,_yn . So at worst we could just implement them by hand for low order (only l=3 is needed, right?).
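For reference, the closed forms for the low orders (up to l = 3) are short enough to hand-code; a sketch only, and these expressions lose accuracy for very small x, where a series expansion or an explicit limit would be needed:

```julia
# Spherical Bessel functions of the first kind j_l(x) for l = 0..3 (closed forms).
sphericalbesselj0(x) = sin(x) / x
sphericalbesselj1(x) = sin(x) / x^2 - cos(x) / x
sphericalbesselj2(x) = (3 / x^2 - 1) * sin(x) / x - 3cos(x) / x^2
sphericalbesselj3(x) = (15 / x^3 - 6 / x) * sin(x) / x - (15 / x^2 - 1) * cos(x) / x
```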
With judicious use of `@inline` and `@simd`, it should be possible to get the code to vectorize. If that still does not result in acceptable performance, we bring out the big guns (LoopVectorization).
Just did some quick tests for integrating an array of size 100000 with l=2:

- Naive version (what you have, essentially): 15 ms
- Hand-coded spherical Bessel of order 2: 1.5 ms
- Hand-coded + `@inline` + `@simd`: 1 ms
- Hand-coded + `@turbo`: 400 μs
- Hand-coded + not building `x` but just summing up the result: 300 μs

So no need to go do complicated optimizations, it looks like :-) Here's the final version:
```julia
using BenchmarkTools
using LoopVectorization

# Hand-coded spherical Bessel function j2; the `order` argument is kept for
# symmetry with a generic interface but is unused here.
@inline function sphericalbesselj2(order, x)
    (3 / x^2 - 1) * sin(x) / x - 3cos(x) / x^2
end

# Sum the integrand directly instead of building an intermediate array.
function compute(q, rgrid, drgrid, rprojs, order)
    N = length(rgrid)
    s = 0.0
    @turbo for ir = 1:N
        s += drgrid[ir] * sphericalbesselj2(order, q * rgrid[ir]) * rprojs[ir]
    end
    s
end

N = 100000
rgrid = rand(N)
drgrid = rand(N)
rprojs = rand(N)
q = 2.3
order = 2
@btime compute($q, $rgrid, $drgrid, $rprojs, $order)
```
Sorry, there was a mistake in the previous post (if you read this via email notifications), updated now.
Ok sweet, I did write up the hand-coded spherical Bessels, which was already much faster. I thought that LoopVectorization wasn't a dependency of DFTK, so I didn't give it a go.
I'll work a bit on implementing it like this with `@turbo` and `@inline` and check back in, thanks!
```julia
# For Silicon, 2-atom cell, si-q4.upf
Ecut = 80
kgrid = [12, 12, 12]

julia> basis = @benchmark PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid)  # benchmark
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 32.746 s (0.65% GC) to evaluate,
 with a memory estimate of 1.41 GiB, over 3165129 allocations.
```
Much better!
Still looks expensive but combined with q interpolation it should be ok. You can avoid the code duplication by using `Val` types and function barriers: have `order` be a `Val` type, and convert from int to `Val` in a separate function from the loop (so that the loop is compiled using the static type information for the order, and propagates it to the inlined Bessel function). There are instances of such tricks in the "performance tips" section of the Julia manual.
I think this is as you suggest(?), but it seems to break `@turbo`.
```julia
@inline function sphericalbesselj(order::Val{0}, x)
    sin(x) / x
end
@inline function sphericalbesselj(order::Val{1}, x)
    (sin(x) / x - cos(x)) / x
end
...

function psp_projector_fourier_inner_loop(order::Val, psp::PspUpf, i, l, q::T)::T where {T <: Real}
    N = length(psp.rprojs[l+1][i])
    s = zero(T)
    @turbo for ir = 1:N
        s += sphericalbesselj(order, q * psp.rgrid[ir]) * psp.rprojs[l+1][i][ir] * psp.drgrid[ir]
    end
    s
end

function eval_psp_projector_fourier(psp::PspUpf, i, l, q::T)::T where {T <: Real}
    if iszero(q)
        if l == 0
            s = T(1)
        else
            s = T(0)
        end
    else
        l_val = Val(l)
        s = psp_projector_fourier_inner_loop(l_val, psp, i, l, q)
    end
    return s
end
```
```julia
julia> PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid);  # warm up
┌ Warning: #= /data/azadoks/source/git/DFTK.jl/src/pseudo/PspUpf.jl:172 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
```
Hm, curious (and annoying). That looks like a limitation of LoopVectorization; I'm not sure how to work around it... That said, if hand-coded Bessels plus SIMD already give reasonable enough performance, we might just as well not use LV...
It seems like `@simd` also doesn't work with the `Val{l}` dispatch.

With `@simd`:
```julia
# For Silicon, 2-atom cell, si-q4.upf
Ecut = 80
kgrid = [12, 12, 12]

julia> basis = @time PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid);  # benchmark
 70.301417 seconds (29.15 M allocations: 2.576 GiB, 0.61% gc time, 0.08% compilation time)
```
With `@inbounds` + `@fastmath`:

```julia
# For Silicon, 2-atom cell, si-q4.upf
Ecut = 80
kgrid = [12, 12, 12]

julia> basis = @time PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid);  # benchmark
 69.722348 seconds (29.08 M allocations: 2.573 GiB, 0.43% gc time)
```
A glance through the LoopVectorization documentation makes me think that `@simd` and `@turbo` might give us problems down the line with dual numbers anyway. It would be a bit messy and repetitive, but I'd propose having both:

- The verbose `@turbo` version, to be used with `Float64` + `Float32`, which can be easily turbocharged without messing with array-of-structs -> array-of-numbers transforms
- The safe and succinct `Val{l}`-dispatched `@inbounds` + `@fastmath` version for everything else
DFTK does pride itself on simplicity, so I'd say let's forget about LV and just use `@simd` and `Val`. The timings you report are reasonable, I'd say. In any event, if we want more performance, we just do interpolation in q space (which, even with a very fine mesh, should result in an extremely significant boost, since we are basically sampling all of 3D q space here, and the functions should be very smooth functions of q).
Also, LV is kind of in flux and due for a rewrite from what I understood, so hopefully in a year or so we can just slap `@turbo` on the `Val` version and it'll just work.
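To make the q-space interpolation idea concrete, here is a rough sketch of what it could look like (the grid spacing, `qmax`, and the hand-rolled linear interpolation are placeholders, not a proposed interface):

```julia
# Tabulate the projector form factor once on a fine 1D q grid ...
function build_projector_table(psp, i, l; qmax=100.0, dq=0.01)
    qs = 0.0:dq:qmax
    qs, [eval_psp_projector_fourier(psp, i, l, q) for q in qs]
end

# ... then evaluate by piecewise-linear interpolation (q must lie within [0, qmax]).
function interp_projector(qs, vals, q)
    j = clamp(floor(Int, (q - first(qs)) / step(qs)) + 1, 1, length(qs) - 1)
    t = (q - qs[j]) / step(qs)
    (1 - t) * vals[j] + t * vals[j + 1]
end
```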
Another thing that could be worth trying is to parallelise over the evaluation of projectors or the evaluation of Fourier components in the local potential. This will certainly help for larger calculations on clusters.
I agree with Antoine regarding the tradeoff simplicity / speed. I think what we have now is ok in speed and I would avoid making the code more complicated if possible.
> Add / remove included UPF pseudopotentials
On that point: I think we should certainly not try to have a comprehensive UPF pseudo library in our code base. I see two options:
- Introduce a data artefact containing all sorts of pseudo libraries (that could then be referenced by whoever needs it: UPF.jl, DFTK, other people ...).
- Have only a very stripped-down version in DFTK (implying potential false promises about their quality, what works, etc.) and beyond that have people manage their pseudos themselves (can already be done now: `load_psp` takes an arbitrary path to load your custom pseudos).
I lean towards the first option, but I don't have an overview of how many pseudo libraries are actually used, i.e. whether there is a combinatorial explosion sitting behind this solution that one should better not buy into.
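For the first option, a hedged sketch of what usage could look like on the DFTK side (the artifact name `upf_pseudos` and the file layout inside it are hypothetical, and a corresponding Artifacts.toml entry would be needed; the artifact machinery itself is a Julia stdlib):

```julia
using Artifacts  # stdlib; assumes an Artifacts.toml entry named "upf_pseudos"
using DFTK

# artifact"upf_pseudos" returns the local directory of the (lazily) downloaded bundle
pseudo_dir = artifact"upf_pseudos"
# load_psp already accepts arbitrary paths, so referencing files in the artifact just works
Si = ElementPsp(:Si, psp=load_psp(joinpath(pseudo_dir, "pbe", "si-q4.upf")))  # hypothetical layout
```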
Also: I think there are still some stability issues. Just out of curiosity I tried a UPF for a pretty big atom (Hafnium) and got greeted with NaNs. I have not investigated further as I think we should slowly make our way up the complexity ladder, but just to warn everyone :).
Agreed, we should not have these in the repo. GTH are fine because they are super small, but beyond that it's fine to require people to get the UPFs themselves (and having a repo with many of them is a separate project).
> Just out of curiosity I tried a UPF for a pretty big atom (Hafnium) and got greeted with NaNs. I have not investigated further as I think we should slowly make our way up the complexity ladder, but just to warn everyone :).
I suspect this is because it's on a linear grid (is it from ONCVPSP?), and the linear grids usually contain r = 0. The `sphericalbesselj` functions divide by `q * r`, and we don't check for `iszero(psp.rgrid[ir])`, only for `iszero(q)`.
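A minimal sketch of the kind of guard I have in mind, using the x → 0 limits so the r = 0 mesh point doesn't produce NaN (the helper name `sphericalbesselj_fast` stands in for the hand-coded closed forms discussed above):

```julia
# j_l(x) with the x -> 0 limit handled explicitly: j_0(0) = 1, j_l(0) = 0 for l > 0.
@inline function sphericalbesselj_safe(l::Integer, x::T) where {T <: Real}
    iszero(x) && return l == 0 ? one(T) : zero(T)
    sphericalbesselj_fast(l, x)  # hypothetical name for the hand-coded closed forms
end
```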
> Agreed, we should not have these in the repo. GTH are fine because they are super small, but beyond that it's fine to require people to get the UPFs themselves (and having a repo with many of them is a separate project).
Sounds good. It could be nice eventually to have something à la aiida-pseudo, which packages the pseudos, recommended cutoffs, and other metadata from projects like the SSSP and Pseudo-Dojo along with the UPFs / PSPs. We can kick this down the road for now.
Shall we leave in a few pseudos for testing purposes? Maybe `si-q4.upf` for testing consistency with HGH, and the Pseudo-Dojo silicon pseudo for testing ONCVPSP pseudos (i.e. linear-grid pseudos)?
There might be some other pseudo generators that have their own quirks for which we could have an example for testing; I should probably look into this...
I made a bit of a mess yesterday trying to rebase, but I've fixed up a few things:
- Fixed (I hope) the `NaN` problem
- Moved the `sphericalbesselj` functions to `common`
- Refactored `eval_psp_projector_fourier`
- Fixed the real-Fourier consistency tests
- Instantiate real-space interpolators at parse time so that tests take a reasonable amount of time
- Trimmed down the number of UPF files to a few used for testing
- Renamed the `UPF.jl` package to `PseudoPotentialIO.jl`
- Made the repo public and started the registration process
  - Should be available in a few days
What do you guys think are the remaining key points to address to wrap this all up?
Re `PseudoPotentialIO.jl`: why so complicated? Would not `PseudoPotential` be sufficient? I mean, long-term I guess we should move the `PspUpf` and `PspHgh` business from DFTK there as well, don't you think?
> Trimmed down the number of UPF files to a few used for testing
I'm not sure we should have them in `data` as that could suggest too much. Perhaps just have them in the `test` subfolder somewhere?
> What do you guys think are the remaining key points to address to wrap this all up?
From my point of view there are two blocking things:
- The unit issue
- In case not yet done: Some careful test against QE or ABINIT in an example (i.e. show that you can reproduce the numbers over a reasonable range of Ecuts). Maybe with something non-HGH?
A nice example could also be good. (I guess you can just download the UPF file from e.g. https://pseudopotentials.quantum-espresso.org/legacy_tables/hartwigesen-goedecker-hutter-pp/si in the test and do the SCF once with the UPF and once with the analytical HGH?)
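Something along these lines, perhaps (a sketch only: the UPF path is a placeholder for the file downloaded from the page above, and `Ecut`/`kgrid` are illustrative rather than converged values):

```julia
using DFTK

a = 10.26  # Silicon lattice constant in Bohr
lattice = a / 2 * [[0 1 1.]; [1 0 1.]; [1 1 0.]]
positions = [ones(3)/8, -ones(3)/8]

function total_energy(psp; Ecut=30, kgrid=[4, 4, 4])
    Si = ElementPsp(:Si, psp=psp)
    model = model_LDA(lattice, [Si, Si], positions)
    basis = PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid)
    self_consistent_field(basis).energies.total
end

E_upf = total_energy(load_psp("path/to/Si.pz-hgh.UPF"))  # placeholder path to the downloaded UPF
E_hgh = total_energy(load_psp("hgh/lda/si-q4.hgh"))      # analytical HGH shipped with DFTK
@show E_upf E_hgh abs(E_upf - E_hgh)
```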
> Re `PseudoPotentialIO.jl`: why so complicated? Would not `PseudoPotential` be sufficient? I mean, long-term I guess we should move the `PspUpf` and `PspHgh` business from DFTK there as well, don't you think?
I agree, but `PseudoPotentials.jl` was already registered. I guess I could have registered `PseudoPotential.jl` (maybe a bit too close?); I can always change it.
> I'm not sure we should have them in data as that could suggest too much. Perhaps just have them in the test subfolder somewhere?
True, I'll move them into `test/`.
> In case not yet done: Some careful test against QE or ABINIT in an example (i.e. show that you can reproduce the numbers over a reasonable range of Ecuts). Maybe with something non-HGH?
Ah yeah, I think this will definitely take some careful thought to try to get all the inputs lined up as well as we can. I'll give it a first shot and share the input files.
> Ah yeah, I think this will definitely take some careful thought to try to get all the inputs lined up as well as we can. I'll give it a first shot and share the input files.
Oh yes, this is tricky. Don't try to be too fancy at first (i.e. use a simple functional and setup, first one projector then many, etc.), but it's worth doing. I found quite a few smaller errors along the way when I did this with ABINIT back in the day.
> The unit issue
Yeah, this is quite messy.

- `rgrid` -- always in Bohr
- `vloc` -- always in Ry
- `beta` -- either in Bohr^{-1/2} (Vanderbilt USPP) or in Ry Bohr^{-1/2} ("some converters")
- `Dion` / `Dij` / `h` -- either in Ry (Vanderbilt USPP) or in Ry^{-1} ("some converters")

> A note on units: the beta functions and D matrix are defined as in Vanderbilt's US PP code and should have Bohr^{-1/2} and Ry units, respectively. Since they enter the calculation only as (beta·D·beta), one might as well have "dion" in Ry^{-1} units and "beta" in Ry·Bohr^{-1/2} units, in line with what is suggested by the Kleinman-Bylander transformation. Some converters actually do exactly that.
Because the units of `beta` and `D` are undefined, I'm not sure how to deal with this generally. For the pseudos I've tested, it seems like the Kleinman-Bylander units are being used.
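For what it's worth, the two conventions are consistent for the only combination that enters the calculation, beta·D·beta (a quick dimensional check, using the notation above):

- Vanderbilt USPP: Bohr^{-1/2} · Ry · Bohr^{-1/2} = Ry Bohr^{-1}
- "Some converters": (Ry Bohr^{-1/2}) · Ry^{-1} · (Ry Bohr^{-1/2}) = Ry Bohr^{-1}

so as long as we only ever use the product, either convention gives the same result; the ambiguity only bites if `beta` or `D` is needed on its own.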
For the local potential, I don't understand why, but if I convert before the FT, `V(q)` has strong oscillations. I also still don't completely understand why the valence number `Z` needs to be multiplied by two. I'll spend some time staring at it again, but if anyone can see why all this is happening more easily, let me know!
> I also still don't completely understand why the valence number Z needs to be multiplied by two
The potential is Z/r in Hartree, so presumably 2Z/r in Rydbergs? Rydbergs are an atrocious unit; the less we deal with them, the better, so ideally everything would get converted to atomic units (Bohr, Hartree) in the UPF parser...
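A tiny sketch of what that parse-time conversion could look like (field names are the ones discussed above and not necessarily the final struct layout; only energy-like quantities need rescaling, since lengths in the UPF are already in Bohr):

```julia
# 1 Ry = 0.5 Ha, so energy-like quantities get divided by two at parse time.
vloc_ha = upf.vloc ./ 2   # local potential: Ry -> Hartree
dij_ha  = upf.dij ./ 2    # D ("dion") matrix: Ry -> Hartree, assuming the Vanderbilt convention
# beta in Bohr^{-1/2} carries no energy unit and is left untouched in that convention
```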
> I agree, but PseudoPotentials.jl was already registered.
Crap ... too bad then let's leave it as is for now.
> so ideally everything would get converted to atomic units (Bohr, Hartree) in the UPF parser...
Clear +1 on that statement from me.