DFTK.jl
Support for NC pseudos in UPF format
Based on @unkcpz's PR.
To do:
- [ ] Non-linear core correction?
- [ ] Implement initial guesses
  - [ ] Charge density from atomic charge density (`PP_RHO`)
  - [ ] Wavefunction from pseudo-atomic wave functions (`PP_CHI`)
- [ ] Improve documentation
  - [ ] Example script
  - [ ] Add a page on pseudopotentials
    - [ ] Formulas
    - [ ] Limitations
    - [ ] References
  - [x] Docstrings
- [ ] Improve performance
  - [ ] Parallelize over projectors / G-vectors?
  - [ ] Calculate only for unique values of `|G|`?
  - [x] Precompute common integrand terms
  - [x] Create interpolators at parse time
- [x] Clean up unit handling
- [x] Settle on an integrator / integrator interface (Not needed)
- [x] Settle on local potential correction / correction interface (Only Coulomb)
- [x] Add / remove included UPF pseudopotentials (Kept a few for testing)
- [x] Register `PseudoPotentialIO`
- [x] Improve testing
  - [x] Figure out why `quadgk` references used for HGH don't agree well with UPFs (Needed correction term)
  - [ ] Refine HGH comparison tolerances
> If the `rab` are already the integration weights, why do you do a quadrature (`ctrap`) rather than just a dot product?
I wasn't sure if they account for the half-weight at zero. I did a bit of testing comparing the dot product and the `ctrap` with the half-weight, and they seem to be in similar agreement with HGH.
I've changed to using a dot product; it's probably a good step for better performance, assuming it's the right thing to do.
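For reference, here is a minimal sketch of the two variants being compared (the helper names are hypothetical; `rab` denotes the UPF integration weights, and the "ctrap"-like version simply half-weights the end points):

```julia
# Integrand `f` already evaluated on the radial mesh; `rab` are the UPF weights (dr/di).
integrate_dot(f, rab) = sum(f .* rab)   # plain dot product with the mesh weights

# Trapezoid-like variant that half-weights the first and last mesh points
function integrate_trap(f, rab)
    s = (f[1] * rab[1] + f[end] * rab[end]) / 2
    for i in 2:(length(f) - 1)
        s += f[i] * rab[i]
    end
    s
end
```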
It looks like ~78% of the time initializing the `PlaneWaveBasis` for small `Ecut` and few k-points with UPFs is taken up by `_besselj`, which is a call to Amos. Allocations could still be improved.
```julia
using DFTK, BenchmarkTools

function init_si(pseudo_type, n)
    a = 10.26  # Silicon lattice constant in Bohr
    lattice = a / 2 * [[0 1 1.];
                       [1 0 1.];
                       [1 1 0.]]
    Si = ElementPsp(:Si, psp=load_psp("$pseudo_type/pbe/si-q4.$pseudo_type"))
    atoms = [Si, Si]
    positions = [ones(3)/8, -ones(3)/8]
    model = model_PBE(lattice, atoms, positions)
    for _ = 1:n-1
        PlaneWaveBasis(model; Ecut=15, kgrid=[4, 4, 4])
    end
    PlaneWaveBasis(model; Ecut=15, kgrid=[4, 4, 4])
end

basis = @btime init_si("upf", 1);  # benchmark
# 6.753 s (26559047 allocations: 865.72 MiB)
```
Initializing the `PlaneWaveBasis` for, e.g., `Ecut=60`, `kgrid=[12, 12, 12]` takes quite a few minutes. Maybe we need to implement an approximate but optimized `sphericalbesselj` like what is done in QE?
I'm waiting on the profiling + benchmarking for this setup, but I'm guessing that reducing allocations is also going to be a top priority to get this down to a reasonable time.
If the profiling clearly points to the `besselj` function, working on allocations isn't going to do anything. Note https://discourse.julialang.org/t/ann-bessels-jl-bessel-hankel-and-airy-functions/87941, but that's only a factor of 2. Note though that these packages are for general (non-integer) order, and the integer-order Bessel functions are considerably simpler to compute: https://en.wikipedia.org/wiki/Bessel_function#Spherical_Bessel_functions:_jn,_yn . So at worst we could just implement them by hand for low order (only l=3 is needed, right?).
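For reference, the closed forms for the low orders (up to l = 3) are short enough to hand-code; a sketch only, and these expressions lose accuracy for very small x, where a series expansion or an explicit limit would be needed:

```julia
# Spherical Bessel functions of the first kind j_l(x) for l = 0..3 (closed forms).
sphericalbesselj0(x) = sin(x) / x
sphericalbesselj1(x) = sin(x) / x^2 - cos(x) / x
sphericalbesselj2(x) = (3 / x^2 - 1) * sin(x) / x - 3cos(x) / x^2
sphericalbesselj3(x) = (15 / x^3 - 6 / x) * sin(x) / x - (15 / x^2 - 1) * cos(x) / x
```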
With judicious use of `@inline` and `@simd`, it should be possible to get the code to vectorize. If that still does not result in acceptable performance, we bring out the big guns (LoopVectorization).
Just did some quick tests for integrating an array of size 100000 with l=2:

- Naive version (what you have, essentially): 15 ms
- Hand-coded spherical Bessel of order 2: 1.5 ms
- Hand-coded + `@inline` + `@simd`: 1 ms
- Hand-coded + `@turbo`: 400 μs
- Hand-coded + not building `x` but just summing up the result: 300 μs

So no need to go do complicated optimizations, it looks like :-) Here's the final version:
```julia
using BenchmarkTools
using LoopVectorization

# Hand-coded spherical Bessel function j2; the `order` argument is kept for
# symmetry with a generic interface but is unused here.
@inline function sphericalbesselj2(order, x)
    (3 / x^2 - 1) * sin(x) / x - 3cos(x) / x^2
end

# Sum the integrand directly instead of building an intermediate array.
function compute(q, rgrid, drgrid, rprojs, order)
    N = length(rgrid)
    s = 0.0
    @turbo for ir = 1:N
        s += drgrid[ir] * sphericalbesselj2(order, q * rgrid[ir]) * rprojs[ir]
    end
    s
end

N = 100000
rgrid = rand(N)
drgrid = rand(N)
rprojs = rand(N)
q = 2.3
order = 2
@btime compute($q, $rgrid, $drgrid, $rprojs, $order)
```
Sorry, there was a mistake in the previous post (if you read this via email notifications), updated now.
Ok sweet, I did write up the hand-coded spherical Bessels, which was already much faster. I thought that LoopVectorization wasn't a dependency of DFTK, so I didn't give it a go.
I'll work a bit on implementing it like this with `@turbo` and `@inline` and check back in, thanks!
```julia
# For Silicon, 2-atom cell, si-q4.upf
Ecut = 80
kgrid = [12, 12, 12]

julia> basis = @benchmark PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid)  # benchmark
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 32.746 s (0.65% GC) to evaluate,
 with a memory estimate of 1.41 GiB, over 3165129 allocations.
```
Much better!
Still looks expensive but combined with q interpolation it should be ok. You can avoid the code duplication by using `Val` types and function barriers: have `order` be a `Val` type, and convert from int to `Val` in a separate function from the loop (so that the loop is compiled using the static type information for the order, and propagates it to the inlined Bessel function). There are instances of such tricks in the "performance tips" section of the Julia manual.
I think this is as you suggest(?), but it seems to break `@turbo`.
```julia
@inline function sphericalbesselj(order::Val{0}, x)
    sin(x) / x
end
@inline function sphericalbesselj(order::Val{1}, x)
    (sin(x) / x - cos(x)) / x
end
...

function psp_projector_fourier_inner_loop(order::Val, psp::PspUpf, i, l, q::T)::T where {T <: Real}
    N = length(psp.rprojs[l+1][i])
    s = zero(T)
    @turbo for ir = 1:N
        s += sphericalbesselj(order, q * psp.rgrid[ir]) * psp.rprojs[l+1][i][ir] * psp.drgrid[ir]
    end
    s
end

function eval_psp_projector_fourier(psp::PspUpf, i, l, q::T)::T where {T <: Real}
    if iszero(q)
        if l == 0
            s = T(1)
        else
            s = T(0)
        end
    else
        l_val = Val(l)
        s = psp_projector_fourier_inner_loop(l_val, psp, i, l, q)
    end
    return s
end
```
```julia
julia> PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid);  # warm up
┌ Warning: #= /data/azadoks/source/git/DFTK.jl/src/pseudo/PspUpf.jl:172 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
```
Hm, curious (and annoying). That looks like a limitation of LoopVectorization; I'm not sure how to work around it... That said, if hand-coded Bessels plus SIMD already give reasonable enough performance, we might just as well not use LV...
It seems like `@simd` also doesn't work with the `Val{l}` dispatch.

With `@simd`:
```julia
# For Silicon, 2-atom cell, si-q4.upf
Ecut = 80
kgrid = [12, 12, 12]

julia> basis = @time PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid);  # benchmark
 70.301417 seconds (29.15 M allocations: 2.576 GiB, 0.61% gc time, 0.08% compilation time)
```
With `@inbounds` + `@fastmath`:

```julia
# For Silicon, 2-atom cell, si-q4.upf
Ecut = 80
kgrid = [12, 12, 12]

julia> basis = @time PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid);  # benchmark
 69.722348 seconds (29.08 M allocations: 2.573 GiB, 0.43% gc time)
```
A glance through the LoopVectorization documentation makes me think that `@simd` and `@turbo` might give us problems down the line with dual numbers anyway. It would be a bit messy and repetitive, but I'd propose having both:

- The verbose `@turbo` version, to be used with `Float64` + `Float32`, which can be easily turbocharged without messing with array-of-structs -> array-of-numbers transforms
- The safe and succinct `Val{l}`-dispatched `@inbounds` + `@fastmath` version for everything else
DFTK does pride itself on simplicity, so I'd say let's forget about LV and just use `@simd` and `Val`. The timings you report are reasonable, I'd say. In any event, if we want more performance, we just do interpolation in q space (which, even with a very fine mesh, should result in an extremely significant boost, since we are basically sampling all of 3D q space here, and the functions should be very smooth functions of q).
Also, LV is kind of in flux and due for a rewrite from what I understood, so hopefully in a year or so we can just slap `@turbo` on the `Val` version and it'll just work.
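To make the q-space interpolation idea concrete, here is a rough sketch of what it could look like (the grid spacing, `qmax`, and the hand-rolled linear interpolation are placeholders, not a proposed interface):

```julia
# Tabulate the projector form factor once on a fine 1D q grid ...
function build_projector_table(psp, i, l; qmax=100.0, dq=0.01)
    qs = 0.0:dq:qmax
    qs, [eval_psp_projector_fourier(psp, i, l, q) for q in qs]
end

# ... then evaluate by piecewise-linear interpolation (q must lie within [0, qmax]).
function interp_projector(qs, vals, q)
    j = clamp(floor(Int, (q - first(qs)) / step(qs)) + 1, 1, length(qs) - 1)
    t = (q - qs[j]) / step(qs)
    (1 - t) * vals[j] + t * vals[j + 1]
end
```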
Another thing that could be worth trying is to parallelise over the evaluation of projectors or the evaluation of Fourier components in the local potential. This will certainly help for larger calculations on clusters.
I agree with Antoine regarding the tradeoff simplicity / speed. I think what we have now is ok in speed and I would avoid making the code more complicated if possible.
> Add / remove included UPF pseudopotentials
On that point: I think we should certainly not try to have a comprehensive UPF pseudo library in our code base. I see two options:
- Introduce a data artefact containing all sorts of pseudo libraries (that could then be referenced by whoever needs it: UPF.jl, DFTK, other people ...).
- Have only a very stripped-down version in DFTK (implying potential false promises about their quality, what works, etc.) and beyond that have people manage their pseudos themselves (can already be done now: `load_psp` takes an arbitrary path to load your custom pseudos).
I lean towards the first option, but I don't have an overview of how many pseudo libraries are actually used, i.e. whether there is a combinatorial explosion sitting behind this solution that one should better not buy into.
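For the first option, a hedged sketch of what usage could look like on the DFTK side (the artifact name `upf_pseudos` and the file layout inside it are hypothetical, and a corresponding Artifacts.toml entry would be needed; the artifact machinery itself is a Julia stdlib):

```julia
using Artifacts  # stdlib; assumes an Artifacts.toml entry named "upf_pseudos"
using DFTK

# artifact"upf_pseudos" returns the local directory of the (lazily) downloaded bundle
pseudo_dir = artifact"upf_pseudos"
# load_psp already accepts arbitrary paths, so referencing files in the artifact just works
Si = ElementPsp(:Si, psp=load_psp(joinpath(pseudo_dir, "pbe", "si-q4.upf")))  # hypothetical layout
```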
Also: I think there are still some stability issues. Just out of curiosity I tried a UPF for a pretty big atom (Hafnium) and got greeted with NaNs. I have not investigated further as I think we should slowly make our way up the complexity ladder, but just to warn everyone :).
Agreed, we should not have these in the repo. GTH are fine because they are super small, but beyond that it's fine to require people to get the UPFs themselves (and having a repo with many of them is a separate project).
> Just out of curiosity I tried a UPF for a pretty big atom (Hafnium) and got greeted with NaNs. I have not investigated further as I think we should slowly make our way up the complexity ladder, but just to warn everyone :).
I suspect this is because it's on a linear grid (is it from ONCVPSP?), and the linear grids usually contain r = 0. The `sphericalbesselj` functions divide by `q * r`, and we don't check for `iszero(psp.rgrid[ir])`, only for `iszero(q)`.
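A minimal sketch of the kind of guard I have in mind, using the x → 0 limits so the r = 0 mesh point doesn't produce NaN (the helper name `sphericalbesselj_fast` stands in for the hand-coded closed forms discussed above):

```julia
# j_l(x) with the x -> 0 limit handled explicitly: j_0(0) = 1, j_l(0) = 0 for l > 0.
@inline function sphericalbesselj_safe(l::Integer, x::T) where {T <: Real}
    iszero(x) && return l == 0 ? one(T) : zero(T)
    sphericalbesselj_fast(l, x)  # hypothetical name for the hand-coded closed forms
end
```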
> Agreed, we should not have these in the repo. GTH are fine because they are super small, but beyond that it's fine to require people to get the UPFs themselves (and having a repo with many of them is a separate project).
Sounds good. It could be nice eventually to have something à la aiida-pseudo, which packages the pseudos, recommended cutoffs, and other metadata from projects like the SSSP and Pseudo-Dojo along with the UPFs / PSPs. We can kick this down the road for now.
Shall we leave in a few pseudos for testing purposes? Maybe `si-q4.upf` for testing consistency with HGH, and the Pseudo-Dojo silicon pseudo for testing ONCVPSP pseudos (i.e. linear-grid pseudos)?
There might be some other pseudo generators that have their own quirks for which we could have an example for testing; I should probably look into this...
I made a bit of a mess yesterday trying to rebase, but I've fixed up a few things:
- Fixed (I hope) the `NaN` problem
- Moved the `sphericalbesselj` functions to `common`
- Refactored `eval_psp_projector_fourier`
- Fixed the real-Fourier consistency tests
- Instantiate real-space interpolators at parse time so that tests take a reasonable amount of time
- Trimmed down the number of UPF files to a few used for testing
- Renamed the `UPF.jl` package to `PseudoPotentialIO.jl`
- Made the repo public and started the registration process
  - Should be available in a few days
What do you guys think are the remaining key points to address to wrap this all up?
Re `PseudoPotentialIO.jl`: why so complicated? Would not `PseudoPotential` be sufficient? I mean, long-term I guess we should move the `PspUpf` and `PspHgh` business from DFTK there as well, don't you think?
> Trimmed down the number of UPF files to a few used for testing
I'm not sure we should have them in `data` as that could suggest too much. Perhaps just have them in the `test` subfolder somewhere?
> What do you guys think are the remaining key points to address to wrap this all up?
From my point of view there are two blocking things:
- The unit issue
- In case not yet done: Some careful test against QE or ABINIT in an example (i.e. show that you can reproduce the numbers over a reasonable range of Ecuts). Maybe with something non-HGH?
A nice example could also be good. (I guess you can just download the UPF file from e.g. https://pseudopotentials.quantum-espresso.org/legacy_tables/hartwigesen-goedecker-hutter-pp/si in the test and do the SCF once with the UPF and once with the analytical HGH?)
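Something along these lines, perhaps (a sketch only: the UPF path is a placeholder for the file downloaded from the page above, and `Ecut`/`kgrid` are illustrative rather than converged values):

```julia
using DFTK

a = 10.26  # Silicon lattice constant in Bohr
lattice = a / 2 * [[0 1 1.]; [1 0 1.]; [1 1 0.]]
positions = [ones(3)/8, -ones(3)/8]

function total_energy(psp; Ecut=30, kgrid=[4, 4, 4])
    Si = ElementPsp(:Si, psp=psp)
    model = model_LDA(lattice, [Si, Si], positions)
    basis = PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid)
    self_consistent_field(basis).energies.total
end

E_upf = total_energy(load_psp("path/to/Si.pz-hgh.UPF"))  # placeholder path to the downloaded UPF
E_hgh = total_energy(load_psp("hgh/lda/si-q4.hgh"))      # analytical HGH shipped with DFTK
@show E_upf E_hgh abs(E_upf - E_hgh)
```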
> Re `PseudoPotentialIO.jl`: why so complicated? Would not `PseudoPotential` be sufficient? I mean, long-term I guess we should move the `PspUpf` and `PspHgh` business from DFTK there as well, don't you think?
I agree, but `PseudoPotentials.jl` was already registered. I guess I could have registered `PseudoPotential.jl` (maybe a bit too close?); I can always change it.
> I'm not sure we should have them in data as that could suggest too much. Perhaps just have them in the test subfolder somewhere?
True, I'll move them into `test/`.
> In case not yet done: Some careful test against QE or ABINIT in an example (i.e. show that you can reproduce the numbers over a reasonable range of Ecuts). Maybe with something non-HGH?
Ah yeah, I think this will definitely take some careful thought to try to get all the inputs lined up as well as we can. I'll give it a first shot and share the input files.
> Ah yeah, I think this will definitely take some careful thought to try to get all the inputs lined up as well as we can. I'll give it a first shot and share the input files.
Oh yes, this is tricky. Don't try to be too fancy at first (i.e. use a simple functional and setup, first one projector then many, etc.), but it's worth doing. I found quite a few smaller errors along the way when I did this with ABINIT back in the day.
> The unit issue
Yeah, this is quite messy.

- `rgrid` -- always in Bohr
- `vloc` -- always in Ry
- `beta` -- either in Bohr^{-1/2} (Vanderbilt USPP) or in Ry Bohr^{-1/2} ("some converters")
- `Dion` / `Dij` / `h` -- either in Ry (Vanderbilt USPP) or in Ry^{-1} ("some converters")

> A note on units: the beta functions and D matrix are defined as in Vanderbilt's US PP code and should have Bohr^{-1/2} and Ry units, respectively. Since they enter the calculation only as (beta·D·beta), one might as well have "dion" in Ry^{-1} units and "beta" in Ry·Bohr^{-1/2} units, in line with what is suggested by the Kleinman-Bylander transformation. Some converters actually do exactly that.
Because the units of `beta` and `D` are undefined, I'm not sure how to deal with this generally. For the pseudos I've tested, it seems like the Kleinman-Bylander units are being used.
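For what it's worth, the two conventions are consistent for the only combination that enters the calculation, beta·D·beta (a quick dimensional check, using the notation above):

- Vanderbilt USPP: Bohr^{-1/2} · Ry · Bohr^{-1/2} = Ry Bohr^{-1}
- "Some converters": (Ry Bohr^{-1/2}) · Ry^{-1} · (Ry Bohr^{-1/2}) = Ry Bohr^{-1}

so as long as we only ever use the product, either convention gives the same result; the ambiguity only bites if `beta` or `D` is needed on its own.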
For the local potential, I don't understand why, but if I convert before the FT, `V(q)` has strong oscillations. I also still don't completely understand why the valence number `Z` needs to be multiplied by two. I'll spend some time staring at it again, but if anyone can see why all this is happening more easily, let me know!
> I also still don't completely understand why the valence number Z needs to be multiplied by two
The potential is Z/r in Hartree, so presumably 2Z/r in Rydbergs? Rydbergs are an atrocious unit; the less we deal with them, the better, so ideally everything would get converted to atomic units (Bohr, Hartree) in the UPF parser...
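A tiny sketch of what that parse-time conversion could look like (field names are the ones discussed above and not necessarily the final struct layout; only energy-like quantities need rescaling, since lengths in the UPF are already in Bohr):

```julia
# 1 Ry = 0.5 Ha, so energy-like quantities get divided by two at parse time.
vloc_ha = upf.vloc ./ 2   # local potential: Ry -> Hartree
dij_ha  = upf.dij ./ 2    # D ("dion") matrix: Ry -> Hartree, assuming the Vanderbilt convention
# beta in Bohr^{-1/2} carries no energy unit and is left untouched in that convention
```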
> I agree, but PseudoPotentials.jl was already registered.
Crap ... too bad then let's leave it as is for now.
> so ideally everything would get converted to atomic units (Bohr, Hartree) in the UPF parser...
Clear +1 on that statement from me.