TensorKit.jl
WIP: redesign tensormap structure
Plan:
- [x] New fusion tree iterator that accepts sector iterators for the outgoing sectors and uses a new ordering, which sorts fusion trees such that the leftmost values change the quickest in the list: outgoing[1], outgoing[2], vertex[1], inner[1], outgoing[3], vertex[2], inner[2], …
- [x] Encode all structure information at the level of the HomSpace and experiment with different strategies for caching this information
- [x] Store all the tensor data in one sequential DenseVector (see the sketch below)
- [ ] (optional) Experiment whether it is beneficial to use a Memory object in Julia 1.11 and beyond.
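For concreteness, a rough sketch of the layout this describes (hypothetical names only, not the actual implementation): all numerical data lives in one flat vector, and the structural bookkeeping (the offsets of blocks or fusion-tree sectors inside that vector) is derived from the HomSpace and can be cached.

# Hypothetical sketch; type and field names are illustrative, not TensorKit's API.
struct FlatTensorMap{T<:Number,S}
    space::S          # the HomSpace; all structural information is derived from this
    data::Vector{T}   # all tensor data, stored sequentially in one dense vector
end

# The structure information maps each block (or fusion-tree sector) to a range
# inside `data`. A toy version for square diagonal blocks, computed once and
# then cacheable per HomSpace:
function block_ranges(blockdims::Vector{Int})
    offsets = cumsum([0; blockdims .^ 2])
    return [offsets[i]+1:offsets[i+1] for i in eachindex(blockdims)]
end

block_ranges([2, 3])   # -> [1:4, 5:13]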
Codecov Report
Attention: Patch coverage is 84.97653% with 160 lines in your changes missing coverage. Please review.
Project coverage is 81.02%. Comparing base (419f88c) to head (aca1263). Report is 51 commits behind head on master.
@@            Coverage Diff             @@
##           master     #140      +/-   ##
==========================================
+ Coverage   79.72%   81.02%   +1.30%
==========================================
  Files          42       41       -1
  Lines        4961     5001      +40
==========================================
+ Hits         3955     4052      +97
+ Misses       1006      949      -57
I've been continuing with this PR and will soon push an update. One set of questions that comes up is what we do with indexing, in particular for TrivialTensorMap.
- Defining TrivialTensorMap as a type alias is already less convenient, since the sector type is no longer part of the parameter list of TensorMap. I am hesitant to make unions such as S<:Union{ComplexSpace,CartesianSpace,GeneralSpace} since that is not extensible, so I would just get rid of the type alias.
- Indexing into a TrivialTensorMap can use no arguments, t[] (returning the data as an N = N1 + N2-dimensional tensor), t[i1,...,iN] (returning a scalar, equivalent to t[][i1,...,iN]), or t[(nothing,nothing)], where nothing is a substitute for the trivial fusion tree (and thus t[(nothing,nothing)] == t[]); a short usage sketch follows below.

The question is whether we want to keep all of those. Without a type alias, they would have to start with sectortype(t) == Trivial || throw(error...).
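For reference, a minimal sketch of these three forms for a non-symmetric tensor (using the rand(Float64, space) constructor that comes up later in this thread; the small example space is just for illustration):

using TensorKit

t = rand(Float64, (ℂ^2 ⊗ ℂ^3) ← ℂ^2)   # sectortype(t) == Trivial

A = t[]                  # the data as a 3-dimensional (N1 + N2 = 3) array
x = t[1, 2, 1]           # a scalar, equivalent to t[][1, 2, 1]
t[(nothing, nothing)]    # nothing stands in for the trivial fusion tree; equal to t[]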
Another question is what the interface should be to get the data vector of a TensorMap: do we just use t.data, or do we want an explicit interface for that? Before, the data was the list of blocks, and so blocks(t) and block(t,c) were the public interface to access the raw data. However, simple vector space methods, in particular everything from VectorInterface.jl, can now be rewritten to act directly on the data vector without needing any of the structure info, so acting on the data field is quite common.
I think my first reaction is to indeed just get rid of TrivialTensorMap, as having to special-case this was already not so convenient. I really don't like the getindex(::TrivialTensorMap, ::Nothing, ::Nothing) method, and I think that unifying the non-symmetric and symmetric tensors might also promote writing code that is not specific to no symmetry.
This being said, I would definitely like to keep the indexing behavior for non-symmetric tensors, both with scalar getindex and setindex methods, and even consider allowing slices/views. It is still not too straightforward to fill up a symmetric tensor, so I don't think we want to get rid of this.
I have to say that I would even be in favor of having a getindex for symmetric tensors, which would act in a way similar to CUDA arrays, where you can do it but you have to either be in the REPL, or explicitly declare that you want to do scalar indexing into a symmetric tensor, with warnings etc for the performance pitfalls. For Abelian symmetries we could even consider allowing setindex, with the same caveats?
I don't have a strong opinion on having an interface for accessing the data. From what I understand, the fields of a type are never really considered public, and for TensorMap this has also never really been the case, the exact way the data is stored is kept hidden as much as possible. Given that it is mostly VectorInterface that can use direct access to the field to speed up the implementation, I'm okay with that being "don't try this at home direct field access" :grin: . I also don't think there is a lot of benefit to have this as part of the interface, and would really consider it an implementation detail.
I am all in favor of getting rid of the (nothing, nothing) indexing.
For the nonsymmetric / trivially symmetric tensors, the identity
t[scalarindices...] == t[][scalarindices...]
implies that maybe the left-hand form is not strictly necessary, as it remains accessible (in an even more flexible way that also allows slicing etc.) as long as t[] still works. However, maybe we would rather deprecate t[] instead.
The reason that I would consider providing an interface to at least get the data, is that unfortunately there are many external libraries that only work on AbstractArray or AbstractVector instances, e.g. DifferentialEquations.jl, several nonlinear optimisation libraries, … . At least with the new data structure, it will become easier to use these in combination with TensorMap instances, as it will be a matter of doing
some_external_function(f, t.data)
where f is a function that encodes an operation on tensors, but now needs to be written as
function f(data)
    # reconstruct tensor
    t = TensorMap(data, space)
    # act
    tnew = # act on t ...
    # return data in case output is also a tensor (e.g. ODE right-hand side)
    return tnew.data
end
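To make that concrete, a runnable sketch of the same pattern (hedged: some_external_function is a hypothetical stand-in for a library routine that takes a vector-valued function and a vector, and multiplying by 2 stands in for the actual tensor operation):

using TensorKit

t0 = rand(Float64, (ℂ^2 ⊗ ℂ^2) ← ℂ^2)
V = space(t0)

function f(data)
    t = TensorMap(data, V)   # reconstruct the tensor on the known space
    tnew = 2 * t             # stand-in for the actual operation on tensors
    return tnew.data         # hand the raw data vector back to the external library
end

some_external_function(g, v) = g(v)   # hypothetical placeholder for e.g. an ODE step
some_external_function(f, t0.data)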
We could even go one step further and make AbstractTensorMap{T} <: AbstractVector{T} with length(t) = dim(t). In that case, getindex and setindex! with a single index need to be supported and would act directly on t.data.
So I think this requires some careful deliberation weighing the pros and cons of different approaches.
I think for the question of t[][inds...] vs t[inds...], my preference goes to the latter. The main reasons are that it is a correct overload of Base.getindex and that it is more intuitive for people who are used to Array objects. Additionally, t[] necessarily has to work with views, which can easily become confusing, and while diagonalt[inds] = val could in principle be made to work, this becomes a lot more difficult with diagonalt[][inds] = val.
I have no real problem with having to copy over some of the AbstractArray implementations to make these convenience methods work.
I am not a big fan of AbstractTensorMap <: AbstractVector in the way you describe, because really that only applies to TensorMap. For AdjointTensorMap, this now has a very weird interplay with what getindex should do, as the data is now non-trivially permuted. For DiagonalTensorMap you in principle no longer have dim(diagt) = length(diagt.data), and for BlockTensorMap this becomes a big hassle to make the indexing work as wanted because now it would be an AbstractArray, but not in the way that is intuitive. I think it would create more problems than it would solve.
As a side note, we already have some functionality for the type of functions you describe through FiniteDifferences.to_vec:

t0 = input_tensor
v0, from_vec = to_vec(t0)

function my_wrapper_fun(v::Vector)
    t = from_vec(v)
    # [...] act on t, producing `result`
    return to_vec(result)[1]
end
I also just realised that this allows a little more fine-grained control over the fact that our data does not have the inner product you would expect whenever non-abelian symmetries are involved, i.e. inner(x.data, y.data) != inner(x, y) in general. This is a really subtle problem that makes me even more hesitant to make the data part of the interface :upside_down_face:
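A minimal illustration of the mismatch (a sketch, assuming the flat data field and TensorKit's convention that the inner product weighs each block by its quantum dimension dim(c)):

using TensorKit, LinearAlgebra

V = Vect[Irrep[SU₂]](1//2 => 1, 1 => 1)
x = rand(Float64, V ← V)
y = rand(Float64, V ← V)

dot(x, y) ≈ dot(x.data, y.data)   # generally false: each block enters dot(x, y) with a dim(c) weight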
These are all good points; the AbstractTensorMap <: AbstractVector suggestion was just something I wanted to throw out there, not something I would myself be in big favor of. But I do think that it is important for further progress that we can easily interface with libraries that only accept array-like data, and I do think we should have a public interface for that. I am not too concerned about AdjointTensorMap and BlockTensorMap; in the cases where you want to use e.g. ODE solvers or optimisation methods, the tensors that you will act on will be normal TensorMaps, I assume. I don't think that situation is dramatically different from some of the AbstractArray wrappers or otherwise custom types.
Regarding the inner product; in principle we could consider absorbing a factor of sqrt(dim(c)) in the data, and then taking it out where needed. Not sure how much overhead that would create and whether it would be compensated by its benefits.
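Reusing x and y from the snippet above, a quick check of why that would work (again only a sketch of the idea): the dim(c) weight can be split symmetrically over the two blocks, so rescaled data recovers the plain Euclidean inner product.

rescaled = sum(dot(sqrt(dim(c)) * block(x, c), sqrt(dim(c)) * block(y, c))
               for c in blocksectors(x))
rescaled ≈ dot(x, y)   # true: absorbing √dim(c) into each block reproduces the tensor inner product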
I wouldn't mind absorbing the normalization in the data, but I think in that case we should also consider changing the normalization of the fusion trees, etc., to the isotopic normalization (I think that would be equivalent?), which is definitely not something I look forward to having to do. On the other hand, the isotopic normalization convention does seem to appear somewhat more often in our use cases (string-nets etc.), so it might be worth it...
Another question that comes up as I continue with this PR concerns the linear algebra factorisations and things like applying functions to 'square' TensorMaps. The current implementation works as follows:
- apply the operation to the individual blocks
- collect the output blocks (possibly of different output arguments, e.g. in the case of factorisations)
- construct the output TensorMaps, thereby directly using the blocks as the actual data

With the new approach, you cannot construct the TensorMap from the blocks in a data-sharing way, so you have to allocate new data and copy the blocks into it. This causes (a potentially great deal of) extra allocations. Maybe this can be resolved by writing mutating versions of these operations in MatrixAlgebra that accept the output blocks as arguments? I don't know if there are better suggestions.
For the MatrixAlgebra, indeed, I cannot really think of a different solution. However, considering that LinearAlgebra does not expose this for these functions, it might be that the overhead of the extra copy is really just negligible for those operations. In any case, I think the code also looks a bit nicer if we don't have to first create all the block data and only then construct the output tensor, since this separates out the allocation and the implementation cleanly, e.g.:
for c in blocksectors(A)
    eig!((block(V, c), block(D, c)), block(A, c))
end
Ok, I finally managed to get a first completely working version. Most of the factorisations are implemented as discussed above by just copying blocks into the output position. This could be done with only minimal changes to the code. Only the SVD (and truncation) implementation was changed (improved in my opinion) quite drastically. There are some things left to do, namely:
- [x] Update CI to include Julia 1.10 and 1.11
- [x] Fix docs (in particular, remove references to TrivialTensorMap)
- [ ] Investigate why my alternative implementation of FiniteDifferences.to_vec does not work (currently commented out)
- [x] Benchmark the impact of this PR, and in particular the different cache choices for fusionblockstructure(::HomSpace)
In future PRs, the follow-ups to this one need to be:
- Streamline the matrix factorisation operations so that no additional allocation is necessary, and also change the interface to support backend and allocator keywords
- Make sure that the TensorOperations backends and allocator info are passed on and correctly handled all the way down to the lowest level (e.g. add_...)
- Add scheduler keyword options, both for operations that act at the matrix block level and for operations that act at the fusion tree block level.
The basic benchmarks from the benchmark folder seem to indicate that (contraction) performance should remain approximately the same. It will be worthwhile to combine this with Bumper.jl, as in that case there should be almost no remaining allocations (Bumper allocates the data part, and the structure part is now cached), and to try some of the more challenging cases with several symmetries combined, which was one of the motivations for caching the structure info.
I think this is ready to be merged. I'll leave it open until the end of the weekend to await other comments. The updated docs can be viewed here: https://jutho.github.io/TensorKit.jl/previews/PR140/
I haven't investigated it further, but I see a significant (10x; 1 ms to 8 ms) performance decrease when switching from v0.12.7 to this branch. I also expected to need even fewer matmuls than I still do on this branch.
using TensorKit, BenchmarkTools
include("submult.jl")
let
L1 = Vect[(Irrep[U₁] ⊠ Irrep[SU₂] ⊠ FermionParity)]((-4, 0, 0)=>3, (-6, 0, 0)=>24, (-8, 0, 0)=>34, (-10, 0, 0)=>12, (-12, 0, 0)=>1, (-4, 1, 0)=>2, (-6, 1, 0)=>29, (-8, 1, 0)=>42, (-10, 1, 0)=>13, (-6, 2, 0)=>2, (-8, 2, 0)=>4, (-5, 1/2, 1)=>18, (-7, 1/2, 1)=>53, (-9, 1/2, 1)=>34, (-11, 1/2, 1)=>3, (-5, 3/2, 1)=>4, (-7, 3/2, 1)=>18, (-9, 3/2, 1)=>8)
L2 = Vect[(Irrep[U₁] ⊠ Irrep[SU₂] ⊠ FermionParity)]((0, 0, 0)=>1, (2, 0, 0)=>1, (1, 1/2, 1)=>1)
R1 = Vect[(Irrep[U₁] ⊠ Irrep[SU₂] ⊠ FermionParity)]((-4, 0, 0)=>3, (-6, 0, 0)=>24, (-8, 0, 0)=>34, (-10, 0, 0)=>12, (-12, 0, 0)=>1, (-4, 1, 0)=>2, (-6, 1, 0)=>29, (-8, 1, 0)=>42, (-10, 1, 0)=>13, (-6, 2, 0)=>2, (-8, 2, 0)=>4, (-5, 1/2, 1)=>18, (-7, 1/2, 1)=>53, (-9, 1/2, 1)=>34, (-11, 1/2, 1)=>3, (-5, 3/2, 1)=>4, (-7, 3/2, 1)=>18, (-9, 3/2, 1)=>8)
R2 = Vect[(Irrep[U₁] ⊠ Irrep[SU₂] ⊠ FermionParity)]((0, 0, 0)=>1, (2, 0, 0)=>1, (1, 1/2, 1)=>1)
R3 = Vect[(Irrep[U₁] ⊠ Irrep[SU₂] ⊠ FermionParity)]((-1, 1/2, 1)=>1)
R4 = Vect[(Irrep[U₁] ⊠ Irrep[SU₂] ⊠ FermionParity)]((-4, 0, 0)=>3, (-6, 0, 0)=>25, (-8, 0, 0)=>34, (-10, 0, 0)=>12, (-12, 0, 0)=>1, (-4, 1, 0)=>2, (-6, 1, 0)=>31, (-8, 1, 0)=>44, (-10, 1, 0)=>13, (-6, 2, 0)=>2, (-8, 2, 0)=>5, (-5, 1/2, 1)=>18, (-7, 1/2, 1)=>50, (-9, 1/2, 1)=>30, (-11, 1/2, 1)=>2, (-5, 3/2, 1)=>4, (-7, 3/2, 1)=>20, (-9, 3/2, 1)=>6)
R5 = Vect[(Irrep[U₁] ⊠ Irrep[SU₂] ⊠ FermionParity)]((0, 0, 0)=>1, (2, 0, 0)=>1, (1, 1/2, 1)=>1)'
T1 = TensorMap{Float64}(undef,(L1 ⊗ L2) ← (R1 ⊗ R2 ⊗ R3))*0 # I'd like to init with rand, but that is not yet implemented?
T2 = TensorMap{Float64}(undef,(R1 ⊗ R2) ← (R4 ⊗ R5))*0;
T3 = TensorMap{Float64}(undef,(L1 ⊗ L2) ← (R4 ⊗ R5 ⊗ R3))*0;
#T1 = TensorMap(rand,Float64,(L1 ⊗ L2) ← (R1 ⊗ R2 ⊗ R3))
#T2 = TensorMap(rand,Float64,(R1 ⊗ R2) ← (R4 ⊗ R5));
#T3 = TensorMap(rand,Float64,(L1 ⊗ L2) ← (R4 ⊗ R5 ⊗ R3));
lsm = LeftSubMult(space(T1),space(T2))
@benchmark begin
rmul!($T3,false)
$lsm($T3,$T1,$T2)
end
end
The code that I have is convoluted, but the thing I'm timing is extremely straightforward, and can be found at the bottom of submult.txt (attached as txt - github refuses .jl)
I might be missing something, but I quickly looked at your file and it seems that at the bottom you are using sectors to index into the data field of the tensor, which should now be a dense vector, so I don't think that can work?
EDIT: I also don't expect to see any performance gain in what you have there. In some sense it is a bit unfair: you crafted your table such that it makes use of efficiently indexing first a block and then subparts of that block, while now it would make more sense to immediately take a view of the underlying data in one go.
I completely forgot that the underlying data field also changed! The speedup would come from the re-ordered fusion trees, but I'll need to change the code a little bit.
I'll also take a look at this asap. Serious performance loss should indeed not happen and would have to be resolved before merging this.
Regarding the random initialization; the new syntax is:
rand(Float64,(L1 ⊗ L2) ← (R1 ⊗ R2 ⊗ R3))
That was actually in another PR from @lkdvos, but I did implement deprecation warnings for that change in this PR, so in principle the old approach should still work? I see what is wrong: the old syntax was TensorMap(rand, Float64, ...). This is still supported, but in deprecation mode. I don't think TensorMap{Float64}(rand, ...) was ever supported.
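To spell out the initialization variants mentioned here (my reading of the current state of the branch, reusing the spaces from the benchmark snippet above):

W = (L1 ⊗ L2) ← (R1 ⊗ R2 ⊗ R3)

T = rand(Float64, W)              # new syntax
T = TensorMap(rand, Float64, W)   # old syntax; still works, but goes through the deprecation path
T = TensorMap{Float64}(undef, W)  # allocates uninitialized data, as used in the benchmark above
# TensorMap{Float64}(rand, W)     # never supported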
I was testing an old version of jh/tensorstructure (with some extra local changes). I will reimplement on the latest master today; it really should be strictly faster (fewer matmuls and less dictionary indexing).
With latest master you mean the latest version of this branch, right?
Yep. I will also check if the blockstructure caching gives a sufficient speedup to do away with some of the uglier things in mpskitexperimental/qchem, but that will take more time. In general I'm always a bit afraid of the possible performance hit of blocking caches.
I am also suspicious of the blocking cache, so a benchmark in an actual test case that involves several tasks/threads would be great. That's why there is also a task-local cache implementation, though there is not yet functionality for readily selecting it, other than replacing the existing method by doing

function TensorKit.CacheStyle(I::Type{<:Sector})
    return TaskLocalCache{Dict{Any,Any}}()
end
Yet another alternative would be a global cache based on https://github.com/JuliaConcurrent/MultiThreadedCaches.jl
My updated submult is even more of a mess, but:
- it now indeed only does 27 mul! calls instead of 281
- it is only a little bit slower (1 ms -> 1.1 ms)

The slowdown might be due to bad alignment or something more subtle. It is not significant enough to care about, but it is quite surprising nevertheless.