TensorKit.jl
WIP: redesign tensormap structure
Plan:
- [x] New fusion tree iterator that accepts sector iterators for the outgoing sectors and uses a new ordering, which sorts fusion trees such that the leftmost values change the quickest in the list: outgoing[1], outgoing[2], vertex[1], inner[1], outgoing[3], vertex[2], inner[2], …
- [x] Encode all structure information at the level of the HomSpace and experiment with different strategies for caching this information
- [x] Store all the tensor data in one sequential DenseVector (see the sketch below)
- [ ] (optional) Experiment whether it is beneficial to use a Memory object in Julia 1.11 and beyond.
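For concreteness, a rough sketch of the layout this describes (hypothetical names only, not the actual implementation): all numerical data lives in one flat vector, and the structural bookkeeping (the offsets of blocks or fusion-tree sectors inside that vector) is derived from the HomSpace and can be cached.

# Hypothetical sketch; type and field names are illustrative, not TensorKit's API.
struct FlatTensorMap{T<:Number,S}
    space::S          # the HomSpace; all structural information is derived from this
    data::Vector{T}   # all tensor data, stored sequentially in one dense vector
end

# The structure information maps each block (or fusion-tree sector) to a range
# inside `data`. A toy version for square diagonal blocks, computed once and
# then cacheable per HomSpace:
function block_ranges(blockdims::Vector{Int})
    offsets = cumsum([0; blockdims .^ 2])
    return [offsets[i]+1:offsets[i+1] for i in eachindex(blockdims)]
end

block_ranges([2, 3])   # -> [1:4, 5:13]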
Codecov Report
Attention: Patch coverage is 84.97653% with 160 lines in your changes missing coverage. Please review.
Project coverage is 81.02%. Comparing base (419f88c) to head (aca1263). Report is 51 commits behind head on master.
@@            Coverage Diff             @@
##           master     #140      +/-   ##
==========================================
+ Coverage   79.72%   81.02%   +1.30%
==========================================
  Files          42       41       -1
  Lines        4961     5001      +40
==========================================
+ Hits         3955     4052      +97
+ Misses       1006      949      -57
I've been continuing with this PR and will soon push an update. One set of questions that comes up is what we do with indexing, in particular for TrivialTensorMap.
- Defining TrivialTensorMap as a type alias is already less convenient, since the sector type is no longer part of the parameter list of TensorMap. I am hesitant to make unions such as S<:Union{ComplexSpace,CartesianSpace,GeneralSpace} since that is not extensible, so I would just get rid of the type alias.
- Indexing into a TrivialTensorMap can use no arguments, t[] (returning the data as an N = N1 + N2-dimensional tensor), t[i1,...,iN] (returning a scalar, equivalent to t[][i1,...,iN]), or t[(nothing,nothing)], where nothing is a substitute for the trivial fusion tree (and thus t[(nothing,nothing)] == t[]); a short usage sketch follows below.

The question is whether we want to keep all of those. Without a type alias, they would have to start with sectortype(t) == Trivial || throw(error...).
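For reference, a minimal sketch of these three forms for a non-symmetric tensor (using the rand(Float64, space) constructor that comes up later in this thread; the small example space is just for illustration):

using TensorKit

t = rand(Float64, (ℂ^2 ⊗ ℂ^3) ← ℂ^2)   # sectortype(t) == Trivial

A = t[]                  # the data as a 3-dimensional (N1 + N2 = 3) array
x = t[1, 2, 1]           # a scalar, equivalent to t[][1, 2, 1]
t[(nothing, nothing)]    # nothing stands in for the trivial fusion tree; equal to t[]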
Another question is what the interface should be to get the data vector of a TensorMap: do we just use t.data, or do we want an explicit interface for that? Before, the data was the list of blocks, and so blocks(t) and block(t,c) were the public interface to access the raw data. However, simple vector space methods, in particular everything from VectorInterface.jl, can now be rewritten to act directly on the data vector without needing any of the structure info, so acting on the data field is quite common.
I think my first reaction is to indeed just get rid of TrivialTensorMap, as having to special-case this was already not so convenient. I really don't like the getindex(::TrivialTensorMap, ::Nothing, ::Nothing) method, and I think that unifying the non-symmetric and symmetric tensors might also promote writing code that is not specific to no symmetry.
This being said, I would definitely like to keep the indexing behavior for non-symmetric tensors, both with scalar getindex and setindex methods, and even consider allowing slices/views. It is still not too straightforward to fill up a symmetric tensor, so I don't think we want to get rid of this.
I have to say that I would even be in favor of having a getindex for symmetric tensors, which would act in a way similar to CUDA arrays, where you can do it but you have to either be in the REPL, or explicitly declare that you want to do scalar indexing into a symmetric tensor, with warnings etc for the performance pitfalls. For Abelian symmetries we could even consider allowing setindex, with the same caveats?
I don't have a strong opinion on having an interface for accessing the data. From what I understand, the fields of a type are never really considered public, and for TensorMap this has also never really been the case, the exact way the data is stored is kept hidden as much as possible. Given that it is mostly VectorInterface that can use direct access to the field to speed up the implementation, I'm okay with that being "don't try this at home direct field access" :grin: . I also don't think there is a lot of benefit to have this as part of the interface, and would really consider it an implementation detail.
I am all in favor of getting rid of the (nothing, nothing) indexing.
For the nonsymmetric / trivially symmetric tensors, the identity
t[scalarindices...] == t[][scalarindices...]
implies that maybe the left-hand form is not strictly necessary, as it remains accessible (in an even more flexible way that also allows slicing etc.) as long as t[] still works. However, maybe we would rather deprecate t[] instead.
The reason that I would consider providing an interface to at least get the data, is that unfortunately there are many external libraries that only work on AbstractArray or AbstractVector instances, e.g. DifferentialEquations.jl, several nonlinear optimisation libraries, … . At least with the new data structure, it will become easier to use these in combination with TensorMap instances, as it will be a matter of doing
some_external_function(f, t.data)
where f is a function that encodes an operation on tensors, but now needs to be written as
function f(data)
    # reconstruct tensor
    t = TensorMap(data, space)
    # act
    tnew = # act on t ...
    # return data in case output is also a tensor (e.g. ODE right-hand side)
    return tnew.data
end
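To make that concrete, a runnable sketch of the same pattern (hedged: some_external_function is a hypothetical stand-in for a library routine that takes a vector-valued function and a vector, and multiplying by 2 stands in for the actual tensor operation):

using TensorKit

t0 = rand(Float64, (ℂ^2 ⊗ ℂ^2) ← ℂ^2)
V = space(t0)

function f(data)
    t = TensorMap(data, V)   # reconstruct the tensor on the known space
    tnew = 2 * t             # stand-in for the actual operation on tensors
    return tnew.data         # hand the raw data vector back to the external library
end

some_external_function(g, v) = g(v)   # hypothetical placeholder for e.g. an ODE step
some_external_function(f, t0.data)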
We could even go one step further and make AbstractTensorMap{T} <: AbstractVector{T} with length(t) = dim(t). In that case, getindex and setindex! with a single index need to be supported and would act directly on t.data.
So I think this requires some careful deliberation weighing the pros and cons of different approaches.
I think for the question of t[][inds...] vs t[inds...], my preference goes to the latter. The main reasons are that it is a correct overload of Base.getindex and that it is more intuitive for people who are used to Array objects. Additionally, t[] necessarily has to work with views, which can easily become confusing, and while diagonalt[inds] = val could in principle be made to work, this becomes a lot more difficult with diagonalt[][inds] = val.
I have no real problem with having to copy over some of the AbstractArray implementations to make these convenience methods work.
I am not a big fan of AbstractTensorMap <: AbstractVector in the way you describe, because really that only applies to TensorMap. For AdjointTensorMap, this now has a very weird interplay with what getindex should do, as the data is now non-trivially permuted. For DiagonalTensorMap you in principle no longer have dim(diagt) = length(diagt.data), and for BlockTensorMap this becomes a big hassle to make the indexing work as wanted because now it would be an AbstractArray, but not in the way that is intuitive. I think it would create more problems than it would solve.
As a side note, we already have some functionality for the type of functions you describe through FiniteDifferences.to_vec:

t0 = input_tensor
v0, from_vec = to_vec(t0)

function my_wrapper_fun(v::Vector)
    t = from_vec(v)
    # [...] act on t, producing `result`
    return to_vec(result)[1]
end
I also just realised that this allows a little more fine-grained control over the fact that our data does not have the inner product you would expect whenever non-abelian symmetries are involved, i.e. inner(x.data, y.data) != inner(x, y) in general. This is a really subtle problem that makes me even more hesitant to make the data part of the interface :upside_down_face:
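A minimal illustration of the mismatch (a sketch, assuming the flat data field and TensorKit's convention that the inner product weighs each block by its quantum dimension dim(c)):

using TensorKit, LinearAlgebra

V = Vect[Irrep[SU₂]](1//2 => 1, 1 => 1)
x = rand(Float64, V ← V)
y = rand(Float64, V ← V)

dot(x, y) ≈ dot(x.data, y.data)   # generally false: each block enters dot(x, y) with a dim(c) weight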
These are all good points; the AbstractTensorMap <: AbstractVector suggestion was just something I wanted to throw out there, not something I would myself be in big favor of. But I do think that it is important for further progress that we can easily interface with libraries that only accept array-like data, and I do think we should have a public interface for that. I am not too concerned about AdjointTensorMap and BlockTensorMap; in the cases where you want to use e.g. ODE solvers or optimisation methods, the tensors that you will act on will be normal TensorMaps, I assume. I don't think that situation is dramatically different from some of the AbstractArray wrappers or otherwise custom types.
Regarding the inner product; in principle we could consider absorbing a factor of sqrt(dim(c)) in the data, and then taking it out where needed. Not sure how much overhead that would create and whether it would be compensated by its benefits.
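Reusing x and y from the snippet above, a quick check of why that would work (again only a sketch of the idea): the dim(c) weight can be split symmetrically over the two blocks, so rescaled data recovers the plain Euclidean inner product.

rescaled = sum(dot(sqrt(dim(c)) * block(x, c), sqrt(dim(c)) * block(y, c))
               for c in blocksectors(x))
rescaled ≈ dot(x, y)   # true: absorbing √dim(c) into each block reproduces the tensor inner product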
I wouldn't mind absorbing the normalization in the data, but I think in that case we should also consider changing the normalization of the fusion trees, etc., to the isotopic normalization (I think that would be equivalent?), which is definitely not something I look forward to having to do. On the other hand, the isotopic normalization convention does seem to appear somewhat more often in our use cases (string-nets etc.), so it might be worth it...
Another question that comes up as I continue with this PR concerns the linear algebra factorisations and things like applying functions to 'square' TensorMaps. The current implementation works as follows:
- apply the operation to the individual blocks
- collect the output blocks (possibly of different output arguments, e.g. in the case of factorisations)
- construct the output TensorMaps, thereby directly using the blocks as the actual data

With the new approach, you cannot construct the TensorMap from the blocks in a data-sharing way, so you have to allocate new data and copy the blocks into it. This causes (a potentially great deal of) extra allocations. Maybe this can be resolved by writing mutating versions of these operations in MatrixAlgebra that accept the output blocks as arguments? I don't know if there are better suggestions.
For the MatrixAlgebra, indeed, I cannot really think of a different solution. However, considering that LinearAlgebra does not expose this for these functions, it might be that the overhead of the extra copy is really just negligible for those operations. In any case, I think the code also looks a bit nicer if we don't have to first create all the block data and only then construct the output tensor, since this separates out the allocation and the implementation cleanly, e.g.:
for c in blocksectors(A)
    eig!((block(V, c), block(D, c)), block(A, c))
end
Ok, I finally managed to get a first completely working version. Most of the factorisations are implemented as discussed above by just copying blocks into the output position. This could be done with only minimal changes to the code. Only the SVD (and truncation) implementation was changed (improved in my opinion) quite drastically. There are some things left to do, namely:
- [x] Update CI to include Julia 1.10 and 1.11
- [x] Fix docs (in particular, remove references to TrivialTensorMap)
- [ ] Investigate why my alternative implementation of FiniteDifferences.to_vec does not work (currently commented out)
- [x] Benchmark the impact of this PR, and in particular the different cache choices for fusionblockstructure(::HomSpace)
In future PRs, the follow-ups to this one need to be:
- Streamline the matrix factorisation operations so that no additional allocation is necessary, and also change the interface to support backend and allocator keywords
- Make sure that the TensorOperations backends and allocator info are passed on and correctly handled all the way down to the lowest level (e.g. add_...)
- Add scheduler keyword options, both for operations that act at the matrix block level and for operations that act at the fusion tree block level.
The basic benchmarks from the benchmark folder seem to indicate that (contraction) performance should remain approximately the same. It will be worthwhile to combine this with Bumper.jl, as in that case there should be almost no remaining allocations (Bumper allocates the data part, and the structure part is now cached), and to try some of the more challenging cases with several symmetries combined, which was one of the motivations for caching the structure info.
I think this is ready to be merged. I'll leave it open until the end of the weekend to await other comments. The updated docs can be viewed here: https://jutho.github.io/TensorKit.jl/previews/PR140/
I haven't investigated it further, but I see a significant (10x; 1 ms to 8 ms) performance decrease when switching from v0.12.7 to this branch. I also expected to need even fewer matmuls than I still do on this branch.
using TensorKit, BenchmarkTools
include("submult.jl")
let
L1 = Vect[(Irrep[U₁] ⊠ Irrep[SU₂] ⊠ FermionParity)]((-4, 0, 0)=>3, (-6, 0, 0)=>24, (-8, 0, 0)=>34, (-10, 0, 0)=>12, (-12, 0, 0)=>1, (-4, 1, 0)=>2, (-6, 1, 0)=>29, (-8, 1, 0)=>42, (-10, 1, 0)=>13, (-6, 2, 0)=>2, (-8, 2, 0)=>4, (-5, 1/2, 1)=>18, (-7, 1/2, 1)=>53, (-9, 1/2, 1)=>34, (-11, 1/2, 1)=>3, (-5, 3/2, 1)=>4, (-7, 3/2, 1)=>18, (-9, 3/2, 1)=>8)
L2 = Vect[(Irrep[U₁] ⊠ Irrep[SU₂] ⊠ FermionParity)]((0, 0, 0)=>1, (2, 0, 0)=>1, (1, 1/2, 1)=>1)
R1 = Vect[(Irrep[U₁] ⊠ Irrep[SU₂] ⊠ FermionParity)]((-4, 0, 0)=>3, (-6, 0, 0)=>24, (-8, 0, 0)=>34, (-10, 0, 0)=>12, (-12, 0, 0)=>1, (-4, 1, 0)=>2, (-6, 1, 0)=>29, (-8, 1, 0)=>42, (-10, 1, 0)=>13, (-6, 2, 0)=>2, (-8, 2, 0)=>4, (-5, 1/2, 1)=>18, (-7, 1/2, 1)=>53, (-9, 1/2, 1)=>34, (-11, 1/2, 1)=>3, (-5, 3/2, 1)=>4, (-7, 3/2, 1)=>18, (-9, 3/2, 1)=>8)
R2 = Vect[(Irrep[U₁] ⊠ Irrep[SU₂] ⊠ FermionParity)]((0, 0, 0)=>1, (2, 0, 0)=>1, (1, 1/2, 1)=>1)
R3 = Vect[(Irrep[U₁] ⊠ Irrep[SU₂] ⊠ FermionParity)]((-1, 1/2, 1)=>1)
R4 = Vect[(Irrep[U₁] ⊠ Irrep[SU₂] ⊠ FermionParity)]((-4, 0, 0)=>3, (-6, 0, 0)=>25, (-8, 0, 0)=>34, (-10, 0, 0)=>12, (-12, 0, 0)=>1, (-4, 1, 0)=>2, (-6, 1, 0)=>31, (-8, 1, 0)=>44, (-10, 1, 0)=>13, (-6, 2, 0)=>2, (-8, 2, 0)=>5, (-5, 1/2, 1)=>18, (-7, 1/2, 1)=>50, (-9, 1/2, 1)=>30, (-11, 1/2, 1)=>2, (-5, 3/2, 1)=>4, (-7, 3/2, 1)=>20, (-9, 3/2, 1)=>6)
R5 = Vect[(Irrep[U₁] ⊠ Irrep[SU₂] ⊠ FermionParity)]((0, 0, 0)=>1, (2, 0, 0)=>1, (1, 1/2, 1)=>1)'
T1 = TensorMap{Float64}(undef,(L1 ⊗ L2) ← (R1 ⊗ R2 ⊗ R3))*0 # I'd like to init with rand, but that is not yet implemented?
T2 = TensorMap{Float64}(undef,(R1 ⊗ R2) ← (R4 ⊗ R5))*0;
T3 = TensorMap{Float64}(undef,(L1 ⊗ L2) ← (R4 ⊗ R5 ⊗ R3))*0;
#T1 = TensorMap(rand,Float64,(L1 ⊗ L2) ← (R1 ⊗ R2 ⊗ R3))
#T2 = TensorMap(rand,Float64,(R1 ⊗ R2) ← (R4 ⊗ R5));
#T3 = TensorMap(rand,Float64,(L1 ⊗ L2) ← (R4 ⊗ R5 ⊗ R3));
lsm = LeftSubMult(space(T1),space(T2))
@benchmark begin
rmul!($T3,false)
$lsm($T3,$T1,$T2)
end
end
The code that I have is convoluted, but the thing I'm timing is extremely straightforward, and can be found at the bottom of submult.txt (attached as txt - github refuses .jl)
I might be missing something, but I quickly looked at your file and it seems that at the bottom you are using sectors to index into the data field of the tensor, which should now be a dense vector, so I don't think that can work?
EDIT: I also don't expect to see any performance gain in what you have there. In some sense it is a bit unfair: you crafted your table such that it makes use of efficiently indexing first a block and then subparts of that block, while now it would make more sense to immediately take a view of the underlying data in one go.
I completely forgot that the underlying data field also changed! The speedup would come from the re-ordered fusion trees, but I'll need to change the code a little bit.
I'll also take a look at this asap. Serious performance loss should indeed not happen and would have to be resolved before merging this.
Regarding the random initialization; the new syntax is:
rand(Float64,(L1 ⊗ L2) ← (R1 ⊗ R2 ⊗ R3))
That was actually in another PR from @lkdvos, but I did implement deprecation warnings for that change in this PR, so in principle the old approach should still work? I see what is wrong: the old syntax was TensorMap(rand, Float64, ...). This is still supported, but in deprecation mode. I don't think TensorMap{Float64}(rand, ...) was ever supported.
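To spell out the initialization variants mentioned here (my reading of the current state of the branch, reusing the spaces from the benchmark snippet above):

W = (L1 ⊗ L2) ← (R1 ⊗ R2 ⊗ R3)

T = rand(Float64, W)              # new syntax
T = TensorMap(rand, Float64, W)   # old syntax; still works, but goes through the deprecation path
T = TensorMap{Float64}(undef, W)  # allocates uninitialized data, as used in the benchmark above
# TensorMap{Float64}(rand, W)     # never supported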
I was testing an old version of jh/tensorstructure (with some extra local changes). I will reimplement on the latest master today; it really should be strictly faster (fewer matmuls and less dictionary indexing).
With latest master you mean the latest version of this branch, right?
Yep. I will also check if the blockstructure caching gives a sufficient speedup to do away with some of the uglier things in mpskitexperimental/qchem, but that will take more time. In general I'm always a bit afraid of the possible performance hit of blocking caches.
I am also suspicious of the blocking cache, so a benchmark in an actual test case that involves several tasks/threads would be great. That's why there is also a task-local cache implementation, though there is not yet functionality for readily selecting it, other than replacing the existing method by doing

function TensorKit.CacheStyle(I::Type{<:Sector})
    return TaskLocalCache{Dict{Any,Any}}()
end
Yet another alternative would be a global cache based on https://github.com/JuliaConcurrent/MultiThreadedCaches.jl
My updated submult is even more of a mess, but:
- it now indeed only does 27 mul! calls instead of 281
- it is only a little bit slower (1 ms -> 1.1 ms)

The slowdown might be due to bad alignment or something more subtle. It is not significant enough to care about, but it is quite surprising nevertheless.