DBCSR tensor batching
Hi, I have a few questions regarding how to batch tensors.
In my program, tensor contractions are batched in chunks, similarly to how it is done in CP2K, here. The chunk sizes are given by batch_size in the input file, e.g. 10 GB. But following issue #304, a single process can only allocate 2 GB at most, so batch sizes above that threshold give errors (segmentation faults) if not enough processes are used. If I set my memory cut low enough, it works and the program executes normally.
So what do you do in that case? Just make sure to batch tensor allocations and contractions in chunks of at most 2 GB per process? Or just make sure to have enough processes so that the memory cut stays within that limit? What's the best way to figure out batch sizes?
The best way to reduce memory allocation per MPI rank is to use 1 OpenMP thread per rank, which should also give the best performance. Then you should choose a large enough number of MPI ranks that your calculation does not run out of memory (and that #304 is not a problem). If you are limited by the available memory, or if scaling with the number of MPI ranks becomes a problem, you can increase the number of batches to be able to run on fewer cores.
Total execution time in terms of node hours should not be affected negatively if you choose a small number of batches for each tensor dimension (in CP2K RPA the default is 5). For load balancing it's better to split several tensor dimensions into a small number of batches than to split just one dimension. So I recommend setting some default number of batches (which means reducing the default memory footprint of your code "for free").
I prefer to think in terms of the number of batches instead of the batch size. You may argue that the batch size could be chosen automatically by the code according to the available memory, but there is no way to predict the memory requirement/occupancy of each tensor when tensors are sparse. Also, since too many batches cause a loss of performance, it's better to control the number of batches instead of the batch size.
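To make this concrete, here is a minimal sketch of such a batched loop, splitting two uncontracted dimensions into a small number of batches. This is not from the thread: the batch boundary arrays (i_lo/i_hi/j_lo/j_hi) are illustrative bookkeeping, and the module and argument names follow the dbcsr_t_contract interface of dbcsr_tensor_api as I understand it, so check the API for the exact signature.

```fortran
! Sketch (assumptions, not verbatim API usage): contract t2("ijk") with t1("kl")
! into t3("ijl"), splitting the uncontracted indices i and j of t2 into batches.
SUBROUTINE contract_in_batches(t1, t2, t3, i_lo, i_hi, j_lo, j_hi)
   USE dbcsr_tensor_api, ONLY: dbcsr_t_type, dbcsr_t_contract
   IMPLICIT NONE
   TYPE(dbcsr_t_type), INTENT(INOUT) :: t1, t2, t3
   INTEGER, DIMENSION(:), INTENT(IN) :: i_lo, i_hi, j_lo, j_hi ! batch boundaries
   INTEGER :: ib, jb
   INTEGER, DIMENSION(2, 2) :: bounds_2 ! ranges of the uncontracted indices of t2

   DO ib = 1, SIZE(i_lo)
      DO jb = 1, SIZE(j_lo)
         bounds_2(:, 1) = [i_lo(ib), i_hi(ib)]
         bounds_2(:, 2) = [j_lo(jb), j_hi(jb)]
         ! beta = 1 accumulates the contribution of each batch into t3
         CALL dbcsr_t_contract(alpha=1.0D0, tensor_1=t2, tensor_2=t1, &
                               beta=1.0D0, tensor_3=t3, &
                               contract_1=[3], notcontract_1=[1, 2], &
                               contract_2=[1], notcontract_2=[2], &
                               map_1=[1, 2], map_2=[3], &
                               bounds_2=bounds_2)
      END DO
   END DO
END SUBROUTINE contract_in_batches
```

Splitting both i and j into, say, 5 batches each gives 25 smaller contractions with a correspondingly smaller per-step memory footprint.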
Thanks for the answer. It really makes more sense to just split the dimensions, so I will do that too. By the way, are tensors held in core memory, e.g. the 3c2e integrals? I saw that there are compress/decompress subroutines, but those do not write to disk, as far as I can see. I guess reading from disk is a bit expensive, right?
Yes, like DBCSR matrices, tensors are held in core memory. The compress/decompress subroutines were adapted from our Hartree-Fock code and are not part of DBCSR. I think there is a way in DBCSR to write to disk (dbcsr_binary_write), but I have never used it; it would probably be easy to extend it to tensors.
Hi, me again.
Question: until now, I have always explicitly reordered the tensors with dbcsr_t_copy to make them compatible for tensor contractions. But is there a difference if I don't reorder them prior to contraction? If I do a batched contraction and give bounds1, bounds2, ... to dbcsr_t_contract, does it reorder the whole tensor or just the bounded/cropped tensor?
Another question: when contracting batch-wise, is there any time/memory penalty in using multiple split tensors, i.e.

```fortran
TYPE(dbcsr_t_type), DIMENSION(number_of_batches) :: tensor_array
```

instead of a whole tensor with bounds? It would be very useful for me, because I'm writing a C++ batched tensor class which is capable of either holding the tensors in core memory, reading them from disk, or computing the batches directly. Using a simple tensor vector would make it possible to write just one algorithm for all three cases (core/disk/direct).
But is there a difference if I don't reorder them prior to contraction?
If performance matters, you should do the reordering yourself. I decided to leave it up to the user to convert tensors to a contraction-compatible layout, because if this were fully internal, each tensor might be reordered in every contraction, which would be rather costly. Also, the internal optimizations for batched contraction require contraction-compatible tensor layouts. This reminds me to implement a warning to notify users when tensors are not compatible.
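For illustration, an up-front reordering could look like the line below. This is a sketch, not verbatim API usage: it assumes dbcsr_t_copy accepts an order argument describing the index permutation (as in the DBCSR tensor API) and that the target tensor was created with the permuted block structure beforehand; check the dbcsr_t_copy interface for the exact semantics.

```fortran
! Hypothetical one-time reordering of t_abc("ijk") into t_kij("kij") before a
! series of contractions, so no per-contraction reordering is needed.
CALL dbcsr_t_copy(t_abc, t_kij, order=[3, 1, 2])
```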
If I do a batched contraction, and give bounds1, bounds2 ... to dbcsr_t_contract, does it reorder the whole tensor or just the bounded/cropped tensor?
Only the part of the tensor that is within the given bounds.
when contracting batch-wise, is there any time/memory penalty in using multiple, split tensors
The optimizations for batched contraction assume that the same tensor objects always participate in the contraction steps, because the parallel layout is optimized over the course of a contraction. So your suggestion of using multiple tensor objects would not work.
Batched contraction now has side effects that may conflict with the way you're currently using it: Basically only the operations dbcsr_t_contract and dbcsr_t_copy are safe to use within a batched contraction scope, since the parallel layout of all tensors may change, and the result tensor may not be updated until dbcsr_t_batched_contract_finalize is called. If you do other operations (such as reading/writing to disk), you should use different tensor objects that don't participate in the contraction. Then you can use vectors of tensors for storing/accessing the data and three separate tensor objects for performing a contraction (transferring the data with dbcsr_t_copy), as in the sketch below.
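A minimal sketch of that pattern, with all names illustrative (t1_store as the storage vector; t1_work/t2_work/t3_work as the fixed contraction tensors). It assumes per-tensor init/finalize calls and the dbcsr_t_contract argument names of dbcsr_tensor_api; treat both as assumptions to verify against the API.

```fortran
! Sketch: data lives in a vector of storage tensors, while every contraction
! step reuses the same three working tensors, so DBCSR can optimize their
! parallel layout over the whole batched scope.
SUBROUTINE contract_from_storage(t1_store, t1_work, t2_work, t3_work)
   USE dbcsr_tensor_api, ONLY: dbcsr_t_type, dbcsr_t_copy, dbcsr_t_contract, &
                               dbcsr_t_batched_contract_init, &
                               dbcsr_t_batched_contract_finalize
   IMPLICIT NONE
   TYPE(dbcsr_t_type), DIMENSION(:), INTENT(INOUT) :: t1_store
   TYPE(dbcsr_t_type), INTENT(INOUT) :: t1_work, t2_work, t3_work
   INTEGER :: ib

   CALL dbcsr_t_batched_contract_init(t1_work)
   CALL dbcsr_t_batched_contract_init(t2_work)
   CALL dbcsr_t_batched_contract_init(t3_work)
   DO ib = 1, SIZE(t1_store)
      ! Stage the current batch into the working tensor; inside the batched
      ! scope, only dbcsr_t_copy and dbcsr_t_contract are safe on these tensors.
      CALL dbcsr_t_copy(t1_store(ib), t1_work)
      CALL dbcsr_t_contract(alpha=1.0D0, tensor_1=t1_work, tensor_2=t2_work, &
                            beta=1.0D0, tensor_3=t3_work, &
                            contract_1=[1], notcontract_1=[2, 3], &
                            contract_2=[1], notcontract_2=[2], &
                            map_1=[1, 2], map_2=[3])
   END DO
   ! The accumulated result in t3_work may not be complete before finalize.
   CALL dbcsr_t_batched_contract_finalize(t1_work)
   CALL dbcsr_t_batched_contract_finalize(t2_work)
   CALL dbcsr_t_batched_contract_finalize(t3_work)
END SUBROUTINE contract_from_storage
```

Disk I/O or on-the-fly computation of batches would then happen on the t1_store tensors, outside the batched scope of the working tensors.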
Right now the feature of batched contraction is experimental and a bit confusing, since it's not easy to understand when to call dbcsr_t_batched_contract_init and dbcsr_t_batched_contract_finalize. I'll improve this towards the end of the year; it should then be easy to use.
Thanks for your help
Batched contraction now has side effects that may conflict with the way you're currently using it
I actually don't use dbcsr_t_batched_contract_init or dbcsr_t_batched_contract_finalize for large (split) tensors, because I was aware of the dangers from looking at what the subroutines do and at their description.
I only asked myself whether there would be a (big) performance hit if I used a vector of tensors without calling the batched_contract subroutines, compared to using the full tensors with the batched_contract feature.
Basically only the operations dbcsr_t_contract and dbcsr_t_copy
On the other hand, I didn't know that dbcsr_t_copy was safe too. That might actually help me.
I only asked myself whether there would be a (big) performance hit if I used a vector of tensors without calling the batched_contract subroutines, compared to using the full tensors with the batched_contract feature.
I just checked with CP2K (RPA on 256 water molecules) and found that the calculation took 40% longer when I removed the calls to batched_contract_init/finalize. You don't need to use it right now; you can have a look later once I've made it easier to use.
Is this still relevant? Please close otherwise....
Right now the feature of batched contraction is experimental and a bit confusing, since it's not easy to understand when to call dbcsr_t_batched_contract_init and dbcsr_t_batched_contract_finalize. I'll improve this towards the end of the year; it should then be easy to use.
Let's keep it open to remind me of making this feature stable.