Limit on MPI allocation size is too strict
DBCSR implements a limit on the amount of data MPI can allocate in one go here.
Seemingly, this limit is due to the fact that MPI functions take an int argument for the element count, thereby limiting this value to INT_MAX or less; see an explanation here.
Test case
Running:

- CP2K built without GPU support
- commit `git:d09bae1` from @pseewald's cp2k branch
- Input file: custom `H2O-HFX.inp` from @pseewald
- on Piz Daint, GPU partition, on 4 nodes, 1 MPI rank/node, 12 OMP threads/MPI rank
- MPI from module `cray-mpich/7.7.10`
As expected, I receive the following DBCSR abort message:

```
MPI cannot allocate more than 2 GiByte
```
However, if I comment out this abort and re-run the same executable with the same input, the program runs to completion without an issue.
It must be that the limit on the MPI allocation size is set unnecessarily low.
Tentative solution
The limitation on the message size in `MPI_ALLOC_MEM` does not come from the message size per se, but from the fact that MPI encodes the `size` argument as an integer (more particularly, an `INTEGER(KIND=MPI_ADDRESS_KIND)`, if I understand correctly), which is only 32 bits wide on 32-bit systems. In that case, `size` cannot be larger than 2^31 - 1.
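On typical 64-bit platforms, however, `MPI_ADDRESS_KIND` is a 64-bit kind, so the two bounds differ by a factor of about 2^32. A minimal standalone sketch (not DBCSR code; it assumes an MPI library providing the `mpi` module, with `int32` standing in for DBCSR's `int_4`) makes the gap explicit:

```fortran
! Compare the 32-bit cap with the bound actually imposed by the kind
! of the MPI_ALLOC_MEM size argument. Build with, e.g., mpif90.
PROGRAM size_bounds
   USE mpi, ONLY: MPI_ADDRESS_KIND
   USE, INTRINSIC :: iso_fortran_env, ONLY: int32
   IMPLICIT NONE
   ! 2**31 - 1 = 2147483647 bytes: the current "2 GiByte" cap
   PRINT *, 'HUGE(int_4)                    =', HUGE(1_int32)
   ! 2**63 - 1 on platforms where MPI_ADDRESS_KIND is 64-bit
   PRINT *, 'HUGE(INT(1, MPI_ADDRESS_KIND)) =', HUGE(INT(1, KIND=MPI_ADDRESS_KIND))
END PROGRAM size_bounds
```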
But if so, then shouldn't the limit be `HUGE(INT(1, KIND=MPI_ADDRESS_KIND))` instead of `HUGE(INT(1, KIND=int_4))`? For example:
```diff
--- a/src/mpi/dbcsr_mpiwrap.F
+++ b/src/mpi/dbcsr_mpiwrap.F
@@ -67,7 +67,7 @@ MODULE dbcsr_mpiwrap
    INTEGER, PARAMETER, PUBLIC :: mp_status_size = MPI_STATUS_SIZE
    INTEGER, PARAMETER, PUBLIC :: mp_proc_null = MPI_PROC_NULL
    ! Set max allocatable memory by MPI to 2 GiByte
-   INTEGER(KIND=MPI_ADDRESS_KIND), PARAMETER, PRIVATE :: mp_max_memory_size = HUGE(INT(1, KIND=int_4))
+   INTEGER(KIND=MPI_ADDRESS_KIND), PARAMETER, PRIVATE :: mp_max_memory_size = HUGE(INT(1, KIND=MPI_ADDRESS_KIND))
 #if __MPI_VERSION > 2
    INTEGER, PARAMETER, PUBLIC :: mp_max_library_version_string = MPI_MAX_LIBRARY_VERSION_STRING
@@ -6079,8 +6079,8 @@ CONTAINS
       length = MAX(len, 1)
       CALL MPI_TYPE_SIZE(${mpi_type1}$, size, ierr)
       mp_size = INT(length, KIND=MPI_ADDRESS_KIND)*size
-      IF (mp_size .GT. mp_max_memory_size) THEN
-         DBCSR_ABORT("MPI cannot allocate more than 2 GiByte")
+      IF (mp_size .GE. mp_max_memory_size) THEN
+         DBCSR_ABORT("MPI cannot allocate more than 2 GigaElements")
       ENDIF
       mp_info = MPI_INFO_NULL
       CALL MPI_ALLOC_MEM(mp_size, mp_info, mp_baseptr, mp_res)
```
should presumably be written as

```fortran
IF (length .GT. mp_max_memory_size/size) THEN
   DBCSR_ABORT("MPI cannot allocate more than 2 GigaElements or 2 GB on 32 bit systems")
```

to avoid overflow. Also, the comment should then read:

```fortran
! Set max allocatable memory by MPI to 2 GiByte on 32 bit systems
```
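To make the overflow concern concrete, here is a minimal standalone sketch (not DBCSR code; `int32` arithmetic stands in for a platform where `MPI_ADDRESS_KIND` is 32 bits, and the values are made up) showing that the division-based guard triggers where the plain product would have wrapped:

```fortran
! Guard against a byte count that overflows 32-bit arithmetic by
! dividing first instead of multiplying.
PROGRAM guard_demo
   USE, INTRINSIC :: iso_fortran_env, ONLY: int32
   IMPLICIT NONE
   INTEGER(int32), PARAMETER :: mp_max_memory_size = HUGE(1_int32)
   INTEGER(int32) :: length, size
   length = 400000000_int32   ! 4e8 elements
   size = 8_int32             ! bytes per double-precision element
   ! length*size = 3.2e9 exceeds HUGE(int_4), so the product computed in
   ! 32-bit arithmetic would wrap; every intermediate of the division-based
   ! test below stays in range.
   IF (length .GT. mp_max_memory_size/size) THEN
      PRINT *, 'abort: request exceeds the cap'
   ELSE
      PRINT *, 'request fits under the cap'
   END IF
END PROGRAM guard_demo
```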
@alazzaro is his patch something for version 2.1.0?
Well, the length of the buffers is an int_4 (I mean `nze` elements, for instance). We need to change other places too to get the full benefit of sending more than 2 GB... My current plan is to have 2.1 for multi-GPU (ETA ~1 week). I would prefer to leave this problem for 2.2...
Well, the difference is being able to put 2 GB or 16 GB (for doubles: 2^31 elements at 8 bytes each) per rank on the device. All of DBCSR does its counting in elements; only this place is in bytes (AFAICT)... As we should aim for 1 MPI rank per device (or 1 per node...), this difference in memory usage might be important.
It is a much larger task to move all of DBCSR to int_8 (especially because many supporting libraries like MPI, BLAS, etc. commonly don't have such interfaces), so much so that I don't think it can be done in a couple of months, or should be attempted lightly.
Of course, up to you to pick the most reasonable release schedule.
I agree with the change. I was not referring to moving "everything" to int_8 (yes, a lot of work). My point is that there may be places where we have to check whether we overflow. For instance, assuming that you make this change and allocate that much data, I've checked where we use the size in bytes:
```shell
grep -R dbcsr_datatype_sizeof *
```

For example, in file `block/dbcsr_block_operations.F`:
```fortran
IF (acc_devmem_allocated(area%d%acc_devmem)) THEN
   IF (PRESENT(value)) &
      DBCSR_ABORT("dbcsr_data_clear0 with value not implemented for acc_devmem")
   s = dbcsr_datatype_sizeof(area%d%data_type)
   CALL acc_devmem_setzero_bytes(area%d%acc_devmem, s*l, s*u, area%d%memory_type%acc_stream)
END IF
```
Here `s` and `l` are int_4, so the product `s*l` will likely overflow.
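A minimal standalone sketch (not DBCSR code; the values are made up for illustration) of how such a product fails silently rather than loudly:

```fortran
! Multiplying two int_4 byte counts past 2**31 - 1 wraps around on
! two's-complement hardware instead of raising an error.
PROGRAM byte_count_overflow
   USE, INTRINSIC :: iso_fortran_env, ONLY: int32, int64
   IMPLICIT NONE
   INTEGER(int32) :: s, l
   s = 8_int32           ! bytes per element, as dbcsr_datatype_sizeof would return
   l = 300000000_int32   ! 3e8 elements, i.e. 2.4 GB worth of doubles
   PRINT *, 's*l widened to int_8 first:', INT(s, int64)*l   ! 2400000000
   PRINT *, 's*l in int_4 arithmetic   :', s*l               ! wraps to a negative count
END PROGRAM byte_count_overflow
```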
OK, right... so it needs a careful audit.
For reference, I've now made MPI allocation off by default. It appears to be useless (or to make things worse) in most cases...
OK, but now the other locations that you pointed out, where bytes are counted using 32-bit integers, might fail silently.
Yes, the issue is still open for that... Just changed the milestone to 2.1...
Has there been any progress on this? The limit on the allocation size can be quite cumbersome, especially for tensors. Should I have a go at it?