Limit on MPI allocation size is too strict

Open shoshijak opened this issue 4 years ago • 9 comments

DBCSR implements a limit on the amount of data MPI can allocate in one go here.

Seemingly, this limit is due to the fact that MPI functions take an int argument for the element count, thereby limiting this value to INT_MAX or less; see an explanation here.
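As a standalone illustration of that constraint (a minimal sketch with hypothetical names, not DBCSR code): a 64-bit element count has to fit into a default INTEGER before a single MPI call can handle it.

   PROGRAM count_limit
      USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: INT32, INT64
      IMPLICIT NONE
      INTEGER(INT64) :: nelems

      nelems = 3_INT64*1024*1024*1024      ! 3 Gi elements, more than HUGE(1_INT32)
      IF (nelems > INT(HUGE(1_INT32), INT64)) THEN
         PRINT *, "element count does not fit in a default INTEGER; one MPI call cannot take it"
      END IF
   END PROGRAM count_limit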


Test case

Running

  • CP2K built without GPU support
  • commit git:d09bae1 from @pseewald's cp2k branch
  • Input file: custom H2O-HFX.inp from @pseewald
  • on Piz Daint, GPU partition, on 4 nodes, 1 MPI rank/node, 12 OMP threads / MPI rank
  • MPI from module cray-mpich/7.7.10

As expected, I receive the following DBCSR abort message:

MPI cannot allocate more than 2 GiByte

However, if I comment out this abort and re-run the same executable with the same input, the program runs to completion without an issue.

It must be that the limit on the MPI allocation size is set unnecessarily low.


Tentative solution

The limitation on message size in MPI_ALLOC_MEM does not come from the message size per se, but from the fact that MPI uses integers (more particularly, INTEGER(KIND=MPI_ADDRESS_KIND), if I understand correctly) of only 32 bits to encode the size argument. Therefore, size cannot be larger than 2^31.

But if so, then shouldn't the limit be HUGE(INT(1, KIND=MPI_ADDRESS_KIND)) instead of HUGE(INT(1, KIND=int_4)), for example, like so:

--- a/src/mpi/dbcsr_mpiwrap.F
+++ b/src/mpi/dbcsr_mpiwrap.F
@@ -67,7 +67,7 @@ MODULE dbcsr_mpiwrap
    INTEGER, PARAMETER, PUBLIC :: mp_status_size = MPI_STATUS_SIZE
    INTEGER, PARAMETER, PUBLIC :: mp_proc_null = MPI_PROC_NULL
    ! Set max allocatable memory by MPI to 2 GiByte
-   INTEGER(KIND=MPI_ADDRESS_KIND), PARAMETER, PRIVATE :: mp_max_memory_size = HUGE(INT(1, KIND=int_4))
+   INTEGER(KIND=MPI_ADDRESS_KIND), PARAMETER, PRIVATE :: mp_max_memory_size = HUGE(INT(1, KIND=MPI_ADDRESS_KIND))

 #if __MPI_VERSION > 2
    INTEGER, PARAMETER, PUBLIC :: mp_max_library_version_string = MPI_MAX_LIBRARY_VERSION_STRING
@@ -6079,8 +6079,8 @@ CONTAINS
       length = MAX(len, 1)
       CALL MPI_TYPE_SIZE(${mpi_type1}$, size, ierr)
       mp_size = INT(length, KIND=MPI_ADDRESS_KIND)*size
-      IF (mp_size .GT. mp_max_memory_size) THEN
-         DBCSR_ABORT("MPI cannot allocate more than 2 GiByte")
+      IF (mp_size .GE. mp_max_memory_size) THEN
+         DBCSR_ABORT("MPI cannot allocate more than 2 GigaElements")
       ENDIF
       mp_info = MPI_INFO_NULL
       CALL MPI_ALLOC_MEM(mp_size, mp_info, mp_baseptr, mp_res)
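For reference, a small standalone sketch (hypothetical program, not part of the patch) of the two HUGE() values this diff swaps, assuming int_4 is a 4-byte integer and MPI_ADDRESS_KIND is 8 bytes on a 64-bit platform:

   PROGRAM huge_values
      USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: INT32, INT64
      IMPLICIT NONE
      ! int_4 corresponds to a 4-byte integer; MPI_ADDRESS_KIND is typically 8 bytes on 64-bit platforms
      PRINT *, HUGE(1_INT32)   ! 2147483647          -> the current ~2 GiByte cap
      PRINT *, HUGE(1_INT64)   ! 9223372036854775807 -> effectively no cap
   END PROGRAM huge_values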

shoshijak avatar Mar 03 '20 14:03 shoshijak

The size check should presumably be written as

IF (length .GT. mp_max_memory_size/size) THEN
   DBCSR_ABORT("MPI cannot allocate more than 2 GigaElements or 2 GB on 32 bit systems")
ENDIF

to avoid overflow. Also

! Set max allocatable memory by MPI to 2 GiByte on 32 bit systems
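A tiny standalone sketch (hypothetical names, not the DBCSR routine) of why dividing first matters: the product length*size can wrap around the integer range, while comparing length against max/size never leaves it.

   PROGRAM overflow_safe_check
      USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: INT64
      IMPLICIT NONE
      INTEGER(INT64), PARAMETER :: max_bytes = HUGE(1_INT64)
      INTEGER(INT64) :: length
      INTEGER :: size_in_bytes

      length = HUGE(1_INT64)/4      ! a very large element count
      size_in_bytes = 8             ! e.g. double precision
      ! length*size_in_bytes would overflow; length > max_bytes/size_in_bytes stays in range
      IF (length > max_bytes/size_in_bytes) THEN
         PRINT *, "requested allocation exceeds the maximum MPI allocation size"
      END IF
   END PROGRAM overflow_safe_check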

@alazzaro is this patch something for version 2.1.0?

vondele avatar Mar 04 '20 17:03 vondele

Well, the length of the buffers is an int_4 (I mean nze elements, for instance). We would need to change other places too to get the full benefit of sending more than 2 GB... My current plan is to have 2.1 for multi-GPU (ETA ~1 week). I would prefer to leave this problem for 2.2...

alazzaro avatar Mar 04 '20 17:03 alazzaro

Well, the difference is being able to put 2 GB or 16 GB (for doubles) per rank on the device. All of DBCSR does its counting in elements; only this place is in bytes (AFAICT)... As we should aim for 1 MPI rank per device (or 1 per node...), this difference in memory usage might be important.
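The arithmetic behind that difference, as a standalone sketch (2147483647 being the int_4 limit discussed above):

   PROGRAM elements_vs_bytes
      USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: INT64
      IMPLICIT NONE
      INTEGER(INT64), PARAMETER :: max_count = 2147483647_INT64   ! HUGE of a 4-byte integer

      ! a 2^31-1 byte limit caps a buffer at ~2 GiB, while counting the same limit in
      ! double-precision elements (8 bytes each) allows ~16 GiB per rank
      PRINT *, 'limit counted in bytes    (GiB):', REAL(max_count)/2.0**30
      PRINT *, 'limit counted in elements (GiB):', REAL(max_count)*8.0/2.0**30
   END PROGRAM elements_vs_bytes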

It is a much larger task to move all of DBCSR to int_8 (especially because many supporting libraries like MPI, BLAS, etc. commonly don't provide 64-bit integer interfaces), so much so that I don't think it can be done in a couple of months, or should be attempted lightly.

Of course, up to you to pick the most reasonable release schedule.

vondele avatar Mar 04 '20 20:03 vondele

I agree with the change. I was not referring to moving "everything" to int_8 (yes, that would be a lot of work). My point is that there are some places where we have to check whether we overflow. For instance, assuming you do allocate data of that size, I've checked where we use the size in bytes:

> grep -R dbcsr_datatype_sizeof *

For example, file block/dbcsr_block_operations.F:

      IF (acc_devmem_allocated(area%d%acc_devmem)) THEN
         IF (PRESENT(value)) &
            DBCSR_ABORT("dbcsr_data_clear0 with value not implemented for acc_devmem")
         s = dbcsr_datatype_sizeof(area%d%data_type)
         CALL acc_devmem_setzero_bytes(area%d%acc_devmem, s*l, s*u, area%d%memory_type%acc_stream)
      END IF

Here s and l are int_4, so the product s*l will likely overflow.
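A minimal standalone sketch of one possible guard for such spots (an assumption, not an actual DBCSR patch): promote the int_4 operands to a 64-bit kind before multiplying, so the byte count cannot wrap.

   PROGRAM promote_before_multiply
      USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: INT32, INT64
      IMPLICIT NONE
      INTEGER(INT32) :: s, l
      INTEGER(INT64) :: nbytes

      s = 8              ! bytes per element, as dbcsr_datatype_sizeof would return for doubles
      l = 400000000      ! a large element count; s*l no longer fits in an int_4
      nbytes = INT(s, INT64)*INT(l, INT64)   ! promote first, then multiply: no wrap-around
      PRINT *, nbytes    ! 3200000000, larger than HUGE(1_INT32)
   END PROGRAM promote_before_multiply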

alazzaro avatar Mar 04 '20 21:03 alazzaro

OK, right... so it needs a careful audit.

vondele avatar Mar 04 '20 21:03 vondele

For reference, I've now turned MPI allocation off by default. It appears to be useless (or to make things worse) in most cases...

alazzaro avatar Mar 11 '20 08:03 alazzaro

OK, but now the other locations that you pointed out, where bytes are counted using 32-bit integers, might fail silently.

vondele avatar Mar 11 '20 08:03 vondele

Yes, the issue is still open for that... Just changed the milestone to 2.1...

alazzaro avatar Mar 11 '20 09:03 alazzaro

Has there been any progress on this? The limit on the allocation size can be quite cumbersome, especially for tensors. Should I have a go at it?

ambmax00 avatar Sep 13 '21 06:09 ambmax00