
The simple hello-world.c MPI program prints: shmem: mmap: an error occurred while determining whether or not /tmp/ompi.yv.1001/jf.0/3074883584/sm_segment.yv.1001.b7470000.0 could be created

yurivict opened this issue 1 year ago • 19 comments

See the program below.

$ ./hello-world-1 
[xx.xx.xx:12584] shmem: mmap: an error occurred while determining whether or not /tmp/ompi.yv.1001/jf.0/3074883584/sm_segment.yv.1001.b7470000.0 could be created.

---program---

$ cat hello-world-1.c 
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>   // for getpid() and sleep()

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("> Hello world from processor %s, rank %d out of %d processors (pid=%d)\n",
           processor_name, world_rank, world_size, getpid());
    sleep(1);
    printf("< Hello world from processor %s, rank %d out of %d processors (pid=%d)\n",
           processor_name, world_rank, world_size, getpid());

    // Finalize the MPI environment.
    MPI_Finalize();
    return 0;
}

Version: openmpi-5.0.5_1
Describe how Open MPI was installed: FreeBSD package
Computer hardware: Intel CPU
Network type: Ethernet/IP (irrelevant)
Available space in /tmp: 64GB
OS: FreeBSD 14.1

yurivict avatar Aug 29 '24 23:08 yurivict

Please provide all the information from the debug issue template; thanks!

https://github.com/open-mpi/ompi/blob/main/.github/ISSUE_TEMPLATE/bug_report.md

jsquyres avatar Aug 29 '24 23:08 jsquyres

I added missing bits of information.

yurivict avatar Aug 30 '24 04:08 yurivict

The root cause could be insufficient available space in /tmp (unlikely per your description), or something going wrong when checking the size.

try running

env OMPI_MCA_shmem_base_verbose=100 ./hello-world-1

and check the output (useful message might have been compiled out though)

if there is nothing useful, you can

strace -o hw.strace -s 512 ./hello-world-1

then compress hw.strace and upload it.

ggouaillardet avatar Aug 30 '24 04:08 ggouaillardet

env OMPI_MCA_shmem_base_verbose=100 ./hello-world-1

This didn't produce anything relevant.

strace -o hw.strace -s 512 ./hello-world-1

BSDs have ktrace instead. Here is the ktrace dump: https://freebsd.org/~yuri/openmpi-kernel-dump.txt

yurivict avatar Aug 30 '24 05:08 yurivict

51253 hello-world-1 CALL  fstatat(AT_FDCWD,0x1b0135402080,0x4c316d20,0)
 51253 hello-world-1 NAMI  "/tmp/ompi.yv.0/jf.0/2909405184"
 51253 hello-world-1 RET   fstatat -1 errno 2 No such file or directory
 51253 hello-world-1 CALL  open(0x1b0135402080,0x120004<O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC>)
 51253 hello-world-1 NAMI  "/tmp/ompi.yv.0/jf.0/2909405184"
 51253 hello-world-1 RET   open -1 errno 2 No such file or directory

It looks like some directories were not created. what if you mpirun -np 1 ./hello-world-1 instead?

ggouaillardet avatar Aug 30 '24 06:08 ggouaillardet

sudo mpirun -np 1 ./hello-world-1 prints the same error message:

It appears as if there is not enough space for /dev/shm/sm_segment.yv.0.9f060000.0 (the shared-memory backing
file). It is likely that your MPI job will now either abort or experience
performance degradation.

The log doesn't have any mkdir operations, so that "/tmp/ompi.yv.0" was never created.

yurivict avatar Aug 30 '24 07:08 yurivict

Well, this is a different message than the one used when opening this issue. And this one is self-explanatory.

Anyway, what if you

env OMPI_MCA_shmem_mmap_backing_file_base_dir=/tmp ./hello-world-1

or you can simply increase the size of /dev/shm

ggouaillardet avatar Aug 30 '24 07:08 ggouaillardet

sudo OMPI_MCA_shmem_mmap_backing_file_base_dir=/tmp ./hello-world-1 produces the same error messages.

This message is for a regular user:

$ OMPI_MCA_shmem_mmap_backing_file_base_dir=/tmp ./hello-world-1
[yv.noip.me:88431] shmem: mmap: an error occurred while determining whether or not /tmp/ompi.yv.1001/jf.0/1653407744/sm_segment.yv.1001.628d0000.0 could be created.
> Hello world from processor yv.noip.me, rank 0 out of 1 processors (pid=88431)
< Hello world from processor yv.noip.me, rank 0 out of 1 processors (pid=88431)

This message is for root:

# OMPI_MCA_shmem_mmap_backing_file_base_dir=/tmp ./hello-world-1
--------------------------------------------------------------------------
It appears as if there is not enough space for /dev/shm/sm_segment.yv.0.ee540000.0 (the shared-memory backing
file). It is likely that your MPI job will now either abort or experience
performance degradation.

  Local host:  yv
  Space Requested: 16777216 B
  Space Available: 1024 B
--------------------------------------------------------------------------
> Hello world from processor yv.noip.me, rank 0 out of 1 processors (pid=88929)
< Hello world from processor yv.noip.me, rank 0 out of 1 processors (pid=88929)

yurivict avatar Aug 30 '24 07:08 yurivict

I see.

try adding OMPI_MCA_btl_sm_backing_directory=/tmp and see how it works

ggouaillardet avatar Aug 30 '24 08:08 ggouaillardet

The error messages disappear when OMPI_MCA_btl_sm_backing_directory=/tmp is used.

yurivict avatar Aug 30 '24 08:08 yurivict

We have seen and responded to this problem many times - I believe it is included in the docs somewhere. The problem is that BSD (mostly as seen on Mac) has created a default TMPDIR that is incredibly long. So when we add our tmpdir prefix (to avoid stepping on other people's tmp), the result is longer than the path length limits.

Solution: set TMPDIR in your environment to point to some shorter path, typically something like $HOME/tmp.

rhc54 avatar Aug 30 '24 12:08 rhc54

[...] a default TMPDIR that is incredibly long [...]

What do you mean by TMPDIR? In our case TMPDIR is just /tmp.

yurivict avatar Aug 30 '24 15:08 yurivict

Indeed, it seems the root cause is something fishy related to /dev/shm

what if you df -h /dev/shm both as a user and root?

ggouaillardet avatar Aug 31 '24 03:08 ggouaillardet

$ df -h /dev/shm
Filesystem    Size    Used   Avail Capacity  Mounted on
devfs         1.0K      0B    1.0K     0%    /dev
# df -h /dev/shm
Filesystem    Size    Used   Avail Capacity  Mounted on
devfs         1.0K      0B    1.0K     0%    /dev

yurivict avatar Aug 31 '24 03:08 yurivict

That's indeed a small /dev/shm.

I still do not understand why running as a user does not get you the user-friendly message you get when running as root.

can you ktrace as a non-root user so we can figure out where the failure occurs?

ggouaillardet avatar Aug 31 '24 03:08 ggouaillardet

Here is the ktrace dump for a regular user.

yurivict avatar Aug 31 '24 04:08 yurivict

It seems regular users do not have write access to the (small size) /dev/shm and we do not display a friendly error message about it.

45163 hello-world-1 CALL  access(0x4e3d8d33,0x2<W_OK>)
 45163 hello-world-1 NAMI  "/dev/shm"
 45163 hello-world-1 RET   access -1 errno 13 Permission denied

Unless you change that, your best bet is probably to add

btl_sm_backing_directory=/tmp

to your $PREFIX/etc/openmpi-mca-params.conf

ggouaillardet avatar Aug 31 '24 05:08 ggouaillardet

Is direct access to /dev/shm new in OpenMPI? It used to work fine on FreeBSD.

How does this work on Linux? Is everybody allowed write access to /dev/shm there?

yurivict avatar Aug 31 '24 07:08 yurivict

Access to /dev/shm has a fallback in ompi, like here.

Why doesn't this fallback work then? Is it accidentally missing in some cases?

yurivict avatar Aug 31 '24 07:08 yurivict

I believe I've tried everything suggested (and then some) as evidenced by the following interactions:

(ioniser) jabowery@jaboweryML:~/devel/ioniser$ printenv |grep BulkData|grep tmp
OMPI_MCA_shmem_mmap_backing_file_base_dir=/mnt/BulkData/home/jabowery/tmp
btl_sm_backing_directory=/mnt/BulkData/home/jabowery/tmp
TMPDIR=/mnt/BulkData/home/jabowery/tmp
(ioniser) jabowery@jaboweryML:~/devel/ioniser$ tail /home/jabowery/mambaforge/envs/ioniser/etc/openmpi-mca-params.conf

# See "ompi_info --param all all --level 9" for a full listing of Open
# MPI MCA parameters available and their default values.
pml = ^ucx
osc = ^ucx
coll_ucc_enable = 0
mca_base_component_show_load_errors = 0
opal_warn_on_missing_libcuda = 0
opal_cuda_support = 0
btl_sm_backing_directory=/mnt/BulkData/home/jabowery/tmp
(ioniser) jabowery@jaboweryML:~/devel/ioniser$ tail /etc/openmpi/openmpi-mca-params.conf 
btl_base_warn_component_unused=0
# Avoid openib an in case applications use fork: see https://github.com/ofiwg/libfabric/issues/6332
# If you wish to use openib and know your application is safe, remove the following:
# Similarly for UCX: https://github.com/open-mpi/ompi/issues/8367
mtl = ^ofi
btl = ^uct,openib,ofi
pml = ^ucx
osc = ^ucx,pt2pt
btl_sm_backing_directory=/mnt/BulkData/home/jabowery/tmp
(ioniser) jabowery@jaboweryML:~/devel/ioniser$ !p
p ioniser.py
[jaboweryML:34571] shmem: mmap: an error occurred while determining whether or not /mnt/BulkData/home/jabowery/tmp/ompi.jaboweryML.1000/jf.0/121765888/shared_mem_cuda_pool.jaboweryML could be created.
[jaboweryML:34571] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728 
(ioniser) jabowery@jaboweryML:~/devel/ioniser$ whoami
jabowery
ioniser) jabowery@jaboweryML:~/devel/ioniser$ touch /mnt/BulkData/home/jabowery/tmp/accesstest.txt
(ioniser) jabowery@jaboweryML:~/devel/ioniser$ ls -altr /mnt/BulkData/home/jabowery/tmp/accesstest.txt
-rw-rw-r-- 1 jabowery jabowery 0 Nov  1 10:51 /mnt/BulkData/home/jabowery/tmp/accesstest.txt
(ioniser) jabowery@jaboweryML:~/devel/ioniser$ df /mnt/BulkData/home/jabowery/tmp
Filesystem      1K-blocks      Used  Available Use% Mounted on
/dev/nvme1n1   1921725720 692366840 1131666768  38% /mnt/BulkData
(ioniser) jabowery@jaboweryML:~/devel/ioniser$ 

jabowery avatar Nov 01 '24 15:11 jabowery

When I compile and run this test program on Arch, I, too, get the error message. When debugging why the error occurs, I found the following:

  1. A (randomly named) file in /dev/shm is (successfully) created.
  2. A backing file in /tmp is about to be created, which fails and prints the error message.

Something weird seems to be going on in shmem_mmap_module.c:

When creating the in-memory file under /dev/shm, it is located directly under /dev/shm (e.g. /dev/shm/sm_segment.kohni-mobil.1000.110f0000.0). enough_space() strips the file name off the path and checks whether there is enough space in memory, which is fine. After the check, the file is created via open() in line 347.

When creating the backing file in /tmp, it is located in a sub-directory structure (e.g. /tmp/ompi.kohni-mobil.1000/jf.0/286195712/shared_mem_cuda_pool.kohni-mobil). enough_space() is used again to check whether there is enough space. But since the function only strips the file name, and the directory structure is not yet created, opal_path_df() in path.c, line 683 fails in the call to statfs(). So, in the end, the backing file can never be created.

I guess either the directory structure needs to be created before the size check, or the size check needs to determine the base mount point instead of the (sub)directory to check the available size, or the backing file must not be created in a sub-directory...

Best, Jan

jkohnert avatar Dec 28 '24 16:12 jkohnert

Something is off here. This directory name (ompi.kohni-mobil.1000) is something that would be produced by an mpirun for OMPI v4 or earlier - it most definitely is not the name of the top-level directory used by OMPI v5 (which would look like prterun.kohni-mobil.3102.501). Looks like you are attempting to launch an app compiled against OMPI v5 using an earlier mpirun? That will not work.

rhc54 avatar Dec 29 '24 13:12 rhc54

Hm,

currently it looks like this:

jankoh@kohni-mobil untitled $ ldd cmake-build-debug/untitled | grep mpi
        libmpi.so.40 => /usr/lib/libmpi.so.40 (0x00007fc059e00000)
jankoh@kohni-mobil untitled $ pacman -Qo /usr/lib/libmpi.so.40
/usr/lib/libmpi.so.40 ist in openmpi 5.0.6-2 enthalten
jankoh@kohni-mobil untitled $ mpirun --help
mpirun (Open MPI) 5.0.6

Usage: mpirun [OPTION]...

See the mpirun(1) man page or HTML help for a detailed list of command
line options that are available.

Report bugs to https://www.open-mpi.org/community/help/
jankoh@kohni-mobil untitled $ 

I find it a bit strange that the shared lib is named *40. The CMake rules for the project are quite simple, too:

find_package(MPI QUIET REQUIRED)
add_executable(untitled main.cpp)
target_link_libraries(untitled PUBLIC MPI::MPI_CXX)

Edit: when I run the program via mpirun, the error vanishes...

jkohnert avatar Dec 29 '24 14:12 jkohnert

The ".40" is just from a libtool convention - the number has nothing to do with the OMPI version itself.

I missed that this is happening only when executed as a singleton. Quick glance at the code shows that OMPI is missing a couple of lines - trivial fix.

rhc54 avatar Dec 29 '24 16:12 rhc54

Give this a try:

diff --git a/ompi/runtime/ompi_rte.c b/ompi/runtime/ompi_rte.c
index 2a2d66bbc3..2ba9483c98 100644
--- a/ompi/runtime/ompi_rte.c
+++ b/ompi/runtime/ompi_rte.c
@@ -69,6 +69,7 @@ opal_process_name_t pmix_name_invalid = {UINT32_MAX, UINT32_MAX};
  * session directory structure, then we shall cleanup after ourselves.
  */
 static bool destroy_job_session_dir = false;
+static bool destroy_proc_session_dir = false;

 static int _setup_top_session_dir(char **sdir);
 static int _setup_job_session_dir(char **sdir);
@@ -995,9 +996,12 @@ int ompi_rte_finalize(void)
         opal_process_info.top_session_dir = NULL;
     }

-    if (NULL != opal_process_info.proc_session_dir) {
+    if (NULL != opal_process_info.proc_session_dir && destroy_proc_session_dir) {
+        opal_os_dirpath_destroy(opal_process_info.proc_session_dir,
+                                false, check_file);
         free(opal_process_info.proc_session_dir);
         opal_process_info.proc_session_dir = NULL;
+        destroy_proc_session_dir = false;
     }

     if (NULL != opal_process_info.app_sizes) {
@@ -1174,6 +1178,7 @@ static int _setup_top_session_dir(char **sdir)

 static int _setup_job_session_dir(char **sdir)
 {
+    int rc;
     /* get the effective uid */
     uid_t uid = geteuid();

@@ -1185,18 +1190,33 @@ static int _setup_job_session_dir(char **sdir)
         opal_process_info.job_session_dir = NULL;
         return OPAL_ERR_OUT_OF_RESOURCE;
     }
+    rc = opal_os_dirpath_create(opal_process_info.job_session_dir, 0755);
+    if (OPAL_SUCCESS != rc) {
+        // could not create session dir
+        free(opal_process_info.job_session_dir);
+        opal_process_info.job_session_dir = NULL;
+        return rc;
+    }
     destroy_job_session_dir = true;
     return OPAL_SUCCESS;
 }

 static int _setup_proc_session_dir(char **sdir)
 {
+    int rc;
+
     if (0 > opal_asprintf(sdir,  "%s/%d",
                           opal_process_info.job_session_dir,
                           opal_process_info.my_name.vpid)) {
         opal_process_info.proc_session_dir = NULL;
         return OPAL_ERR_OUT_OF_RESOURCE;
     }
-
+    rc = opal_os_dirpath_create(opal_process_info.proc_session_dir, 0755);
+    if (OPAL_SUCCESS != rc) {
+        // could not create session dir
+        free(opal_process_info.proc_session_dir);
+        opal_process_info.proc_session_dir = NULL;
+        return rc;
+    }
     return OPAL_SUCCESS;
 }

rhc54 avatar Dec 29 '24 16:12 rhc54

@rhc54 Having applied your patch to openmpi and using the patched openmpi in my small version of this test program, I can confirm, the patch works, the error message vanishes.

Thanks a lot, and best regards, Jan

jkohnert avatar Dec 29 '24 19:12 jkohnert

Hey @rhc54 -- do we need this as a PR?

If so, where does destroy_proc_session_dir get set to true?

EDIT: Never mind -- I see https://github.com/open-mpi/ompi/pull/13003 😄

jsquyres avatar Jan 03 '25 19:01 jsquyres

Hey @rhc54 -- do we need this as a PR?

I don't need it - but you guys do 😄

rhc54 avatar Jan 03 '25 19:01 rhc54

We have seen and responded to this problem many times - I believe it is included in the docs somewhere. The problem is that BSD (mostly as seen on Mac) has created a default TMPDIR that is incredibly long. So when we add our tmpdir prefix (to avoid stepping on other people's tmp), the result is longer than the path length limits.

Solution: set TMPDIR in your environment to point to some shorter path, typically something like $HOME/tmp.

This issue is the first result that turns up on Google (I hit the same error with [email protected] %[email protected] arch=darwin-sequoia-m2); a search of the openmpi site returned no hits, and the internal doc search isn't smart enough to look for multiple keywords. Should I figure out how to open a PR for the docs, or was this fixed in 5.0.6 (see #13003)? Thanks!

(Also, I guess this issue can officially be closed by #13003 ?)

sethrj avatar Mar 15 '25 12:03 sethrj

I didn't make any changes to the docs, so if there is something missing there, you are welcome to fill the void!

Perhaps @jsquyres can provide you with some direction on how to open the PR?

And yes - we can officially close this now.

rhc54 avatar Mar 16 '25 12:03 rhc54