
rocm_ipc_md.c:69 UCX ERROR Failed to create ipc

Open jedbrown opened this issue 5 years ago • 12 comments

Describe the bug

I've been following these instructions for ROCm-aware MPI on a Zen2 server node with a Radeon VII and ROCm-3.5.0. The large BAR test passed and the builds all went smoothly, but the OSU bandwidth test prints a stream of UCX IPC errors once the message size goes past 8192 bytes:

$ $OMPI_DIR/bin/mpirun -n 2 --mca btl '^openib' mpi/pt2pt/osu_bw -d rocm D D
[1595887875.771550] [noether:3621053:0]         parser.c:1626 UCX  WARN  unused env variable: UCX_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1595887875.771658] [noether:3621052:0]         parser.c:1626 UCX  WARN  unused env variable: UCX_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
# OSU MPI-ROCM Bandwidth Test v5.3.2                                                                                                                                                           
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)                                                                                                                                   
# Size      Bandwidth (MB/s)                                                                                                                                                                   
1                       0.67                                                                                                                                                                   
2                       1.32                                                                                                                                                                   
4                       2.66
8                       5.28
16                     10.73
32                     11.04
64                     11.86
128                    12.51
256                    19.71
512                    38.49
1024                   68.15
2048                  121.26
4096                  158.25
8192                  167.94
[1595887877.188205] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188222] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188226] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188230] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188233] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188235] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188238] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
[...many screens of similar output...]
65536                 205.33
[...]

The relevant function is

static hsa_status_t uct_rocm_ipc_pack_key(void *address, size_t length,
                                          uct_rocm_ipc_key_t *key)
{
    hsa_status_t status;
    hsa_agent_t agent;
    void *base_ptr;
    size_t size;

    status = uct_rocm_base_get_ptr_info(address, length, &base_ptr, &size, &agent);
    if (status != HSA_STATUS_SUCCESS) {
        ucs_error("pack none ROCM ptr %p/%lx", address, length);
        return status;
    }

    status = hsa_amd_ipc_memory_create(base_ptr, size, &key->ipc);
    if (status != HSA_STATUS_SUCCESS) {
        ucs_error("Failed to create ipc for %p/%lx", address, length);
        return status;
    }

    key->address = (uintptr_t)base_ptr;
    key->length = size;
    key->dev_num = uct_rocm_base_get_dev_num(agent);

    return HSA_STATUS_SUCCESS;
}
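
Note that the format string above is `%p/%lx`, so the length in the error message is printed in hexadecimal: `4000` is 16384 bytes, which is presumably the next message size after the last clean row (8192). As a reading aid, here is a minimal Python sketch that collapses the repeated error lines and decodes the length. The function name `summarize_ipc_errors` is hypothetical and the sample lines are copied from the log above; it is only a log-reading helper, not part of UCX.

```python
import re
from collections import Counter

# Matches the message produced by ucs_error("Failed to create ipc for %p/%lx", ...).
# %p prints the address with a 0x prefix; %lx prints the length in hex without one.
IPC_ERROR = re.compile(r"Failed to create ipc for (0x[0-9a-fA-F]+)/([0-9a-fA-F]+)")


def summarize_ipc_errors(log_text):
    """Return a Counter mapping (address, length_in_bytes) to occurrence count."""
    counts = Counter()
    for line in log_text.splitlines():
        match = IPC_ERROR.search(line)
        if match:
            address = int(match.group(1), 16)
            length = int(match.group(2), 16)  # hex, so "4000" is 16384 bytes
            counts[(address, length)] += 1
    return counts


# Two error lines copied from the output above, preceded by a normal result row.
sample = """\
8192                  167.94
[1595887877.188205] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188222] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
"""

for (address, length), count in summarize_ipc_errors(sample).items():
    print(f"{count} x address={address:#x} length={length} bytes")
```

Piping the full benchmark output through something like this makes it easier to see that every failure refers to the same buffer and the same 16 KiB length.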

Steps to Reproduce

Install latest versions per https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI and try the intranode test.

# UCT version=1.10.0 revision bae84af
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-optimizations --prefix=/projects/ucx/ucx --with-rocm=/opt/rocm --without-knem --without-cuda

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • Debian bullseye
    • Linux noether 5.7.0-1-amd64 #1 SMP Debian 5.7.6-1 (2020-06-24) x86_64 GNU/Linux
  • For GPU related issues:
    • GPU type: Radeon VII. From rocminfo:
*******                                                                                        
Agent 9                                                                                        
*******                                                                                        
  Name:                    gfx906                                  
  Uuid:                    GPU-7772516172dc76ba                    
  Marketing Name:          Vega 20 [Radeon VII]                    
  Vendor Name:             AMD                                 
  Feature:                 KERNEL_DISPATCH                     
  Profile:                 BASE_PROFILE                           
  Float Round Mode:        NEAR                                   
  Max Queue Number:        128(0x80)                               
  Queue Min Size:          4096(0x1000)                            
  Queue Max Size:          131072(0x20000)                         
  Queue Type:              MULTI                                   
  Node:                    8                                       
  Device Type:             GPU                                 
  Cache Info:                                                                                  
    L1:                      16(0x10) KB                             
  Chip ID:                 26287(0x66af)                             
  Cacheline Size:          64(0x40)                               
  Max Clock Freq. (MHz):   1802                                
  BDFID:                   41728                                    
  Internal Node ID:        8                                        
  Compute Unit:            60                                       
  SIMDs per CU:            4                                       
  Shader Engines:          4                                   
  Shader Arrs. per Eng.:   1                
  WatchPts on Addr. Ranges:4                                   
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                               
  Wavefront Size:          64(0x40)                            
  Workgroup Max Size:      1024(0x400)                         
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                         
    y                        1024(0x400)                         
    z                        1024(0x400)                         
  Max Waves Per CU:        40(0x28)                            
  Max Work-item Per CU:    2560(0xa00)                         
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                  
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                                
      Alloc Granule:           4KB                                 
      Alloc Alignment:         4KB                                 
      Accessible by all:       FALSE                               
    Pool 2                   
      Segment:                 GROUP                               
      Size:                    64(0x40) KB                         
      Allocatable:             FALSE                               
      Alloc Granule:           0KB                                 
      Alloc Alignment:         0KB                                 
      Accessible by all:       FALSE                               
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                    
      Default Rounding Mode:   NEAR                                
      Default Rounding Mode:   NEAR                                
      Fast f16:                TRUE                                
      Workgroup Max Size:      1024(0x400)                         
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                         
        y                        1024(0x400)                         
        z                        1024(0x400)                         
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)  
      FBarrier Max Size:       32                                  
*** Done ***             

Additional information (depending on the issue)

                Package: Open MPI jeka2967@noether Distribution
                Open MPI: 5.0.0a1
  Open MPI repo revision: v2.x-dev-7987-gc07d77fbf2
   Open MPI release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 5.0.0a1
                  Prefix: /projects/ucx/ompi
 Configured architecture: x86_64-pc-linux-gnu
           Configured by: jeka2967
           Configured on: Mon Jul 27 15:51:12 MDT 2020
          Configure host: noether
  Configure command line: '--prefix=/projects/ucx/ompi' '--with-ucx=/projects/ucx/ucx' '--without-verbs'

jedbrown avatar Jul 27 '20 23:07 jedbrown

Hi @jedbrown, we fixed several IPC issues in the ROCm 3.7 release. Can you give it a try? We recommend uninstalling any older ROCm version before installing 3.7.

souravzzz avatar Aug 27 '20 21:08 souravzzz

I've been using ROCM-3.7 successfully for a few days, but unfortunately, this UCX error is still present with fresh rebuilds of ucx, ompi, and the osu benchmark.

$ $OMPI_DIR/bin/mpiexec -n 2 --mca btl '^openib'  mpi/pt2pt/osu_bw -d rocm D D                                                
# OSU MPI-ROCM Bandwidth Test v5.3.2                                                                                                                                                           
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)                                                                                                                                   
# Size      Bandwidth (MB/s)                                                                                                                                                                   
1                       0.64                                                                                                                                                                   
2                       1.37                                                                                                                                                                   
4                       2.75                                                                                                                                                                   
8                       5.56                                                                                                                                                                   
16                     10.96                                                                                                                                                                   
32                     11.12                                                                                                                                                                   
64                     11.82                                                                                                                                                                   
128                    12.70                                                                                                                                                                   
256                    19.80                                                                                                                                                                   
512                    40.31                                                                                                                                                                   
1024                   69.53                                                                                                                                                                   
2048                  113.14                                                                                                                                                                   
4096                  155.10                                                                                                                                                                   
8192                  184.66                                                                                                                                                                   
[1598594643.109801] [noether:2043685:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7fd9d7e00000/4000                                                                          
[1598594643.109821] [noether:2043685:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7fd9d7e00000/4000                                                                          
[1598594643.109826] [noether:2043685:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7fd9d7e00000/4000                                                                          
[1598594643.109829] [noether:2043685:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7fd9d7e00000/4000                                                                          
[1598594643.109834] [noether:2043685:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7fd9d7e00000/4000     
[...]

jedbrown avatar Aug 28 '20 06:08 jedbrown

I'm still seeing this issue with ROCm-4.0 and a fresh build of today's ucx (1d22f7486ef4202da30ee811a95ad394b862b9a1) and ompi (8ff2277b7e48b899341f69a9f3f9c9ee7cecf476).

$ $OMPI_DIR/bin/mpiexec -n 2 --mca btl '^openib'  mpi/pt2pt/osu_bw -d rocm D D
[noether:1378737] pmix_mca_base_component_repository_open: unable to open mca_ptl_tcp: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378737] pmix_mca_base_component_repository_open: unable to open mca_ptl_usock: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378758] pmix_mca_base_component_repository_open: unable to open mca_ptl_tcp: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378759] pmix_mca_base_component_repository_open: unable to open mca_ptl_tcp: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378758] pmix_mca_base_component_repository_open: unable to open mca_ptl_usock: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378759] pmix_mca_base_component_repository_open: unable to open mca_ptl_usock: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
# OSU MPI-ROCM Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.53
2                       0.62
4                       1.24
8                       2.52
16                      5.14
32                     10.28
64                     27.05
128                    13.13
256                    32.39
512                    23.62
1024                   23.42
2048                   23.28
4096                   23.27
8192                   23.23
[1608394399.158674] [noether:1378758:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[1608394399.158695] [noether:1378758:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[1608394399.158699] [noether:1378758:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[1608394399.158702] [noether:1378758:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[...]

jedbrown avatar Dec 19 '20 16:12 jedbrown

Same issue is also present with today's MPICH 'main' (dac05cf7f9ec1a59e2d917f3da80fc943f378872)

$ $MPICH_DIR/bin/mpiexec -n 2 mpi/pt2pt/osu_bw -d rocm D D
# OSU MPI-ROCM Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.51
2                       0.64
4                       1.24
8                       2.69
16                      5.46
32                     11.48
64                     28.43
128                    14.47
256                    34.43
512                    26.37
1024                   26.14
2048                   25.80
4096                   25.81
8192                   25.71
[1608692521.150727] [noether:1682552:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7f657e600000/4000
[1608692521.150754] [noether:1682552:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7f657e600000/4000
[1608692521.150758] [noether:1682552:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7f657e600000/4000
[1608692521.150760] [noether:1682552:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7f657e600000/4000

jedbrown avatar Dec 23 '20 03:12 jedbrown

Hi @jedbrown, I believe it's a ROCR issue and not a UCX one. I would like to send you some test programs to debug it further. Is your @jedbrown.org email address a good way to reach you?

souravzzz avatar Jan 07 '21 16:01 souravzzz

Yes, thanks.

jedbrown avatar Jan 07 '21 16:01 jedbrown

The root cause was confirmed to be with ROCR support for Radeon VII and not UCX. An internal issue has been raised to resolve this.

souravzzz avatar Jan 18 '21 16:01 souravzzz

@jedbrown, ROCr supports IPC on Radeon VII. The trouble seems to be that you are using the upstream amdgpu driver, which does not support IPC on any device. For IPC support you will need to install our DKMS amdgpu driver package. Unfortunately, Debian is not a supported OS (Ubuntu is), so our DKMS package may not install against your kernel.
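
For anyone trying to tell which amdgpu module a machine is using, the following is a rough sketch in Python, not an official check. It assumes the usual layout in which in-tree modules sit under `kernel/drivers/...` and DKMS-built modules are installed under `updates/dkms/` (or `extra/` on some distributions). The function name `classify_amdgpu_module` and the example paths are hypothetical, and the result is only a heuristic.

```python
def classify_amdgpu_module(module_path):
    """Guess whether a kernel module path is in-tree or out-of-tree.

    Assumption: DKMS installs modules under "updates/dkms/" (or "extra/" on
    some distributions), while modules shipped with the kernel live under
    "kernel/drivers/". Anything else is reported as "unknown".
    """
    path = module_path.strip()
    if "/updates/dkms/" in path or "/extra/" in path:
        return "out-of-tree"
    if "/kernel/drivers/" in path:
        return "in-tree"
    return "unknown"


# Illustrative paths only; on a real system the path would come from
# the output of `modinfo -n amdgpu`.
examples = [
    "/lib/modules/5.7.0-1-amd64/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko",
    "/lib/modules/5.4.0-42-generic/updates/dkms/amdgpu.ko",
]
for path in examples:
    print(classify_amdgpu_module(path), path)
```

An "in-tree" result would be consistent with the upstream driver described above, but it does not by itself prove that IPC is unavailable.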

skeelyamd avatar Jan 22 '21 02:01 skeelyamd

Thanks, is the support going upstream? For various reasons, I'm not going to switch distributions, but we're currently on Linux 5.10 and follow the usual upgrades.

jedbrown avatar Jan 26 '21 15:01 jedbrown

Has there been any update on this?

simonbyrne avatar Jun 02 '22 05:06 simonbyrne

@simonbyrne, can you please try the ucx 1.13.0-rc1 release? We fixed an issue with IPC creation. We didn't test on ROCm 3.x, but we did test on ROCm 4.5 and ROCm 5.x.
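
To check whether an existing build is at least 1.13, one option is to parse a version line in the format quoted earlier in this issue (`# UCT version=1.10.0 revision bae84af`). The Python sketch below assumes that format; other UCX releases may print the version differently, and the helper names `parse_ucx_version` and `is_at_least` are hypothetical. A release-candidate suffix such as `-rc1` is ignored by the comparison.

```python
import re


def parse_ucx_version(info_line):
    """Extract (major, minor, patch) from a line like
    '# UCT version=1.10.0 revision bae84af'; return None if nothing matches."""
    match = re.search(r"version\s*[=:]\s*(\d+)\.(\d+)(?:\.(\d+))?", info_line)
    if match is None:
        return None
    # A missing patch component is treated as 0.
    major, minor, patch = match.groups(default="0")
    return int(major), int(minor), int(patch)


def is_at_least(info_line, minimum=(1, 13, 0)):
    """True if the parsed version is greater than or equal to `minimum`."""
    version = parse_ucx_version(info_line)
    return version is not None and version >= minimum


# The build reported at the top of this issue predates 1.13.
print(is_at_least("# UCT version=1.10.0 revision bae84af"))
```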

edgargabriel avatar Jun 02 '22 11:06 edgargabriel

Thanks, 1.13-rc1 did fix my issue (I'm not sure which ROCm version it is using).

simonbyrne avatar Jun 02 '22 16:06 simonbyrne