gloo
Collective communications library with various primitives for multi-machine training.
- Changes to control hipify of CUDA_VERSION to HIP_VERSION
- Use GLOO_USE_ROCM instead of __HIP_PLATFORM_HCC__
- Add __HIP_PLATFORM_AMD__ since __HIP_PLATFORM_HCC__ is being deprecated
Summary: The MultiProc tests do not catch multiprocessing errors thoroughly. This diff plugs some of those holes and adds better logging on failures. Differential Revision: D26186660
When trying to build the library on Ubuntu with CMake using clang++-11 with libc++, the following error occurs: `/home/lib/pytorch/third_party/gloo/gloo/transport/tcp/device.cc:152:39: error: implicit instantiation of undefined template 'std::__1::array' std::array hostname; ^` /usr/lib/llvm-10/bin/../include/c++/v1/__tuple:219:64:...
This clears the warning: CMake Warning: The package name passed to `find_package_handle_standard_args` (RCCL) does not match the name of the calling package (rccl). This can lead to problems in calling...
Summary: Add alltoall and alltoallv to Gloo Differential Revision: D21873282
To avoid a collision with a variable in the RCCL CMake file. This should fix the error about not finding `-lrccl` in https://github.com/pytorch/pytorch/pull/31341 (now refiled as https://github.com/pytorch/pytorch/pull/34683)
These were disabled in #230 because they all fail when run consecutively. When run independently, they appear to pass...
The NVLink cube mesh architecture has partial peer access between devices. Two groups of 4 GPUs have full peer access and every GPU in one group has peer access to...
For Gloo in PyTorch distributed, as described in https://pytorch.org/docs/stable/distributed.html, will the following code get the performance benefits of CUDA-aware MPI? (e.g., GPU-to-GPU transfers over PCIe that bypass the CPU)...