gloo

Collective communications library with various primitives for multi-machine training.

Results 90 gloo issues

Compilation fails with clang: default constructor of 'Exception' is implicitly deleted because base class 'std::runtime_error' has no default constructor. note: default constructor of 'InvalidOperationException' is implicitly deleted because base class...
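The diagnostic above is the kind clang emits when a class derived from `std::runtime_error` is default-constructed, because `std::runtime_error` itself has no default constructor. Below is a minimal sketch of the pattern and one possible way around it. The two classes are hypothetical stand-ins that only borrow their names from the error message; they are not gloo's actual definitions, and `describeFailure` is an illustrative helper.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins named after the classes in the diagnostic;
// they are NOT copied from gloo's sources.
struct Exception : public std::runtime_error {
  // std::runtime_error has no default constructor, so a defaulted
  // `Exception() = default;` would be defined as deleted.
  // Forwarding an explicit message avoids the problem.
  explicit Exception(const std::string& msg) : std::runtime_error(msg) {}
};

struct InvalidOperationException : public Exception {
  explicit InvalidOperationException(const std::string& msg)
      : Exception(msg) {}
};

// Illustrative helper: throws the derived exception and returns the
// message seen through the std::runtime_error base.
std::string describeFailure() {
  try {
    throw InvalidOperationException("invalid operation");
  } catch (const std::runtime_error& e) {
    return e.what();
  }
}
```

Whether the right fix is to drop the defaulted constructor or to give it a message depends on how the real classes are used, which the truncated report does not show.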

We're now using CPU + TCP + a Redis store in our cluster of 250 machines. It works well when the number of gloo instances in one group is under 1k, but...

For non-blocking communication, it is important to be able to test whether a receive/send operation has completed. This PR adds such methods to `transport/unbounded_buffer.h` and implements them for the TCP transport.
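To illustrate the difference between testing for completion and waiting for it, here is a minimal sketch of a completion tracker. `PendingOp`, `markComplete`, `testDone`, and `waitDone` are hypothetical names chosen for this example; they are not the methods the PR adds and do not reflect gloo's transport API.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical completion tracker; names are illustrative only and are
// not the methods added by the PR.
class PendingOp {
 public:
  // Called by whoever finishes the operation (e.g. an I/O thread).
  void markComplete() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_all();
  }

  // Non-blocking test: returns the current state immediately.
  bool testDone() {
    std::lock_guard<std::mutex> lock(mutex_);
    return done_;
  }

  // Blocking wait, shown for contrast with testDone().
  void waitDone() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return done_; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool done_ = false;
};
```

A caller can poll `testDone()` between units of other work and fall back to `waitDone()` only when it has nothing else to do.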

CLA Signed

Hi all, I was going through gloo and couldn't find any non-blocking collective operations. Is there a way to achieve non-blocking operations at the moment? Or are there any plans for...

I'm running `benchmark_cuda` with MPI and am setting various NCCL environment variables on the command line. When I specify `-x NCCL_DEBUG=INFO` I don't see any debug info being dumped on...

Hi @pietern, I get this error consistently. Error 5 is [IBV_WC_WR_FLUSH_ERR] = "Work Request Flushed Error". I did some debugging, and it is due to the pair destructor code not...

Tested with LLVM 12.0.0. If archive is not included, you will get the following error:

```
external/gloo/gloo/transport/tcp/device.cc:151:39: error: implicit instantiation of undefined template 'std::array'
std::array hostname;
```
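An "implicit instantiation of undefined template 'std::array'" error generally means `std::array` was only forward-declared through some other header and `<array>` itself was never included. Here is a minimal sketch of a hostname lookup that includes the header explicitly. `localHostname` and the buffer size of 256 are assumptions for illustration and are not taken from `device.cc`.

```cpp
#include <array>     // defines std::array; omitting it can trigger the error
#include <string>
#include <unistd.h>  // gethostname (POSIX)

// Hypothetical helper; the buffer size of 256 is an assumption.
std::string localHostname() {
  std::array<char, 256> hostname{};  // zero-initialized buffer
  // Pass size - 1 so the final byte stays '\0' even if the name is truncated.
  if (gethostname(hostname.data(), hostname.size() - 1) != 0) {
    return std::string();  // empty string on failure
  }
  return std::string(hostname.data());
}
```

Some standard library versions pull in `<array>` transitively and others do not, so a build can pass with one toolchain and fail with another.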

CLA Signed

```C++
#include
#include
#include "gloo/allreduce_ring.h"
#include "gloo/reduce_scatter.h"
#include "gloo/rendezvous/context.h"
#include "gloo/rendezvous/file_store.h"
#include "gloo/rendezvous/prefix_store.h"
#include "gloo/transport/tcp/device.h"

int main(){
  int num_elements = 12;
  int buffer_data[] = {1, 2, 3, 4, 5, 6,...
```