gloo

Collective communications library with various primitives for multi-machine training.

Results 90 gloo issues

The `notify_send_ready` and `notify_recv_ready` messages used in the tcp backend (and future uv backend, see #195) need better documentation, as does the protocol for how they are sequenced.

When importing Gloo (via `find_package(Gloo)`), it tells me it was not able to find libglood.a in ${BUILD_DIR}. The file is correctly installed in ${INSTALL_DIR}. This is on Linux x86_64....

This is what we do in PyTorch upstream today but it would be good to move the functionality into Gloo. This would be a new type of context that wraps...

enhancement

The comments mention it is usable for any `#nodes == c * base ^ x`, for any `c >= 1`, `base >= 2`, and `x >= 1`, but in reality...
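The condition quoted in this report can be checked mechanically for a given node count and base. The following is a minimal sketch and is not taken from Gloo; `Factorization` and `factorNodes` are hypothetical names introduced only for illustration. It says nothing about the limitation the report goes on to describe, which is cut off above.

```cpp
#include <cstddef>

// Hypothetical helper type: nodes written as c * base^x.
struct Factorization {
  std::size_t c;  // remaining factor, no longer divisible by base
  std::size_t x;  // exponent of base; 0 means no valid factorization
};

// Finds the largest x such that nodes == c * base^x.
// Returns x == 0 when the stated condition (base >= 2, x >= 1) cannot hold.
Factorization factorNodes(std::size_t nodes, std::size_t base) {
  if (base < 2 || nodes == 0) {
    return Factorization{nodes, 0};
  }
  std::size_t c = nodes;
  std::size_t x = 0;
  while (c % base == 0) {
    c /= base;
    ++x;
  }
  return Factorization{c, x};
}
```

For example, `factorNodes(12, 2)` yields `c == 3` and `x == 2`, while `factorNodes(7, 2)` yields `x == 0` because 7 is not a multiple of 2.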

Reduce and Allreduce ops apply a sanity check to enforce non-empty inputs ([here](https://github.com/facebookincubator/gloo/blob/master/gloo/allreduce.cc#L95)). Allgather returns error code 8 on empty inputs. Does it make sense to support empty inputs in these...

```
[ RUN ] Transport/TransportMultiProcTest.UnboundIoErrors/44
terminate called after throwing an instance of 'gloo::IoException'
  what(): [/home/lumin/Debian/gloo.pkg/gloo/gloo/transport/tcp/pair.cc:374] writev [::1]:54172: Connection reset by peer
/home/lumin/Debian/gloo.pkg/gloo/gloo/test/transport_test.cc:200: Failure
Value of: WIFEXITED(result)
  Actual: false
Expected: true
...
```

bug

This is used in a few places in `gloo/transport/**`.

The shutdown code may be a bit too aggressive here. We should have a stress test for the termination scenario where we loop on (context creation, barrier, context destruction) and...

Hello! Gloo seems to run fine with RoCE, but it gets stuck with SoftRoCE. It should just run out of the box but it looks like it cannot...

Hi! I'm testing the benchmark program. When I use the `--verify` flag, I get the following error: what(): [enforce fail at /home/ubuntu/gloo/gloo/benchmark/main.cc:91] T(offset + expected) == input[i]. 2.4e+07 vs 2.4e+07....

bug
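The failure message in the `--verify` report shows two values that print identically. One way this can happen, offered here as an assumption and not as a diagnosis of the report, is that default stream formatting keeps only six significant digits, so two different numbers near 2.4e+07 look the same. In addition, a 32-bit float cannot represent every integer above 2^24 (16777216). The sketch below illustrates both effects; `defaultFormat` and `exactInFloat` are hypothetical helper names, not part of Gloo.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Formats a value as a default-configured std::ostream would
// (general notation, 6 significant digits).
template <typename T>
std::string defaultFormat(T value) {
  std::ostringstream os;
  os << value;
  return os.str();
}

// Returns true if the integer survives a round trip through float unchanged.
bool exactInFloat(std::int64_t n) {
  return static_cast<std::int64_t>(static_cast<float>(n)) == n;
}
```

With these helpers, `defaultFormat(24000000.0)` and `defaultFormat(24000002.0)` both produce `2.4e+07` even though the values differ, and `exactInFloat(24000001)` is false. Printing with a higher precision (for example via `std::setprecision`) would make such a mismatch visible in the message.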