Pieter Noordhuis
These were disabled in #230 because they all fail when running consecutively. When run independently, they appear to pass...
The NVLink cube mesh architecture has partial peer access between devices. Two groups of 4 GPUs have full peer access and every GPU in one group has peer access to...
Stack from [ghstack](https://github.com/ezyang/ghstack): * **#243 Use a single listening socket per device** * #242 Add error class * #241 Add RAII wrapper for socket * #240 Allow deferring functions to...
Stack from [ghstack](https://github.com/ezyang/ghstack): * #243 Use a single listening socket per device * #242 Add error class * #241 Add RAII wrapper for socket * **#240 Allow deferring functions to...
Stack from [ghstack](https://github.com/ezyang/ghstack): * #243 Use a single listening socket per device * #242 Add error class * **#241 Add RAII wrapper for socket** * #240 Allow deferring functions to...
Stack from [ghstack](https://github.com/ezyang/ghstack): * #243 Use a single listening socket per device * **#242 Add error class** * #241 Add RAII wrapper for socket * #240 Allow deferring functions to...
Per @jjlilley in https://github.com/facebookincubator/gloo/pull/237#discussion_r356780531, we can use an `eventfd(2)` to avoid busy-spinning the epoll loop. If we do, we must also update the code that unregisters an fd to either:...
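For illustration, here is a minimal Linux-only sketch of the general eventfd wakeup technique: an `eventfd(2)` is registered with the epoll instance, so the loop can block in `epoll_wait` and be woken explicitly from another thread instead of spinning. `WakeableLoop`, `wake`, and `waitForWake` are hypothetical names, not Gloo's API, and the sketch does not cover the changes to fd unregistration mentioned above.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Gloo): owns an epoll instance plus an
// eventfd whose only job is to wake the loop.
class WakeableLoop {
 public:
  WakeableLoop()
      : epollFd_(epoll_create1(EPOLL_CLOEXEC)),
        wakeFd_(eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC)) {
    assert(epollFd_ >= 0 && wakeFd_ >= 0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = wakeFd_;
    const int rv = epoll_ctl(epollFd_, EPOLL_CTL_ADD, wakeFd_, &ev);
    assert(rv == 0);
    (void)rv;
  }

  ~WakeableLoop() {
    close(wakeFd_);
    close(epollFd_);
  }

  WakeableLoop(const WakeableLoop&) = delete;
  WakeableLoop& operator=(const WakeableLoop&) = delete;

  // Callable from any thread: adds 1 to the eventfd counter, which makes the
  // eventfd readable and therefore causes epoll_wait to return.
  bool wake() {
    const uint64_t one = 1;
    return write(wakeFd_, &one, sizeof(one)) ==
        static_cast<ssize_t>(sizeof(one));
  }

  // Blocks for up to timeoutMs milliseconds (-1 blocks indefinitely).
  // Returns true only if the loop was woken through wake().
  bool waitForWake(int timeoutMs) {
    epoll_event ev{};
    const int n = epoll_wait(epollFd_, &ev, 1, timeoutMs);
    if (n != 1 || ev.data.fd != wakeFd_) {
      return false;
    }
    // Reading resets the counter to zero, so the eventfd stops being
    // readable until the next wake().
    uint64_t count = 0;
    return read(wakeFd_, &count, sizeof(count)) ==
        static_cast<ssize_t>(sizeof(count));
  }

 private:
  int epollFd_;
  int wakeFd_;
};
```

In a real loop the same `epoll_wait` call would also report the registered sockets; the eventfd is just one more descriptor in the set, distinguished here by `ev.data.fd`.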
The `notify_send_ready` and `notify_recv_ready` messages used in the tcp backend (and the future uv backend, see #195) need better documentation, as does the protocol that governs how they are sequenced.
This is what we do in PyTorch upstream today but it would be good to move the functionality into Gloo. This would be a new type of context that wraps...
The comments mention it is usable for any `#nodes == c * base ^ x`, for any `c >= 1`, `base >= 2`, and `x >= 1`, but in reality...
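As a rough sketch of the condition as stated in the comments only (not of the actual restriction, which is elided above), the following checks whether a node count can be written as `c * base ^ x` with `c >= 1` and `x >= 1` for a fixed `base >= 2`, and returns the decomposition with the largest `x`. `Factorization` and `factorNodes` are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: nodes == c * base^x.
struct Factorization {
  uint64_t c;
  uint64_t x;
};

// Returns true and fills `out` if nodes == c * base^x for some c >= 1 and
// x >= 1. The largest possible x is chosen, so c is not divisible by base.
bool factorNodes(uint64_t nodes, uint64_t base, Factorization* out) {
  assert(base >= 2);
  assert(out != nullptr);
  if (nodes == 0 || nodes % base != 0) {
    return false;  // x >= 1 requires at least one factor of base
  }
  uint64_t c = nodes;
  uint64_t x = 0;
  while (c % base == 0) {
    c /= base;
    ++x;
  }
  out->c = c;
  out->x = x;
  return true;
}
```

For example, 12 nodes with base 2 decompose as `3 * 2 ^ 2`, while 7 nodes with base 2 do not satisfy the condition at all.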