Nicolas Castet
@njzjz How do you build NCCL? Are you setting the variable `CUDARTLIB`, or are you using the default (static cudart)?
@sandeep3sai @zwlanpishu @ramkrishna1121 You are seeing this error because your saved_model file was created before the ["scale factor" feature](https://github.com/horovod/horovod/commit/e4554de96100f5a0e8686cd41cf99a6fe8a71e62#diff-3a264e114f1fba673b8380d35e36c85e7f026dc32bf1dc00c390f6f679987019R376) was added to Horovod in v0.20.0, so the Horovod op signature...
> `jax.distributed.initialize()` works, without arguments, on several but not all common MPI / Slurm parallel job launchers.

From what I remember, [slurm_cluster.py](jax/_src/clusters/slurm_cluster.py) should work with all Slurm jobs independent of...
Another potential solution for `mpi4py` users is to have `mpi4jax` define `Mpi4pyCluster` at init time of the `mpi4jax` module, since `mpi4jax` already has a hard dependency on `mpi4py` anyway.
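
To make the idea concrete, here is a minimal sketch of what "define the cluster class at init time" could look like. Everything in it is an assumption for illustration: `ClusterEnv` below is a simplified stand-in and not JAX's actual base class, `register_mpi4py_cluster` and `FakeComm` are hypothetical names, and the only `mpi4py` calls it relies on are the communicator methods `Get_rank()` and `Get_size()`.

```python
class ClusterEnv:
    """Simplified stand-in for a cluster-detection base class (hypothetical)."""

    _cluster_types = []

    def __init_subclass__(cls, **kwargs):
        # Every subclass registers itself as soon as it is defined.
        super().__init_subclass__(**kwargs)
        ClusterEnv._cluster_types.append(cls)

    @classmethod
    def is_env_present(cls):
        raise NotImplementedError

    @classmethod
    def detect(cls):
        # Return the first registered cluster type whose environment is present.
        for cluster_type in cls._cluster_types:
            if cluster_type.is_env_present():
                return cluster_type
        return None


def register_mpi4py_cluster(comm):
    """Define the cluster class at call time, e.g. from a package's __init__."""

    class Mpi4pyCluster(ClusterEnv):
        @classmethod
        def is_env_present(cls):
            return comm is not None

        @classmethod
        def get_process_count(cls):
            return comm.Get_size()

        @classmethod
        def get_process_id(cls):
            return comm.Get_rank()

    return Mpi4pyCluster


class FakeComm:
    """Test double mimicking the two communicator methods used above."""

    def __init__(self, rank, size):
        self._rank, self._size = rank, size

    def Get_rank(self):
        return self._rank

    def Get_size(self):
        return self._size
```

In a real package the communicator would presumably be `mpi4py.MPI.COMM_WORLD` and the call to `register_mpi4py_cluster` would sit in the package's `__init__`, so that users who already import `mpi4jax` get the detection without extra setup.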
Hi Corey, I agree `mpi4jax` and `jax.distributed` may be used together or separately. I really like your `mpi4py` approach to catch all the vendor-specific MPI implementations. Figuring out all the...