Nicolas Castet

21 comments by Nicolas Castet

@njzjz How do you build NCCL? Are you setting the variable `CUDARTLIB`, or are you using the default (static cudart)?
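For context, NCCL's makefiles expose a `CUDARTLIB` variable that selects which CUDA runtime library to link against; the default is the static runtime. A minimal sketch of the two build variants (the `-j` value and install paths are illustrative, not from the original comment):

```shell
# Default build: links the static CUDA runtime (cudart_static).
make -j4 src.build

# Alternative: link the shared CUDA runtime instead by overriding CUDARTLIB.
# This changes which libcudart the resulting libnccl depends on at runtime.
make -j4 src.build CUDARTLIB=cudart
```

Which variant was used matters here because a shared-cudart build picks up the CUDA runtime present on the target machine, while a static build bakes one in.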

@sandeep3sai @zwlanpishu @ramkrishna1121 You are seeing this error because your saved_model file was created before the ["scale factor" feature](https://github.com/horovod/horovod/commit/e4554de96100f5a0e8686cd41cf99a6fe8a71e62#diff-3a264e114f1fba673b8380d35e36c85e7f026dc32bf1dc00c390f6f679987019R376) was added to Horovod in v0.20.0, so the Horovod op signature...

> jax.distributed.initialize() works, without arguments, on several but not all common MPI / Slurm parallel job launchers.

From what I remember, [slurm_cluster.py](jax/_src/clusters/slurm_cluster.py) should work with all Slurm jobs independent of...

Another potential solution for `mpi4py` users is to have `mpi4jax` define `Mpi4pyCluster` when the `mpi4jax` module is initialized, since `mpi4jax` already has a hard dependency on `mpi4py` anyway.

Hi Corey, I agree `mpi4jax` and `jax.distributed` may be used together or separately. I really like your approach of using `mpi4py` to handle all the vendor-specific MPI implementations. Figuring out all the...

Could someone provide the client-side benchmarking command used to trigger the crash?

@zui-jiang Re-running with `TORCH_DISTRIBUTED_DEBUG=DETAIL TORCH_SHOW_CPP_STACKTRACES=1 NCCL_DEBUG=INFO` set on the server side would give us a bit more info too.
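As a sketch of how those variables would be set, assuming a generic server launch command (`<server-launch-command>` is a placeholder, not from the original comment):

```shell
# Enable verbose collective/debug output before starting the server:
# - TORCH_DISTRIBUTED_DEBUG=DETAIL   logs per-collective shape/dtype checks
# - TORCH_SHOW_CPP_STACKTRACES=1     prints C++ stack traces on errors
# - NCCL_DEBUG=INFO                  makes NCCL log its setup and transport choices
TORCH_DISTRIBUTED_DEBUG=DETAIL \
TORCH_SHOW_CPP_STACKTRACES=1 \
NCCL_DEBUG=INFO \
<server-launch-command>
```

The resulting logs usually show which rank and which collective hit the failure, which narrows down mismatched-collective and NCCL-transport issues.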

#3709 should fix this bug. Let me know if you encounter an issue after applying it.

Looking into it. @zhyncs Do you have the command to reproduce the CI failure?

@yizhang2077 Still debugging. Can I have another hour-ish? Yes, it can work, but I have seen people complaining about this issue for non-dp-attention configs too. For the non-dp config paths,...