Nicolas Castet

21 comments by Nicolas Castet

@njzjz How do you build NCCL? Are you setting the variable `CUDARTLIB`, or are you using the default (static cudart)?
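For context, NCCL's makefiles expose a `CUDARTLIB` variable that selects which CUDA runtime library to link against; the default is the static runtime. A minimal sketch of the two build variants (the `-j` value and install paths are illustrative, not from the original comment):

```shell
# Default build: links the static CUDA runtime (cudart_static).
make -j4 src.build

# Alternative: link the shared CUDA runtime instead by overriding CUDARTLIB.
# This changes which libcudart the resulting libnccl depends on at runtime.
make -j4 src.build CUDARTLIB=cudart
```

Which variant was used matters here because a shared-cudart build picks up the CUDA runtime present on the target machine, while a static build bakes one in.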

@sandeep3sai @zwlanpishu @ramkrishna1121 You are seeing this error because your saved_model file was created before the ["scale factor" feature](https://github.com/horovod/horovod/commit/e4554de96100f5a0e8686cd41cf99a6fe8a71e62#diff-3a264e114f1fba673b8380d35e36c85e7f026dc32bf1dc00c390f6f679987019R376) was added to Horovod in v0.20.0, so the Horovod op signature...

> jax.distributed.initialize() works, without arguments, on several but not all common MPI / Slurm parallel job launchers.

From what I remember, [slurm_cluster.py](jax/_src/clusters/slurm_cluster.py) should work with all Slurm jobs independent of...

Another potential solution for `mpi4py` users is to have `mpi4jax` define `Mpi4pyCluster` when the `mpi4jax` module is initialized, since `mpi4jax` already has a hard dependency on `mpi4py` anyway.

Hi Corey, I agree `mpi4jax` and `jax.distributed` may be used together or separately. I really like your approach of using `mpi4py` to handle all the vendor-specific MPI implementations. Figuring out all the...

Could someone provide the client-side benchmarking command used to trigger the crash?

@zui-jiang Re-running with `TORCH_DISTRIBUTED_DEBUG=DETAIL TORCH_SHOW_CPP_STACKTRACES=1 NCCL_DEBUG=INFO` set on the server side would give us a bit more info too.
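As a sketch of how those variables would be set, assuming a generic server launch command (`<server-launch-command>` is a placeholder, not from the original comment):

```shell
# Enable verbose collective/debug output before starting the server:
# - TORCH_DISTRIBUTED_DEBUG=DETAIL   logs per-collective shape/dtype checks
# - TORCH_SHOW_CPP_STACKTRACES=1     prints C++ stack traces on errors
# - NCCL_DEBUG=INFO                  makes NCCL log its setup and transport choices
TORCH_DISTRIBUTED_DEBUG=DETAIL \
TORCH_SHOW_CPP_STACKTRACES=1 \
NCCL_DEBUG=INFO \
<server-launch-command>
```

The resulting logs usually show which rank and which collective hit the failure, which narrows down mismatched-collective and NCCL-transport issues.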

#3709 should fix this bug. Let me know if you encounter an issue after applying it.

Looking into it. @zhyncs Do you have the command to reproduce the CI failure?

@yizhang2077 Still debugging. Can I have another hour-ish? Yes, it can work, but I have seen people complaining about this issue for non-dp-attention configs too. For the non-dp config paths,...