Nicolas Castet
@yizhang2077 Is it reproducible on your side? I used one of my container images based on commit 9f635ea50de920aa507f486daafba26a5b837574 on an 8xH200 box and could not reproduce the failure with or...
@yizhang2077 Thanks, let me try your container image. This might be related: > CUDA RNG operations are permitted, and when using multiple torch.Generator instances within a graph, they must be registered...
The failure did not happen for me on PyTorch 2.7, but it does on 2.5. While debugging torch.compile: ``` /usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py:725: UserWarning: Graph break due to unsupported builtin None._SimpleCData.__new__. This function...
> UserWarning: Graph break due to unsupported builtin None._SimpleCData.__new__. ... > @nvcastet It is weird, since pynccl allreduce is also in critical path and is graphable. The message (displayed using...
@ispobock when downloading `nvidia-nccl-cu11`, I see `cu116`: ``` # pip download nvidia-nccl-cu11 Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com/ Collecting nvidia-nccl-cu11 Downloading https://developer.download.nvidia.com/compute/redist/nvidia-nccl-cu11/nvidia-nccl-cu11-2022.5.19.tar.gz (16 kB) Preparing metadata (setup.py) ... done Collecting nvidia-nccl-cu116...
@yizhang2077 @zhyncs I went back to the first commit and registered the pynccl allgather as a PyTorch custom op, as you suggested. Ideally, it would be nice to get rid...
@ispobock Thanks, I fixed the lint.
@EricHallahan Thanks a lot for raising this issue and the thorough discussion! We detect current OMPI with ORTE via the presence of the env var `OMPI_MCA_orte_hnp_uri`: https://github.com/google/jax/blob/main/jax/_src/clusters/ompi_cluster.py#L28-L29 For OpenMPI with...
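For illustration, the environment-variable check described above can be sketched as follows. This is a minimal sketch and not the code in `ompi_cluster.py`: it assumes only what the comment states, that `OMPI_MCA_orte_hnp_uri` is present when the process was launched by Open MPI with ORTE. The function name `launched_by_ompi_orte` and its `environ` parameter are hypothetical.

```python
import os

# Env var named in the comment above; assumed to be set by Open MPI with ORTE.
_ORTE_URI = "OMPI_MCA_orte_hnp_uri"


def launched_by_ompi_orte(environ=None):
    """Return True if the ORTE marker variable is present.

    `environ` is a hypothetical hook for testing; it defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    return _ORTE_URI in env
```

Passing a plain dict as `environ` lets the check be exercised without launching under `mpirun`.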
You can use auto-detection via mpi4py for that. See https://github.com/google/jax/pull/20174
@Fridge003 I think you are correct: it should be compatible, since trt_allreduce_fusion has its own workspace allocation (unlike the custom-allreduce kernel, which registers existing tensors). It means @gracehonv bug on...