Lang Xu
Before loading omnitrace:

```
(gpt-neox-rocm5.6.0) langx@frontier07915:/lustre/orion/csc549/scratch/langx> python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch...
```
This PR:
- [x] fixed fused_rope naming in JIT compilation
- [x] added a readme for AMD support through fused_kernels

@Quentin-Anthony
- `--single` arg for running a single message size, mutually exclusive with `--scan`
- validation for allreduce through `--validate`, which can be run for `--trials` iterations (see the sketch below)
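A minimal sketch of how these flags might be wired together, assuming an argparse-based CLI; everything beyond the four flags named above (the parser description, defaults, help strings) is hypothetical:

```python
import argparse

def build_parser():
    # Hypothetical wiring of the flags described above.
    parser = argparse.ArgumentParser(description="allreduce benchmark (sketch)")
    # --single and --scan are mutually exclusive modes.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--scan", action="store_true",
                      help="sweep over a range of message sizes")
    mode.add_argument("--single", type=int, metavar="BYTES",
                      help="run a single message size")
    parser.add_argument("--validate", action="store_true",
                        help="check allreduce results for correctness")
    parser.add_argument("--trials", type=int, default=5,
                        help="number of validation iterations")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```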
This PR enables the PyTorch Profiler through a `--profile` flag; note that this adds observable overhead from the per-iteration `step()` calls. The output logs are saved under `communication/profiles`.
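A hedged sketch of what the profiled path might look like, assuming the benchmark loop calls `prof.step()` once per iteration (the source of the overhead noted above). The output directory matches the `communication/profiles` path mentioned; the function name, schedule values, and loop shape are illustrative, not the PR's actual code:

```python
import torch
import torch.distributed as dist
from torch.profiler import (ProfilerActivity, profile, schedule,
                            tensorboard_trace_handler)

def run_allreduce(tensor, iters, profile_enabled):
    # Assumes torch.distributed is already initialized.
    if not profile_enabled:
        for _ in range(iters):
            dist.all_reduce(tensor)
        return
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=tensorboard_trace_handler("communication/profiles"),
    ) as prof:
        for _ in range(iters):
            dist.all_reduce(tensor)
            prof.step()  # advances the profiler schedule; adds per-step overhead
```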
Hi! I am currently testing out [stream.cu](https://github.com/pmodels/mpich/blob/main/test/mpi/impls/mpich/cuda/stream.cu) and have the following output:

```
mpirun -l -np 2 -ppn 2 -genv MPIR_CVAR_CH4_ENABLE_STREAM_WORKQ=1 -genv MPIR_CVAR_GPU_HAS_WAIT_KERNEL=1 -genv MPIR_CVAR_ENABLE_GPU=1 -genv MPIR_CVAR_CH4_RESERVE_VCIS=1 -genv MPIR_CVAR_CH4_NUM_VCIS=2 ./stream...
```
# Summary

When using MPIX Stream enqueue APIs with MPICH ch4:ucx in a PyTorch ProcessGroup backend, we observe correctness failures for larger allreduce sizes unless we force a cudaStreamSynchronize after...
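A minimal sketch of the failure mode and workaround as they might look from the Python side, assuming the custom backend enqueues the collective onto its own device stream; `torch.cuda.synchronize()` here stands in for the `cudaStreamSynchronize` described above, and the function and parameter names are hypothetical:

```python
import torch
import torch.distributed as dist

def checked_allreduce(nbytes, force_sync=True):
    # Assumes torch.distributed is initialized with the stream-enqueuing backend.
    n = nbytes // 4
    t = torch.ones(n, dtype=torch.float32, device="cuda")
    dist.all_reduce(t)  # enqueued onto the backend's device stream
    if force_sync:
        # Workaround: block until all enqueued device work has completed.
        # Without this, the read below only orders against the default
        # stream, not the stream the collective was enqueued on.
        torch.cuda.synchronize()
    expected = float(dist.get_world_size())
    return bool(torch.all(t == expected).item())
```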
Hi! I have been testing a PyTorch ProcessGroup backend with the MPIX Stream extension, with the goal of enqueuing collective operations onto a separate device stream to enable fine-grained overlap (just...
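A hedged sketch of the overlap pattern being described, using standard PyTorch stream primitives rather than the MPIX backend itself; the side-stream collective and the independent compute are illustrative placeholders:

```python
import torch
import torch.distributed as dist

def overlapped_step(grad, work_input):
    # Assumes torch.distributed is already initialized.
    comm_stream = torch.cuda.Stream()
    # Make the side stream wait until grad is fully produced.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        dist.all_reduce(grad)  # collective enqueued on the side stream
    # Independent compute on the default stream overlaps with the collective.
    out = work_input @ work_input.t()
    # Join: default stream waits for the collective before grad is reused.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return grad, out
```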