Lang Xu
Before loading omnitrace:

```
(gpt-neox-rocm5.6.0) langx@frontier07915:/lustre/orion/csc549/scratch/langx> python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch...
```
This PR:
- [x] fixed fused_rope naming in JIT compilation
- [x] added a readme for AMD support through fused_kernels

@Quentin-Anthony
- `--single` arg for running a single message size, mutually exclusive with `--scan`
- validation for allreduce through `--validate`, which can be run for `--trials` iterations (see the sketch below)
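A minimal sketch of how these flags might be wired together, assuming an argparse-based CLI; everything beyond the four flags named above (the parser description, defaults, help strings) is hypothetical:

```python
import argparse

def build_parser():
    # Hypothetical wiring of the flags described above.
    parser = argparse.ArgumentParser(description="allreduce benchmark (sketch)")
    # --single and --scan are mutually exclusive modes.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--scan", action="store_true",
                      help="sweep over a range of message sizes")
    mode.add_argument("--single", type=int, metavar="BYTES",
                      help="run a single message size")
    parser.add_argument("--validate", action="store_true",
                        help="check allreduce results for correctness")
    parser.add_argument("--trials", type=int, default=5,
                        help="number of validation iterations")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```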
This PR enables the PyTorch Profiler through a `--profile` flag; note that this adds observable overhead from the per-iteration `step()` calls. The output logs are saved under `communication/profiles`.
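A hedged sketch of what the profiled path might look like, assuming the benchmark loop calls `prof.step()` once per iteration (the source of the overhead noted above). The output directory matches the `communication/profiles` path mentioned; the function name, schedule values, and loop shape are illustrative, not the PR's actual code:

```python
import torch
import torch.distributed as dist
from torch.profiler import (ProfilerActivity, profile, schedule,
                            tensorboard_trace_handler)

def run_allreduce(tensor, iters, profile_enabled):
    # Assumes torch.distributed is already initialized.
    if not profile_enabled:
        for _ in range(iters):
            dist.all_reduce(tensor)
        return
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=tensorboard_trace_handler("communication/profiles"),
    ) as prof:
        for _ in range(iters):
            dist.all_reduce(tensor)
            prof.step()  # advances the profiler schedule; adds per-step overhead
```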
Hi! I am currently testing out [stream.cu](https://github.com/pmodels/mpich/blob/main/test/mpi/impls/mpich/cuda/stream.cu) and have the following output:

```
mpirun -l -np 2 -ppn 2 -genv MPIR_CVAR_CH4_ENABLE_STREAM_WORKQ=1 -genv MPIR_CVAR_GPU_HAS_WAIT_KERNEL=1 -genv MPIR_CVAR_ENABLE_GPU=1 -genv MPIR_CVAR_CH4_RESERVE_VCIS=1 -genv MPIR_CVAR_CH4_NUM_VCIS=2 ./stream...
```
# Summary

When using MPIX Stream enqueue APIs with MPICH ch4:ucx in a PyTorch ProcessGroup backend, we observe correctness failures for larger allreduce sizes unless we force a cudaStreamSynchronize after...
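A minimal sketch of the failure mode and workaround as they might look from the Python side, assuming the custom backend enqueues the collective onto its own device stream; `torch.cuda.synchronize()` here stands in for the `cudaStreamSynchronize` described above, and the function and parameter names are hypothetical:

```python
import torch
import torch.distributed as dist

def checked_allreduce(nbytes, force_sync=True):
    # Assumes torch.distributed is initialized with the stream-enqueuing backend.
    n = nbytes // 4
    t = torch.ones(n, dtype=torch.float32, device="cuda")
    dist.all_reduce(t)  # enqueued onto the backend's device stream
    if force_sync:
        # Workaround: block until all enqueued device work has completed.
        # Without this, the read below only orders against the default
        # stream, not the stream the collective was enqueued on.
        torch.cuda.synchronize()
    expected = float(dist.get_world_size())
    return bool(torch.all(t == expected).item())
```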
Hi! I have been testing a PyTorch ProcessGroup backend with the MPIX Stream extension, with the goal of enqueuing collective operations onto a separate device stream to enable fine-grained overlap (just...
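A hedged sketch of the overlap pattern being described, using standard PyTorch stream primitives rather than the MPIX backend itself; the side-stream collective and the independent compute are illustrative placeholders:

```python
import torch
import torch.distributed as dist

def overlapped_step(grad, work_input):
    # Assumes torch.distributed is already initialized.
    comm_stream = torch.cuda.Stream()
    # Make the side stream wait until grad is fully produced.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        dist.all_reduce(grad)  # collective enqueued on the side stream
    # Independent compute on the default stream overlaps with the collective.
    out = work_input @ work_input.t()
    # Join: default stream waits for the collective before grad is reused.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return grad, out
```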