Osayamen Aimuyo

Results 7 issues of Osayamen Aimuyo

Running `gdrcopy_pplat` fails with `Assertion "(cuStreamQuery(0)) == (CUDA_ERROR_NOT_READY)" failed at pplat.cu:257`. See complete logs below. ```Bash GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0001:00:00 GPU id:1; name: Tesla...

Hello! I am currently learning CUTLASS and cuBLASdx and I have a question. `multiblock_gemm.cu` only allows K that fits in smem. I believe it can be extended to larger K...

cuBLASdx

Context: see #1631. Change: adds an integral type trait for `integer_subbyte`.

inactive-30d

Bug: Set `CPM_USE_LOCAL_PACKAGES` and call `CPMAddPackage(...)` for a local package. This [path](https://github.com/cpm-cmake/CPM.cmake/blob/0bc73f41cedb561efe5643826891dcb705c680de/cmake/CPM.cmake#L724) gets called, but after `return()` all the exported env variables are lost and are not available in...

**Describe the bug** [sgemm_sm80](https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/sgemm_sm80.cu) fails to compile when dispatched to non-fp16 gemm_* functions. **Steps/Code to reproduce bug** Change TA = TB = float or any non-fp16 type. **Expected behavior** Code...

bug
? - Needs Triage

**Is your feature request related to a problem? Please describe.** Currently, converting from tf32 to f32 with round to nearest [dispatches](https://github.com/NVIDIA/cutlass/blob/5e497243f7ad13a2aa842143f9b10bbb23d98292/include/cutlass/numeric_conversion.h#L640) to a PTX `cvt` instruction only for sm90. **Describe...

feature request
? - Needs Triage
inactive-30d

I tried running the `all_to_all` benchmark as below ```bash torchrun --nproc-per-node=2 kernels/collectives/all_to_all/benchmark.py ``` It fails with the error `ModuleNotFoundError: No module named '_C'` even after installing with pip like so...