Natalia Gimelshein comments

Results: 214 comments by Natalia Gimelshein

nn.CosineSimilarity returns value larger than 1

I understand (although 1.10 didn't implement scipy exactly, it was still using `sum`s, not means), but currently we traded one set of inputs for another set of inputs, so it's...

nn.CosineSimilarity returns value larger than 1

But still, maybe at the end of the day we should just clamp outs so that they are no bigger than one. Gradient computation is a different matter, so still...
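The clamping idea can be illustrated with a minimal pure-Python sketch. This is not PyTorch's implementation; the function names and the `eps` guard are assumptions made for illustration only.

```python
import math


def clamp_unit(value):
    """Clamp a scalar into the closed interval [-1, 1]."""
    return max(-1.0, min(1.0, value))


def cosine_similarity_clamped(x, y, eps=1e-8):
    """Cosine similarity of two equal-length sequences, clamped to [-1, 1].

    `eps` is a hypothetical guard against division by zero; it is not
    taken from any particular library.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    raw = dot / max(norm_x * norm_y, eps)
    # Floating-point rounding can push `raw` slightly outside [-1, 1];
    # the clamp only affects the returned value, not any gradient logic.
    return clamp_unit(raw)
```

As the comment notes, clamping the output says nothing about how gradients should be computed; that would need separate treatment.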

Use official CUDAToolkit module in CMake

```
-- Compiler does not support SVE extension. Will not build perfkernels.
-- Found CUDAToolkit: /usr/local/cuda-12.8/include (found version "12.8.93")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
--...
```

Use official CUDAToolkit module in CMake

cudss is not a problem, misdetecting architecture (sm86+PTX instead of sm90) is

Use official CUDAToolkit module in CMake

Concretely, this is wrong in cmake

```
if(CUDA_LIMIT_GPU_ARCHITECTURE AND ITEM VERSION_GREATER_EQUAL CUDA_LIMIT_GPU_ARCHITECTURE)
  list(GET CUDA_COMMON_GPU_ARCHITECTURES -1 NEWITEM)
  string(APPEND CUDA_GPU_DETECT_OUTPUT_FILTERED " ${NEWITEM}")
else()
  string(APPEND CUDA_GPU_DETECT_OUTPUT_FILTERED " ${ITEM}")
endif()
```

as it either...

Use official CUDAToolkit module in CMake

Do you know how this "native" option would work later when we are checking if the build is ok for the current GPU to give a clear error message on...
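For context, a check of whether a build is OK for the current GPU could look roughly like the sketch below. It is not the actual PyTorch check. The function name is hypothetical, and it assumes plain `sm_XY` / `compute_XY` entries. Architecture-specific suffixes such as `sm_90a` are deliberately not handled. The rule encoded is the general CUDA one: `sm_XY` binaries run on devices with the same major version and an equal or higher minor version, while `compute_XY` PTX can be JIT-compiled for any device of equal or higher capability.

```python
import re

# Matches e.g. "sm_86", "compute_90", "sm_100" (major=10, minor=0).
_ARCH_RE = re.compile(r"^(sm|compute)_(\d+)(\d)$")


def build_supports_device(arch_list, capability):
    """Return True if any compiled arch entry can run on the device.

    arch_list:  strings such as ["sm_86", "compute_86"] (assumed format).
    capability: (major, minor) tuple of the device, e.g. (9, 0).
    """
    dev_major, dev_minor = capability
    for entry in arch_list:
        match = _ARCH_RE.match(entry)
        if match is None:
            raise ValueError(f"unsupported arch entry: {entry!r}")
        kind = match.group(1)
        major, minor = int(match.group(2)), int(match.group(3))
        if kind == "sm":
            # Binary code: same major version, minor not newer than the device.
            if major == dev_major and minor <= dev_minor:
                return True
        else:
            # PTX: forward-compatible with any equal or newer device.
            if (major, minor) <= (dev_major, dev_minor):
                return True
    return False
```

Under these assumptions, a build with only `sm_XY` binaries and no PTX cannot fall back to JIT on a newer-major GPU, so a check of this kind would need to know exactly which architectures a "native" build produced.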

Use official CUDAToolkit module in CMake

> @ngimel From nvcc documentation:
>
> ```
> When -arch=native is specified, nvcc detects the visible GPUs on the system and generates codes for them, no PTX program will...
> ```

[inductor] Cooperative reductions

btw is anyone working on adding cooperative launch to triton? Without it grid sync is unsafe - the threadblocks may not all be resident on the gpu and it'll be...

[inductor] Cooperative reductions

I'd say cooperative launch is needed before this can be turned on by default (it could deadlock without it); `tl.load` with `acquire` is nice to have, but it's just a perf optimization
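The deadlock concern comes down to residency: a grid-wide barrier is only safe when every threadblock can be resident on the GPU at once. A rough sketch of that condition is below. The function and parameter names are hypothetical. In practice the per-SM occupancy figure would come from the CUDA occupancy API rather than being passed in by hand.

```python
def grid_fits_on_device(num_blocks, num_sms, max_active_blocks_per_sm):
    """Return True if all blocks of a grid can be co-resident.

    num_blocks:               total threadblocks in the launch grid.
    num_sms:                  number of streaming multiprocessors.
    max_active_blocks_per_sm: occupancy limit for this kernel (assumed
                              to be supplied by the caller).
    """
    if min(num_blocks, num_sms, max_active_blocks_per_sm) < 0:
        raise ValueError("all arguments must be non-negative")
    # If the grid exceeds this capacity, some blocks wait to be scheduled
    # while resident blocks spin at the barrier, so nobody makes progress.
    return num_blocks <= num_sms * max_active_blocks_per_sm
```

A cooperative launch makes the runtime enforce this condition at launch time, instead of leaving it to the kernel author to guarantee.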

[inductor] Cooperative reductions

@jataylo so far this functionality is not turned on by default and is just exercised in tests, you might want to throw an error if it's manually turned on. The...